[logs] Re: regex-less parsing of messages

From: todd.glassey@private
Date: Wed Dec 07 2005 - 12:47:02 PST


 -------------- Original message ----------------------
From: Christina Noren <cfrln@private>
> Speaking from Splunk...
> 
> This problem of needing to build and maintain a big library of  
> regexes to analyze logs centrally is one we're trying to end run, so  
> thanks Todd for bringing us into the conversation.
> 
> We agree with Frank that getting common XML standards is pretty  
> unlikely across the broad range of log sources people need to correlate.

On the other hand, the real win is in standards that can be used to meet regulatory control requirements for SOX and other legal-process requirements.

> 
> We've instead built a series of universal processors that find and  
> normalize timestamps in any format, then tokenize everything in each  
> event, and classify new sources and events based on patterns and  
> grammatical structure in the event. We put off all of the semantics  
> till search time so we don't need to worry about mapping "deny"  
> "reject" and other variants of the same action to a common value. I'm  
> oversimplifying a more complex set of algorithms for the sake of a  
> short message.
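
The first two steps described above (find and normalize the timestamp,
then tokenize the rest of the event) can be sketched roughly as follows.
This is only an illustration and not Splunk's actual algorithm: the
format list, the default year, and the names split_timestamp, tokenize
and normalize are all assumptions made for the example. It also assumes
an English/C locale so that %b matches "Dec".

```python
from datetime import datetime

# Hypothetical candidate layouts: (strptime format, whitespace-separated
# fields consumed, whether the layout carries its own year).
CANDIDATE_FORMATS = [
    ("%Y-%m-%dT%H:%M:%S", 1, True),   # ISO-like, single field
    ("%Y-%m-%d %H:%M:%S", 2, True),   # date and time as two fields
    ("%b %d %H:%M:%S", 3, False),     # classic syslog, no year
]

def split_timestamp(line, default_year=2005):
    """Return (datetime or None, remainder of the line)."""
    fields = line.split()
    for fmt, n, has_year in CANDIDATE_FORMATS:
        candidate = " ".join(fields[:n])
        if not has_year:
            # Supply the missing year explicitly (assumed, not logged).
            candidate, fmt = f"{default_year} {candidate}", "%Y " + fmt
        try:
            stamp = datetime.strptime(candidate, fmt)
        except ValueError:
            continue
        return stamp, " ".join(fields[n:])
    return None, line

def tokenize(text, joiners="@._-"):
    """Lowercase tokens; letters, digits and joiner characters stay together."""
    tokens, current = [], []
    for ch in text:
        if ch.isalnum() or ch in joiners:
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    # Drop joiners left dangling at token edges, e.g. a trailing period.
    trimmed = (t.strip(joiners) for t in tokens)
    return [t for t in trimmed if t]

def normalize(line):
    """No field mapping here: meaning is left for search time."""
    stamp, rest = split_timestamp(line)
    return {"time": stamp, "tokens": tokenize(rest)}
```

Note that nothing in the sketch knows what "reject" or "relay" means;
it only produces a time and a bag of tokens, which is the point of
putting off the semantics.
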
> 
> Users are able to put in log sources we've never seen before and have  
> them handled by the same algorithms as everything else.
> 
> Then, instead of a structured relational db, we put everything into a  
> rich, dense search index behind a simple search interface that  
> provides results to most searches in seconds. This has the nice side  
> effect of making ad hoc access to the logs a lot easier than needing  
> to form a SQL style query.
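
For readers who have not built one, the "search index instead of a
relational db" idea can be illustrated with a toy inverted index. This
is a minimal sketch under my own assumptions, not a description of
Splunk's index: the EventIndex class, its methods, and the synonyms
argument are hypothetical names. It also shows how "deny" and "reject"
can be treated as the same action at search time rather than being
mapped to a common value at load time.

```python
from collections import defaultdict

class EventIndex:
    """Toy inverted index: token -> set of event ids."""

    def __init__(self):
        self.events = []                   # raw event text, by id
        self.postings = defaultdict(set)   # token -> ids containing it

    def add(self, text):
        event_id = len(self.events)
        self.events.append(text)
        for raw in text.lower().split():
            token = raw.strip("[]<>():,;=")
            if token:
                self.postings[token].add(event_id)
        return event_id

    def search(self, *terms, synonyms=None):
        """Ids of events containing every term (AND of the terms).

        synonyms is an optional query-time mapping such as
        {"deny": {"reject"}}; nothing is rewritten in the index.
        """
        synonyms = synonyms or {}
        result = None
        for term in terms:
            term = term.lower()
            hits = set()
            for variant in {term, *synonyms.get(term, ())}:
                hits |= self.postings.get(variant, set())
            result = hits if result is None else result & hits
        return sorted(result or [])
```

Usage, with made-up events:

    idx = EventIndex()
    idx.add("fw1 deny tcp 10.0.0.5")
    idx.add("fw2 reject udp 10.0.0.5")
    idx.search("deny", "10.0.0.5", synonyms={"deny": {"reject"}})
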
> 
> This works pretty well for use cases like tracing an email message  
> through different sendmail, antispam and other events and other  
> investigative/troubleshooting scenarios. There's really no reason to  
> write a regex to parse sendmail's different message formats into a  
> structured schema if you're going to search for an email address and  
> time, then follow that event based on message id and other content of  
> that event. We have some interesting accelerators for following the  
> correlation, like a "related" feature that looks for the connections  
> based on time and value.
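
One way to picture a "related" lookup based on time and value: start
from one event, then keep the events that fall inside a time window
and share at least one value with it. The sketch below is my own
guess at the general shape and not how Splunk implements the feature;
the related function, the five-minute default window, and the stop set
of uninteresting tokens are all assumptions for illustration.

```python
from datetime import datetime, timedelta

def related(events, seed, window=timedelta(minutes=5), stop=frozenset()):
    """Indexes of events near the seed in time that share a value with it.

    events is a list of (datetime, set_of_tokens) pairs.
    Results are ordered by number of shared tokens, then by position.
    """
    seed_time, seed_tokens = events[seed]
    seed_tokens = seed_tokens - stop   # ignore tokens too common to link on
    scored = []
    for i, (when, tokens) in enumerate(events):
        if i == seed or abs(when - seed_time) > window:
            continue
        shared = seed_tokens & tokens
        if shared:
            scored.append((len(shared), i))
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

In the sendmail case, the seed would be the event found by searching
for the email address and time, and the shared value that carries the
trace forward would typically be the message id.
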
> 
> - Christina
> 
> p.s. you can download Splunk free at www.splunk.com
> 
> 
> 
> On Dec 6, 2005, at 8:13 AM, todd.glassey@private wrote:
> 
> > We use SPLUNK for exactly this.
> >
> > Todd
> >  -------------- Original message ----------------------
> > From: "Solomon, Frank" <frank@private>
> >
> >> Jason, your example certainly struck a chord.  We haven't even begun to
> >> put our mail logs into our central log server because of the technical
> >> challenges that would pose.  And yet, we get asked the same sort of
> >> questions which require a highly trained person to probe through the
> >> heterogeneous mail log files and trace the path of some errant envelope
> >> that may or may not actually exist.  It is not pretty; part of the price
> >> we pay for having to accommodate multiple mail systems, vendors and
> >> standards.
> >>
> >> Our standing joke is:  "That's the nice thing about standards, there are
> >> so many to choose from and everyone can have their own."  So, "sendmail"
> >> has its "standard" log format and "Exchange" has its "standard" log
> >> format, and "Novell" has its "standard" log format, etc.  I saw an
> >> article recently describing the new "logging standard" that Microsoft
> >> was about to introduce in their latest OS.  Well that will certainly
> >> clear things up!  I'm sure all their competitors will rush to implement
> >> compatible systems.  Don't get me wrong, I laud Microsoft's attempt to
> >> enforce programmer discipline.
> >>
> >> In case you're interested in the MS stuff:
> >> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/wes/wes/about_the_windows_event_log.asp
> >>
> >> <dreaming>
> >> Certainly, the first challenge in being able to analyze data is getting
> >> it into a common format with a common symbolic representation of the
> >> underlying information.  Since we cannot count on the energy and
> >> discipline of the programmers that write the log-generating programs,
> >> that energy must be invested in and discipline must be enforced by the
> >> log collection mechanism.  It's becoming obvious to me that the blanket
> >> approach of collecting everything on the off chance that some auditor or
> >> forensic specialist in the future might be able to make sense of it, is
> >> a waste of resources.  That implies that the requirements for what needs
> >> to be logged could be set at the collecting end and that somehow those
> >> requirements need to be communicated to the source of the messages to
> >> make sure that the required messages exist and are coded appropriately
> >> (which they won't be).
> >> </dreaming>
> >>
> >> I know, I'm dreaming: there's no choice but to continue to collect tons
> >> of ore and hope to glean an ounce of silver from it every once in a
> >> while.  And besides, those old log CDs make nifty tree ornaments.
> >>
> >> John Moehrke mentioned that his organization was making the attempt to
> >> define the standards for the events at the beginning.  To quote:  "We
> >> thus will be sending the experts in log analysis an already manageable
> >> format."  That's a great idea, but it suffers from the same standards
> >> problem I've mentioned:  everybody's likely to have their own (maybe
> >> someday the only industry will be healthcare, but not yet).  And after
> >> looking at the RFC, I can't imagine that good things will come of the
> >> burden this will place on the infrastructure if the logging rate is very
> >> high.  Can you imagine the "sendmail" guys wrapping xml around the mail
> >> logs?  Or, all the mail system vendors agreeing on a common xml schema
> >> for their mail logs?  Yeah, it might happen.
> >>
> >> Personally, I'm glad that syslog uses udp.
> >>
> >> Sorry, I've rambled entirely too long, I'll go back to merely listening.
> >>
> >> Frank Solomon
> >> University of Kentucky
> >> Lead Systems Programmer, Enterprise Systems
> >> http://www.franksolomon.net
> >> "If you give someone a program, you will frustrate them for a day; if
> >> you teach them how to program, you will frustrate them for a lifetime."
> >> --Anonymous
> >>
> >>
> >> -----Original Message-----
> >> [mailto:loganalysis-bounces+sysfrank=uky.edu@private] On Behalf Of Jason Haar
> >> Sent: Monday, December 05, 2005 3:15 PM
> >>
> >> . . .snip. . .
> >>
> >> Boring, everyday example:  These days (due to the horrors of antispam
> >> systems) internal users routinely ring the helpdesk and ask "Customer YY
> >> sent me an email and I never got it. What happened?". To figure that out
> >> involves converting what you can learn about customer YY into DNS
> >> records and IP addresses, then tracking any related connections as they
> >> hit the edge of our Internet link, where they first meet our RBL checks,
> >> then flow through AV and antispam systems, then through a couple more
> >> internal mail relays before hitting our end mail servers. We have logs
> >> all merged together from all those systems, but frankly, I am still the
> >> only one who can link all those events together. And my attempts at
> >> turning that eyeballing into a program have failed so far. And that's
> >> only one example.
> >>
> >> . . .
> >> _______________________________________________
> >> LogAnalysis mailing list
> >> LogAnalysis@private
> >> http://lists.shmoo.com/mailman/listinfo/loganalysis
> >>



This archive was generated by hypermail 2.1.3 : Wed Dec 07 2005 - 18:23:18 PST