[logs] Re: regex-less parsing of messages

From: Christina Noren (cfrln@private)
Date: Tue Dec 06 2005 - 18:26:49 PST


Speaking from Splunk...

This problem of needing to build and maintain a big library of
regexes to analyze logs centrally is one we're trying to do an end
run around, so thanks, Todd, for bringing us into the conversation.

We agree with Frank that getting common XML standards is pretty  
unlikely across the broad range of log sources people need to correlate.

We've instead built a series of universal processors that find and
normalize timestamps in any format, then tokenize everything in each
event, and classify new sources and events based on patterns and
grammatical structure in the event. We put off all of the semantics
until search time, so we don't need to worry about mapping "deny",
"reject" and other variants of the same action to a common value up
front. I'm oversimplifying a more complex set of algorithms for the
sake of a short message.
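
As a rough illustration of that pipeline (a minimal sketch, not Splunk's
actual algorithms), the Python below tries a handful of timestamp layouts,
tokenizes the rest of each event, and only expands "deny"-style synonyms
when a search is run. The layout list, the SYNONYMS table, the sample log
lines and all function names are assumptions made up for the example.

```python
import re
from datetime import datetime

# Hypothetical layouts: (pattern that locates the stamp, strptime format, year missing?)
TIMESTAMP_LAYOUTS = [
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"), "%Y-%m-%d %H:%M:%S", False),
    (re.compile(r"[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}"), "%b %d %H:%M:%S", True),
    (re.compile(r"\d{2}/[A-Z][a-z]{2}/\d{4}:\d{2}:\d{2}:\d{2}"), "%d/%b/%Y:%H:%M:%S", False),
]
TOKEN_RE = re.compile(r"[a-z0-9_@.\-]+")

# Meaning is attached at search time only; this table is an invented example.
SYNONYMS = {"deny": {"deny", "denied", "reject", "rejected", "drop"}}


def find_timestamp(line, default_year=2005):
    """Return (datetime, (start, end)) for the first layout that matches, else (None, None)."""
    for pattern, fmt, needs_year in TIMESTAMP_LAYOUTS:
        match = pattern.search(line)
        if match:
            text = match.group(0)
            if needs_year:  # syslog-style stamps carry no year, so supply one
                text, fmt = f"{default_year} {text}", "%Y " + fmt
            return datetime.strptime(text, fmt), match.span()
    return None, None


def parse_event(line, default_year=2005):
    """Normalize the timestamp and tokenize everything else; no per-source schema."""
    when, span = find_timestamp(line, default_year)
    rest = line if span is None else line[:span[0]] + " " + line[span[1]:]
    tokens = [t.strip(".-") for t in TOKEN_RE.findall(rest.lower())]
    return {"time": when, "tokens": [t for t in tokens if t], "raw": line}


def search(events, term):
    """Expand the search term into its variants at query time, then match tokens."""
    wanted = SYNONYMS.get(term, {term})
    return [e for e in events if wanted & set(e["tokens"])]


SAMPLE = [
    "2005-12-06 08:13:02 fw1 action=deny src=10.0.0.5",
    "Dec  6 08:13:07 mx1 sendmail[311]: reject: RCPT from host.example",
    "06/Dec/2005:08:13:09 proxy1 GET /index.html 200",
]
events = [parse_event(line) for line in SAMPLE]
hits = search(events, "deny")  # matches the firewall and the sendmail event
```

Because the synonym table is consulted only inside search(), adding a new
variant of "deny" later does not require re-parsing anything already stored.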

Users are able to put in log sources we've never seen before and have  
them handled by the same algorithms as everything else.

Then, instead of a structured relational db, we put everything into a
rich, dense search index behind a simple search interface that
returns results for most searches in seconds. This has the nice side
effect of making ad hoc access to the logs a lot easier than having
to write a SQL-style query.
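
To show the general idea of a search index in place of a relational schema,
here is a minimal inverted-index sketch in Python. It is only an
illustration under simple assumptions (whitespace/punctuation tokenizing,
AND-only queries); the EventIndex class, the tokenize helper and the sample
log lines are hypothetical and say nothing about how Splunk's index is
really built.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9_@.\-]+")


def tokenize(text):
    """Lowercase and split on anything that is not part of a word, address or number."""
    tokens = (t.strip(".-") for t in TOKEN_RE.findall(text.lower()))
    return [t for t in tokens if t]


class EventIndex:
    """Toy inverted index: token -> set of ids of the events that contain it."""

    def __init__(self):
        self.events = []
        self.postings = defaultdict(set)

    def add(self, raw):
        event_id = len(self.events)
        self.events.append(raw)
        for token in tokenize(raw):
            self.postings[token].add(event_id)
        return event_id

    def search(self, query):
        """Return the events that contain every term in the query."""
        terms = tokenize(query)
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.events[i] for i in sorted(ids)]


index = EventIndex()
index.add("Dec  6 08:13:07 mx1 sendmail[311]: k9A1: from=<yy@customer.example>")
index.add("Dec  6 08:13:09 spam1 filter: k9A1 score=7.2 action=reject")
index.add("Dec  6 08:14:00 fw1 action=deny src=10.0.0.5")

by_id_and_action = index.search("k9A1 reject")      # the antispam event only
by_address = index.search("yy@customer.example")    # the sendmail event only
```

The query is just a few words typed as-is; no table or column names need to
be known in advance, which is what makes ad hoc access easy.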

This works pretty well for use cases like tracing an email message
through sendmail, antispam and other events, and for other
investigative/troubleshooting scenarios. There's really no reason to
write a regex to parse sendmail's different message formats into a
structured schema if you're going to search for an email address and
time, then follow that event based on message id and other content of
that event. We have some interesting accelerators for following the
correlation, like a "related" feature that looks for connections
based on time and value.
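
The "search, then follow" workflow can be sketched roughly as below. This
is a guess at the general shape of a time-and-value correlation, not the
actual "related" feature: the rule that a linking value is any token
containing a digit, "@" or ".", the five-minute window, the sample events
and the id "k9A1" are all assumptions invented for the example.

```python
import re
from datetime import datetime, timedelta

TOKEN_RE = re.compile(r"[a-z0-9_@.\-]+")


def make_event(time_text, raw):
    """Build an event with a parsed time and a set of lowercase tokens."""
    tokens = {t.strip(".-") for t in TOKEN_RE.findall(raw.lower())}
    tokens.discard("")
    when = datetime.strptime(time_text, "%Y-%m-%d %H:%M:%S")
    return {"time": when, "raw": raw, "tokens": tokens}


def link_values(event):
    """Assumed heuristic: ids, addresses and hostnames contain a digit, '@' or '.'."""
    return {t for t in event["tokens"]
            if "@" in t or "." in t or any(c.isdigit() for c in t)}


def related(seed, events, window=timedelta(minutes=5)):
    """Events close to the seed in time that share at least one linking value."""
    keys = link_values(seed)
    return [e for e in events
            if e is not seed
            and abs(e["time"] - seed["time"]) <= window
            and keys & e["tokens"]]


def trace(seed, events, window=timedelta(minutes=5)):
    """Follow 'related' links outward from the seed until nothing new turns up."""
    found, queue = [seed], [seed]
    while queue:
        current = queue.pop(0)
        for event in related(current, events, window):
            if event not in found:
                found.append(event)
                queue.append(event)
    return sorted(found, key=lambda e: e["time"])


EVENTS = [
    make_event("2005-12-06 08:13:01", "edge1 connect from gw.customer.example"),
    make_event("2005-12-06 08:13:07",
               "mx1 sendmail: k9A1 from=<yy@customer.example> relay=gw.customer.example"),
    make_event("2005-12-06 08:13:09", "spam1 filter: k9A1 score=7.2 action=reject"),
    make_event("2005-12-06 08:14:00", "fw1 action=deny src=10.0.0.5"),
    make_event("2005-12-06 09:30:00", "mx1 sendmail: k9B7 from=<other@else.example>"),
]

# Start from a plain search for the sender's address, then follow the links.
seed = next(e for e in EVENTS if "yy@customer.example" in e["tokens"])
path = trace(seed, EVENTS)
hops = [e["raw"].split()[0] for e in path]  # ['edge1', 'mx1', 'spam1']
```

The later mx1 event shares a hostname with the seed but falls outside the
time window, and the firewall event shares no linking value, so neither is
pulled into the path.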

- Christina

p.s. you can download Splunk free at www.splunk.com



On Dec 6, 2005, at 8:13 AM, todd.glassey@private wrote:

> We use SPLUNK for exactly this.
>
> Todd
>  -------------- Original message ----------------------
> From: "Solomon, Frank" <frank@private>
>
>> Jason, your example certainly struck a chord.  We haven't even begun to
>> put our mail logs into our central log server because of the technical
>> challenges that would pose.  And yet, we get asked the same sort of
>> questions which require a highly trained person to probe through the
>> heterogeneous mail log files and trace the path of some errant envelope
>> that may or may not actually exist.  It is not pretty; part of the price
>> we pay for having to accommodate multiple mail systems, vendors and
>> standards.
>>
>> Our standing joke is:  "That's the nice thing about standards, there are
>> so many to choose from and everyone can have their own."  So, "sendmail"
>> has its "standard" log format and "Exchange" has its "standard" log
>> format, and "Novell" has its "standard" log format, etc.  I saw an
>> article recently describing the new "logging standard" that Microsoft
>> was about to introduce in their latest OS.  Well that will certainly
>> clear things up!  I'm sure all their competitors will rush to implement
>> compatible systems.  Don't get me wrong, I laud Microsoft's attempt to
>> enforce programmer discipline.
>>
>> In case you're interested in the MS stuff:
>> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/wes/wes/about_the_windows_event_log.asp
>>
>> <dreaming>
>> Certainly, the first challenge in being able to analyze data is getting
>> it into a common format with a common symbolic representation of the
>> underlying information.  Since we cannot count on the energy and
>> discipline of the programmers that write the log-generating programs,
>> that energy must be invested in and discipline must be enforced by the
>> log collection mechanism.  It's becoming obvious to me that the blanket
>> approach of collecting everything on the off chance that some auditor or
>> forensic specialist in the future might be able to make sense of it, is
>> a waste of resources.  That implies that the requirements for what needs
>> to be logged could be set at the collecting end and that somehow those
>> requirements need to be communicated to the source of the messages to
>> make sure that the required messages exist and are coded appropriately
>> (which they won't be).
>> </dreaming>
>>
>> I know, I'm dreaming: there's no choice but to continue to collect tons
>> of ore and hope to glean an ounce of silver from it every once in a
>> while.  And besides, those old log CDs make nifty tree ornaments.
>>
>> John Moehrke mentioned that his organization was making the attempt to
>> define the standards for the events at the beginning.  To quote:  "We
>> thus will be sending the experts in log analysis an already manageable
>> format."  That's a great idea, but it suffers from the same standards
>> problem I've mentioned:  everybody's likely to have their own (maybe
>> someday the only industry will be healthcare, but not yet).  And after
>> looking at the RFC, I can't imagine that good things will come of the
>> burden this will place on the infrastructure if the logging rate is very
>> high.  Can you imagine the "sendmail" guys wrapping xml around the mail
>> logs?  Or, all the mail system vendors agreeing on a common xml schema
>> for their mail logs?  Yeah, it might happen.
>>
>> Personally, I'm glad that syslog uses udp.
>>
>> Sorry, I've rambled entirely too long, I'll go back to merely listening.
>>
>> Frank Solomon
>> University of Kentucky
>> Lead Systems Programmer, Enterprise Systems
>> http://www.franksolomon.net
>> "If you give someone a program, you will frustrate them for a day; if
>> you teach them how to program, you will frustrate them for a lifetime."
>> --Anonymous
>>
>>
>> -----Original Message-----
>> [mailto:loganalysis-bounces+sysfrank=uky.edu@private] On Behalf Of Jason Haar
>> Sent: Monday, December 05, 2005 3:15 PM
>>
>> . . .snip. . .
>>
>> Boring, everyday example:  These days (due to the horrors of antispam
>> systems) internal users routinely ring the helpdesk and ask "Customer YY
>> sent me an email and I never got it. What happened?". To figure that out
>> involves converting what you can learn about customer YY into DNS
>> records and IP addresses, then tracking any related connections as they
>> hit the edge of our Internet link, where they first meet our RBL checks,
>> then flow through AV and antispam systems, then through a couple more
>> internal mail relays before hitting our end mail servers. We have logs
>> all merged together from all those systems, but frankly, I am still the
>> only one who can link all those events together. And my attempts at
>> turning that eyeballing into a program have failed so far. And that's
>> only one example.
>>
>> . . .
>> _______________________________________________
>> LogAnalysis mailing list
>> LogAnalysis@private
>> http://lists.shmoo.com/mailman/listinfo/loganalysis
>>




This archive was generated by hypermail 2.1.3 : Tue Dec 06 2005 - 18:54:25 PST