[logs] Re: regex-less parsing of messages

From: Edward Sargisson (edward.j.sargisson@private)
Date: Wed Dec 07 2005 - 13:06:47 PST


While we're discussing XML formats, I thought I'd mention some experience 
I've had. I've been working with IBM's Common Base Event (CBE) XML format 
[1], which has been morphed into the OASIS standard WSDM Management Using 
Web Services v1.0 (WSDM-MUWS) [2].

It's been fairly useful to us because the Eclipse client (using Eclipse 
TPTP) has good support for the format, which made implementing a viewer 
very easy.

We use Visual Basic to generate the XML straight from the source (as the 
source app is in Visual Basic - grrr) and send the message over WebSphere 
MQ to a J2EE message-driven bean, which writes the event to the database 
and publishes it to any subscribers using Publish/Subscribe.

The CBE format holds most of the common information you'd expect 
(creation time, severity, priority, source, etc.) and has placeholders for 
arbitrary XML data. It doesn't fully solve the problem in this thread, 
though: representing the actual event contents (as opposed to creation 
time, severity, etc.) in a standard way to allow analysis.
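
As a rough illustration of the kind of event described above, here is a 
minimal Python sketch (not the Visual Basic generator mentioned earlier) 
that builds a CBE-like XML event with the common fields plus a placeholder 
for arbitrary data. The element and attribute names, the build_event 
helper and the sample values are assumptions for illustration; they are 
only loosely modeled on CBE and have not been checked against the schema.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def build_event(severity, priority, component, message, extra=None):
    """Build a CBE-like event. Names are illustrative, not schema-checked."""
    event = ET.Element("CommonBaseEvent", {
        "creationTime": datetime.now(timezone.utc).isoformat(),
        "severity": str(severity),
        "priority": str(priority),
        "msg": message,
    })
    ET.SubElement(event, "sourceComponentId", {"component": component})
    # Placeholder for arbitrary, application-specific data: the part that
    # has no standard representation.
    for name, value in (extra or {}).items():
        data = ET.SubElement(event, "extendedDataElements", {"name": name})
        ET.SubElement(data, "values").text = str(value)
    return event


event = build_event(30, 50, "OrderEntry", "Order rejected",
                    {"orderId": 1234, "reason": "credit limit"})
xml_text = ET.tostring(event, encoding="unicode")
print(xml_text)
```

The common fields are easy to agree on; whatever goes into the 
placeholder elements is up to each application, which is the part that 
still has to be interpreted before any analysis.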

Links:
[1] Search www.ibm.com/developerworks for Common Base Event or start at 
http://www-128.ibm.com/developerworks/webservices/library/specification/ws-cbe/
[2] http://www.oasis-open.org/specs/index.php



Edward Sargisson BSc, BCom
Consultant
IBM Business Consulting Services
Wellington, New Zealand
DDI: + 64-4-462-3586, Mob: + 64-21-254-8927
P O Box 38 993, Wellington, NEW ZEALAND
edward.j.sargisson@private





From: todd.glassey@private
Sent by: loganalysis-bounces+edward.j.sargisson=nz1.ibm.com@private
Date: 08/12/2005 05:37
To: Christina Noren <cfrln@private>, LogAnalysis@private
Subject: [logs] Re: regex-less parsing of messages






Christina - FYI
I am working on a log management practice statement for the use of SPLUNK 
to address log management issues inside of ITIL, COBIT v4, and the updated 
ISO17799/20001:2005 documents. The intent, for my current client, is to be 
able to use SPLUNK Pro as the basis of a log management and event 
detection regimen for their automated and periodic controls.

This makes SPLUNK Pro totally good to go for meeting SOX and compliance in 
2CFR/211CFR type environments.

Todd
 -------------- Original message ----------------------
From: Christina Noren <cfrln@private>
> Speaking from Splunk...
> 
> This problem of needing to build and maintain a big library of 
> regexes to analyze logs centrally is one we're trying to end run, so 
> thanks Todd for bringing us into the conversation.
> 
> We agree with Frank that getting common XML standards is pretty 
> unlikely across the broad range of log sources people need to correlate.
> 
> We've instead built a series of universal processors that find and 
> normalize timestamps in any format, then tokenize everything in each 
> event, and classify new sources and events based on patterns and 
> grammatical structure in the event. We put off all of the semantics 
> till search time so we don't need to worry about mapping "deny" 
> "reject" and other variants of the same action to a common value. I'm 
> oversimplifying a more complex set of algorithms for the sake of a 
> short message.
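
To make the idea in the paragraph above concrete, here is a small Python 
sketch of generic tokenizing with the semantics deferred to search time. 
It is not Splunk's algorithm, which is only outlined in the message; the 
function names, the token rule, the synonym table and the sample log 
lines are all assumptions made up for this example. Timestamp 
normalization is left out.

```python
import re

# Hypothetical search-time synonym table; nothing is rewritten at index time.
ACTION_SYNONYMS = {"deny": {"deny", "denied", "reject", "rejected", "drop"}}


def tokenize(event):
    """Split a raw event into lowercase tokens, keeping addresses and IPs whole."""
    return re.findall(r"[a-z0-9_@.:-]+", event.lower())


def matches(event, term):
    """Expand the search term at search time and test it against the tokens."""
    wanted = ACTION_SYNONYMS.get(term, {term})
    return any(tok.strip(".:") in wanted for tok in tokenize(event))


events = [
    "Dec  6 08:13:01 fw1 kernel: DROP src=10.0.0.5 dst=192.0.2.9",
    "Dec  6 08:13:02 mx1 sendmail[411]: ruleset=check_rcpt, reject=550 5.7.1",
    "Dec  6 08:13:03 mx1 sendmail[412]: stat=Sent",
]
hits = [e for e in events if matches(e, "deny")]
for hit in hits:
    print(hit)
```

The single pattern here is one token rule applied to every source, 
rather than a regex per message format; the mapping of "drop" and 
"reject" onto "deny" happens only when someone searches.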
> 
> Users are able to put in log sources we've never seen before and have 
> them handled by the same algorithms as everything else.
> 
> Then, instead of a structured relational db, we put everything into a 
> rich, dense search index behind a simple search interface that 
> provides results to most searches in seconds. This has the nice side 
> effect of making ad hoc access to the logs a lot easier than needing 
> to form a SQL style query.
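
The contrast with a relational schema can be sketched with a toy 
inverted index in Python. This illustrates the general technique (token 
to event-id postings) under my own assumptions; it is not a description 
of Splunk's actual index, and the EventIndex class and sample events are 
invented for the example.

```python
from collections import defaultdict


class EventIndex:
    """Toy inverted index: token -> set of event ids. No schema is required."""

    def __init__(self):
        self.events = []
        self.postings = defaultdict(set)

    def add(self, raw):
        """Store the raw event and index every whitespace-separated token."""
        event_id = len(self.events)
        self.events.append(raw)
        for token in raw.lower().replace("=", " ").split():
            self.postings[token.strip(",:<>")].add(event_id)
        return event_id

    def search(self, query):
        """Return events containing every term in the query (AND search)."""
        terms = query.lower().split()
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.events[i] for i in sorted(ids)]


index = EventIndex()
index.add("sendmail[411]: k1A2: from=<alice@example.com>, relay=mx1")
index.add("sendmail[411]: k1A2: to=<bob@example.org>, stat=Sent")
index.add("spamd[90]: result: clean for bob@example.org")
results = index.search("bob@example.org sendmail[411]")
print(results)
```

A new log source needs no table definition: its events are tokenized 
and added like any others, and an ad hoc query is just a list of terms.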
> 
> This works pretty well for use cases like tracing an email message 
> through sendmail, antispam and other systems' events, and for other 
> investigative/troubleshooting scenarios. There's really no reason to 
> write a regex to parse sendmail's different message formats into a 
> structured schema if you're going to search for an email address and 
> time, then follow that event based on message id and other content of 
> that event. We have some interesting accelerators for following the 
> correlation, like a "related" feature that looks for the connections 
> based on time and value.
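
The workflow in the paragraph above (find an event by address, then 
follow it through shared values within a time window) might look roughly 
like this in Python. The related function is a guess at the general idea 
only, not Splunk's feature; the sample events, the 60-second window and 
the rule for what counts as a value (a token containing a digit, after 
the host and program fields) are assumptions for illustration.

```python
from datetime import datetime, timedelta

# (timestamp, raw event) pairs, as if timestamps were already normalized.
EVENTS = [
    (datetime(2005, 12, 6, 8, 13, 1),
     "mx1 sendmail: k1A2 from=alice@example.com msgid=20051206.1@example.com"),
    (datetime(2005, 12, 6, 8, 13, 4),
     "av1 scanner: clean msgid=20051206.1@example.com"),
    (datetime(2005, 12, 6, 8, 13, 9),
     "mx2 sendmail: q9Z8 msgid=20051206.1@example.com stat=Sent"),
    (datetime(2005, 12, 6, 8, 13, 5),
     "mx1 sendmail: k1A3 from=carol@example.net msgid=77@example.net"),
    (datetime(2005, 12, 6, 11, 0, 0),
     "mx1 sendmail: k1A2 reused queue id"),
]


def values(raw):
    """Identifier-like tokens: anything with a digit, skipping host and program."""
    tokens = raw.replace("=", " ").split()[2:]
    return {t for t in tokens if any(c.isdigit() for c in t)}


def related(seed, events, window=timedelta(seconds=60)):
    """Events near the seed in time that share at least one value with it."""
    seed_time, seed_raw = seed
    seed_values = values(seed_raw)
    return [
        (ts, raw) for ts, raw in events
        if (ts, raw) != seed
        and abs(ts - seed_time) <= window
        and seed_values & values(raw)
    ]


seed = next(e for e in EVENTS if "alice@example.com" in e[1])
trail = sorted(related(seed, EVENTS))
for ts, raw in trail:
    print(ts.isoformat(), raw)
```

Both conditions matter in this sketch: the unrelated message from the 
same relay is dropped because it shares no value with the seed, and the 
reused queue id is dropped because it falls outside the time window.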
> 
> - Christina
> 
> p.s. you can download Splunk free at www.splunk.com
> 
> 
> 
> On Dec 6, 2005, at 8:13 AM, todd.glassey@private wrote:
> 
> > We use SPLUNK for exactly this.
> >
> > Todd
> >  -------------- Original message ----------------------
> > From: "Solomon, Frank" <frank@private>
> >
> >> Jason, your example certainly struck a chord.  We haven't even begun to
> >> put our mail logs into our central log server because of the technical
> >> challenges that would pose.  And yet, we get asked the same sort of
> >> questions which require a highly trained person to probe through the
> >> heterogeneous mail log files and trace the path of some errant envelope
> >> that may or may not actually exist.  It is not pretty; part of the price
> >> we pay for having to accommodate multiple mail systems, vendors and
> >> standards.
> >>
> >> Our standing joke is:  "That's the nice thing about standards, there are
> >> so many to choose from and everyone can have their own."  So, "sendmail"
> >> has its "standard" log format and "Exchange" has its "standard" log
> >> format, and "Novell" has its "standard" log format, etc.  I saw an
> >> article recently describing the new "logging standard" that Microsoft
> >> was about to introduce in their latest OS.  Well that will certainly
> >> clear things up!  I'm sure all their competitors will rush to implement
> >> compatible systems.  Don't get me wrong, I laud Microsoft's attempt to
> >> enforce programmer discipline.
> >>
> >> In case you're interested in the MS stuff:
> >> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/wes/wes/about_the_windows_event_log.asp
> >>
> >> <dreaming>
> >> Certainly, the first challenge in being able to analyze data is getting
> >> it into a common format with a common symbolic representation of the
> >> underlying information.  Since we cannot count on the energy and
> >> discipline of the programmers that write the log-generating programs,
> >> that energy must be invested in and discipline must be enforced by the
> >> log collection mechanism.  It's becoming obvious to me that the blanket
> >> approach of collecting everything on the off chance that some auditor or
> >> forensic specialist in the future might be able to make sense of it, is
> >> a waste of resources.  That implies that the requirements for what needs
> >> to be logged could be set at the collecting end and that somehow those
> >> requirements need to be communicated to the source of the messages to
> >> make sure that the required messages exist and are coded appropriately
> >> (which they won't be).
> >> </dreaming>
> >>
> >> I know, I'm dreaming: there's no choice but to continue to collect tons
> >> of ore and hope to glean an ounce of silver from it every once in a
> >> while.  And besides, those old log CDs make nifty tree ornaments.
> >>
> >> John Moehrke mentioned that his organization was making the attempt to
> >> define the standards for the events at the beginning.  To quote:  "We
> >> thus will be sending the experts in log analysis an already manageable
> >> format."  That's a great idea, but it suffers from the same standards
> >> problem I've mentioned:  everybody's likely to have their own (maybe
> >> someday the only industry will be healthcare, but not yet).  And after
> >> looking at the RFC, I can't imagine that good things will come of the
> >> burden this will place on the infrastructure if the logging rate is very
> >> high.  Can you imagine the "sendmail" guys wrapping xml around the mail
> >> logs?  Or, all the mail system vendors agreeing on a common xml schema
> >> for their mail logs?  Yeah, it might happen.
> >>
> >> Personally, I'm glad that syslog uses udp.
> >>
> >> Sorry, I've rambled entirely too long, I'll go back to merely listening.
> >>
> >> Frank Solomon
> >> University of Kentucky
> >> Lead Systems Programmer, Enterprise Systems
> >> http://www.franksolomon.net
> >> "If you give someone a program, you will frustrate them for a day; if
> >> you teach them how to program, you will frustrate them for a lifetime."
> >> --Anonymous
> >>
> >>
> >> -----Original Message-----
> >> [mailto:loganalysis-bounces+sysfrank=uky.edu@private] On Behalf Of
> >> Jason Haar
> >> Sent: Monday, December 05, 2005 3:15 PM
> >>
> >> . . .snip. . .
> >>
> >> Boring, everyday example:  These days (due to the horrors of antispam
> >> systems) internal users routinely ring the helpdesk and ask "Customer YY
> >> sent me an email and I never got it. What happened?". To figure that out
> >> involves converting what you can learn about customer YY into DNS
> >> records and IP addresses, then tracking any related connections as they
> >> hit the edge of our Internet link, where they first meet our RBL checks,
> >> then flow through AV and antispam systems, then through a couple more
> >> internal mail relays before hitting our end mail servers. We have logs
> >> all merged together from all those systems, but frankly, I am still the
> >> only one who can link all those events together. And my attempts at
> >> turning that eyeballing into a program have failed so far. And that's
> >> only one example.
> >>
> >> . . .
> >> _______________________________________________
> >> LogAnalysis mailing list
> >> LogAnalysis@private
> >> http://lists.shmoo.com/mailman/listinfo/loganalysis
> >>
> 





This archive was generated by hypermail 2.1.3 : Wed Dec 07 2005 - 18:17:57 PST