[logs] Re: regex-less parsing of messages

John.Moehrke@private

Healthcare has decided to define its security audit events and an XML
schema to describe them. We will thus be sending the log analysis
experts an already manageable format. We figure that this is a great
opportunity for us to keep focusing on what we do best (healthcare
algorithms), while you focus on what you do best (analyzing logs).

The basis of our XML schema can be found in RFC 3881. It was further
defined by one of our healthcare-specific standards groups (DICOM), and
further still in Profiles from a customer/vendor organization,
"Integrating the Healthcare Enterprise" (IHE). The details are buried
inside large documents, as that happens to be the way these groups
publish their normative documents. They are all freely available.
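
To make the "already manageable format" point concrete, here is a rough
sketch of tokenizing such a message with a stock XML parser and no
regexes at all. The element and attribute names follow my reading of
RFC 3881 and should be checked against the actual schema; the sample
message, its values, and the function name tokenize_audit_message are
made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample message; values are invented, and the element and
# attribute names are assumed from RFC 3881, not copied from a real log.
SAMPLE = """<AuditMessage>
  <EventIdentification EventActionCode="R"
                       EventDateTime="2005-12-05T09:30:00Z"
                       EventOutcomeIndicator="0">
    <EventID code="110110" codeSystemName="DCM" displayName="Patient Record"/>
  </EventIdentification>
  <ActiveParticipant UserID="jdoe" UserIsRequestor="true"
                     NetworkAccessPointTypeCode="2"
                     NetworkAccessPointID="192.0.2.10"/>
  <AuditSourceIdentification AuditSourceID="ward-3-workstation"/>
</AuditMessage>"""


def tokenize_audit_message(xml_text):
    """Pull a flat dict of fields out of one audit message."""
    root = ET.fromstring(xml_text)
    event = root.find("EventIdentification")
    return {
        "action": event.get("EventActionCode"),
        "time": event.get("EventDateTime"),
        "outcome": event.get("EventOutcomeIndicator"),
        "event_code": event.find("EventID").get("code"),
        "source": root.find("AuditSourceIdentification").get("AuditSourceID"),
        # A message may carry several participants, so collect them all.
        "users": [p.get("UserID") for p in root.findall("ActiveParticipant")],
    }


fields = tokenize_audit_message(SAMPLE)
print(fields["action"], fields["users"], fields["source"])
```

The point is only that every field arrives already named, so the
consumer never has to guess which token is the user and which is the
host.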

We have had some problems finding a good transport. We have specified
BSD-SYSLOG and Reliable-SYSLOG-COOKED. The first problem is that our
messages are often 1-2K in size, with some getting close to 4K, which
is more than the 1024-byte packet limit of BSD syslog (RFC 3164). The
second problem is finding complete implementations of
Reliable-SYSLOG-COOKED.
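
As a rough illustration of the size problem: RFC 3164 says the whole
BSD syslog packet must be 1024 bytes or less, so a sender can check
before transmitting whether a message fits. The header string and the
function name below are hypothetical, and the payloads are dummy filler
rather than real audit messages.

```python
BSD_SYSLOG_MAX_BYTES = 1024  # RFC 3164 limit on the total packet length


def bytes_over_limit(header: str, payload: str) -> int:
    """Return how many bytes the packet exceeds the limit by (0 if it fits)."""
    packet = (header + payload).encode("utf-8")
    return max(0, len(packet) - BSD_SYSLOG_MAX_BYTES)


# Hypothetical header: priority 85, timestamp, host name, and tag.
header = "<85>Dec  5 09:30:00 ward3 audit: "

print(bytes_over_limit(header, "x" * 500))   # small message: fits
print(bytes_over_limit(header, "x" * 2048))  # 2K message: does not fit
```

What to do when the check fails (truncate, split, or switch transports)
is exactly the open question.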

John

> -----Original Message-----
> From: loganalysis-bounces+john.moehrke=med.ge.com@private
> [mailto:loganalysis-bounces+john.moehrke=med.ge.com@private]
> On Behalf Of Anton Chuvakin
> Sent: Sunday, December 04, 2005 5:59 PM
> To: LogAnalysis@private
> Subject: [logs] regex-less parsing of messages
> 
> All,
> 
> It's time for me to come out of lurking again :-) Here is the thing:
> when people want to analyze logs, the first stage is often to tokenize
> (or "parse" as some say) the logs to some manageable format (XML
> anyone?) for analysis or RDBMS storage.
> 
> However, if logs are very diverse and lack a format in the first
> place, the above becomes a mammoth task, since one has to write a lot
> of ugly regular expressions. In addition, if a message with a new
> format comes out of the woodwork, a new regex needs to be created. Or,
> a silly generic regex is used (such as the one that only tokenizes the
> date and the device name from a Unix syslog message).
> 
> What are the possible ways around it? From what I know, none of them
> are easy or fun. One might try to use clustering (such as 'slct') to
> identify the variable and stable parts of messages from a bulk of
> them, but that still does not make them tokenized. Or, one can try
> to create a "brute forcing parser" that will try to guess, for
> example, that a part of a message that contains four groups of 1 to 3
> digits separated by dots is really an IP address. However, it will
> likely fail more often than not, and it is kinda hard :-) for it to
> tell a username from a password (both are strings). Or, one can do
> analysis without tokenizing the logs into a common format, such as
> with Bayes, by treating them as pretty much English text...
> 
> So, any more ideas from the group on handling it?
> 
> Best,
> --
> Anton Chuvakin, Ph.D., GCIA, GCIH, GCFA
>          http://www.chuvakin.org
>     http://www.securitywarrior.com
> _______________________________________________
> LogAnalysis mailing list
> LogAnalysis@private
> http://lists.shmoo.com/mailman/listinfo/loganalysis
> 
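
For comparison, the "brute forcing parser" idea from the quoted message
might look something like the sketch below. It is only a guess at what
such a parser would do: the function names and the sample log line are
invented, and it deliberately shows the weakness Anton points out,
namely that a username and a password both come back as plain strings.

```python
import ipaddress


def guess_token_type(token: str) -> str:
    """Guess what a single whitespace-separated token might be."""
    try:
        # Accepts exactly four dot-separated decimal groups in range 0-255.
        ipaddress.IPv4Address(token)
        return "ipv4"
    except ValueError:
        pass
    if token.isdigit():
        return "number"
    return "string"


def guess_fields(message: str):
    """Split a message on whitespace and label each token with a guess."""
    cleaned = (tok.strip(",;()[]") for tok in message.split())
    return [(tok, guess_token_type(tok)) for tok in cleaned if tok]


# Invented sample line, not taken from any real device.
for token, kind in guess_fields("login failed for root from 10.0.0.5 port 22"):
    print(f"{kind:8} {token}")
```

The IP address and the port number are easy to spot, but nothing in the
output says that "root" is a user name.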