[logs] regex-less parsing of messages

From: Anton Chuvakin (anton@private)
Date: Sun Dec 04 2005 - 15:58:31 PST


It's time for me to come out of lurking again :-) Here is the thing:
when people want to analyze logs, the first stage is often to tokenize
(or "parse" as some say) the logs into some manageable format (XML
anyone?) for analysis or RDBMS storage.

However, if logs are very diverse and lack a format in the first
place, the above becomes a mammoth task, since one has to write a lot
of ugly regular expressions. In addition, if a message with a new
format comes out of the woodwork, a new regex needs to be created. Or,
a silly generic regex is used (such as the one that only tokenizes the
date and the device name from a Unix syslog message).
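
To make the "silly generic regex" concrete, here is a minimal Python
sketch of one. It assumes the classic "Mmm dd hh:mm:ss host message"
syslog layout; the pattern, the name parse_syslog_header, and the sample
line are made up for illustration and do not come from any particular
tool.

```python
import re

# Assumed layout: "Mmm dd hh:mm:ss host rest-of-message".
# Only the timestamp and the host are tokenized; the rest stays opaque.
SYSLOG_HEADER = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<message>.*)$"
)

def parse_syslog_header(line):
    """Return a dict with timestamp, host and message, or None if no match."""
    match = SYSLOG_HEADER.match(line)
    return match.groupdict() if match else None

# Hypothetical sample line, not taken from a real log.
sample = "Dec  4 15:58:31 gw1 sshd[2211]: Failed password for root from 10.1.2.3"
print(parse_syslog_header(sample))
```

Everything after the host name lands in "message" untouched, which is
exactly why such a regex is generic and also why it is not much help.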

What are the possible ways around it? From what I know, there are no
easy or fun ones. One might try to use clustering (such as 'slct') to
try to identify the variable and stable parts of messages from a bulk
of them, but that still does not make them tokenized. Or, one can try
to create a "brute forcing parser" that will try to guess, for
example, that a part of a message that contains four groups of one to
three digits separated by dots is really an IP address. However, it will
likely fail more often than not, and it is kinda hard :-) for it to tell a
username from a password (both are strings). Or, one can do analysis
without tokenizing the logs into a common format, such as with Bayes,
by treating them as pretty much English text...
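
A rough Python sketch of the "brute forcing parser" idea, without any
regexes, might look like the following. The function names guess_type
and tokenize, the type labels, and the punctuation set being stripped
are all assumptions for illustration; only the standard ipaddress
module is used to validate the dotted quad.

```python
import ipaddress

def guess_type(token):
    """Guess a coarse type for one whitespace-separated token."""
    stripped = token.strip(",;:()[]\"'")
    if not stripped:
        return "punct"
    # Dotted quad: exactly four dot-separated groups of 1 to 3 digits.
    parts = stripped.split(".")
    if len(parts) == 4 and all(p.isdigit() and len(p) <= 3 for p in parts):
        try:
            ipaddress.IPv4Address(stripped)
            return "ipv4"
        except ValueError:
            # Looks like an address but is out of range, e.g. 999.1.1.1.
            return "dotted-number"
    if stripped.isdigit():
        return "int"
    # Usernames, passwords and everything else end up here.
    return "string"

def tokenize(message):
    """Split on whitespace and attach a guessed type to every token."""
    return [(tok, guess_type(tok)) for tok in message.split()]

# Hypothetical message, not taken from a real log.
for token, kind in tokenize("Failed password for root from 10.1.2.3 port 4022"):
    print(kind, token)
```

The weak spots show up right away: a four-part version number such as
1.2.3.4 would be labeled as an IP address, and "root" gets the same
"string" label that a password would.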

So, any more ideas from the group on handling it?

Anton Chuvakin, Ph.D., GCIA, GCIH, GCFA
LogAnalysis mailing list
