[logs] Re: regex-less parsing of messages

frank@private

Ya know, I like this "back-to-basics" sort of question.  I keep telling
myself that the reason the "log analysis" problem is so hard is that
we're not really asking the right question:

"I don't know what I want to see, but I will know it when I see it."

That is a vague programming specification, and it will always lead to a
vague and unsatisfying program.

Since our central syslog server went live in September of 2001 we have
collected 2.9 x 10^9 log messages, dutifully burned to CDs or DVDs,
with an online summary that consists merely of how many were
collected per time period, by server, classified by Facility and
Priority.  Yes, we can run SQL queries that parse through the text part
of the messages in the last 600GB collected; assuming you know what
you're looking for, that's somewhat useful.  And yes, we've got pretty
web screens that show you the last 100 or 1000 or 10000 rows collected
or allow you to drill down into the online data.  And we've made
attempts at notifying folks when critical messages are logged.  But is
that what is meant by analysis?
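
For readers who have not built one, the kind of count-only summary
described above can be sketched roughly as follows. This is a minimal
illustration, not the system described in this message: it assumes
BSD-style syslog lines that still carry their <PRI> prefix, it buckets
by hour, and the names FACILITIES, SEVERITIES, LINE and summarize are
made up for the example. "Priority" is taken here to mean the severity
level encoded in the PRI value.

```python
import re
from collections import Counter

# Conventional short names; the index is the numeric facility / severity code.
FACILITIES = ["kern", "user", "mail", "daemon", "auth", "syslog", "lpr", "news",
              "uucp", "cron", "authpriv", "ftp", "ntp", "audit", "alert", "clock",
              "local0", "local1", "local2", "local3",
              "local4", "local5", "local6", "local7"]
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]

# Assumed line shape: "<PRI>Mon DD HH:MM:SS host rest-of-message"
LINE = re.compile(r"^<(\d{1,3})>(\w{3}\s+\d{1,2}) (\d{2}):\d{2}:\d{2} (\S+) ")

def summarize(lines):
    """Count messages per (day, hour, host, facility, severity)."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if not m or int(m.group(1)) > 191:   # 191 = 23 * 8 + 7, the largest valid PRI
            counts[("unparsed",)] += 1
            continue
        pri = int(m.group(1))
        facility = FACILITIES[pri // 8]      # PRI = facility * 8 + severity
        severity = SEVERITIES[pri % 8]
        counts[(m.group(2), m.group(3), m.group(4), facility, severity)] += 1
    return counts
```

Note that nothing here looks at the message text at all, which is the
limitation being discussed.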

Our logs have always been much more useful in answering post-mortem
questions than in actual detection of problems.  I question the cost vs.
benefit of dedicating enough resources to parse and store data that has
such low information content.  And I think we're being relatively
selective about what we log.  I can't imagine having to sift through all
the sub-threshold data from our firewalls for example.

I remember, early on, Tina asking for examples of what kinds of messages
we look for. . .very few answers.  Perhaps that tells us something about
how we're approaching this.

Frank

Frank Solomon
University of Kentucky
Lead Systems Programmer, Enterprise Systems
http://www.franksolomon.net
"If you give someone a program, you will frustrate them for a day; if
you teach them how to program, you will frustrate them for a lifetime."
--Anonymous

-----Original Message-----
From: loganalysis-bounces+sysfrank=uky.edu@private
[mailto:loganalysis-bounces+sysfrank=uky.edu@private] On Behalf
Of Anton Chuvakin
Sent: Sunday, December 04, 2005 6:59 PM
To: LogAnalysis@private
Subject: [logs] regex-less parsing of messages

All,

It's time for me to come out of lurking again :-) Here is the thing:
when people want to analyze logs, the first stage is often to tokenize
(or "parse" as some say) the logs to some manageable format (XML
anyone?) for analysis or RDBMS storage.

However, if logs are very diverse and lack a format in the first
place, the above becomes a mammoth task, since one has to write a lot
of ugly regular expressions. In addition, if a message with a new
format comes out of the woodwork, a new regex needs to be created. Or,
a silly generic regex is used (such as the one that only tokenizes the
date and the device name from a Unix syslog message).
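
As an illustration of the "silly generic regex" mentioned above, here is
a minimal sketch. It assumes the traditional "Mon DD HH:MM:SS host ..."
layout of a Unix syslog line; the names GENERIC and tokenize_generic are
hypothetical. Everything after the host is left as one opaque string.

```python
import re

# Only the timestamp and the host are tokenized; the rest stays free text.
GENERIC = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<message>.*)$"
)

def tokenize_generic(line):
    """Return a dict with timestamp, host and message, or None if no match."""
    m = GENERIC.match(line)
    return m.groupdict() if m else None
```

Usage: tokenize_generic("Dec  4 18:59:01 fw1 sshd[123]: Failed password
for root") returns the timestamp and "fw1" as separate fields, but the
user name stays buried inside the message field.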

What are the possible ways around it? From what I know, none of the
easy or fun ones. One might try to use clustering (such as 'slct') to
try to identify the variable and stable parts of messages from a bulk
of them, but that still does not make them tokenized. Or, one can try
to create a "brute forcing parser" that will try to guess, for
example, that a part of a message made of 4 groups of 1 to 3 digits
separated by dots is really an IP address. However, it will likely fail
more often than not, and it is kinda hard :-) for it to tell a
username from a password (both are strings). Or, one can do analysis
without tokenizing the logs into a common format, such as with Bayes,
by treating them as pretty much English text...
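
To make the "brute forcing parser" idea concrete, here is a minimal
sketch under stated assumptions: tokens are split on whitespace, only
three guesses are attempted (IPv4 address, integer, plain string), and
the function names guess_type and guess_tokens are hypothetical. It uses
Python's standard ipaddress module to validate dotted quads.

```python
import ipaddress
import re

def guess_type(token):
    """Guess a type for one whitespace-delimited token."""
    stripped = token.strip(",;()[]\"'")      # drop common surrounding punctuation
    try:
        ipaddress.IPv4Address(stripped)      # raises ValueError if not a valid dotted quad
        return "ipv4"
    except ValueError:
        pass
    if re.fullmatch(r"\d+", stripped):
        return "int"
    return "string"                          # user names and passwords both end up here

def guess_tokens(message):
    """Return (token, guessed_type) pairs for a free-text message."""
    return [(tok, guess_type(tok)) for tok in message.split()]
```

Run on "Failed password for root from 10.0.0.5 port 22", this tags
10.0.0.5 as an address and 22 as an integer, while "root" is just
another string. Nothing tells the guesser that it is a user name, which
is the weakness noted above.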

So, any more ideas from the group on handling it?

Best,
--
Anton Chuvakin, Ph.D., GCIA, GCIH, GCFA
         http://www.chuvakin.org
    http://www.securitywarrior.com
_______________________________________________
LogAnalysis mailing list
LogAnalysis@private
http://lists.shmoo.com/mailman/listinfo/loganalysis