[logs] Re: regex-less parsing of messages

adi@private

This is literally the million-dollar question of the log analysis industry,
and as a log analyzer developer I think about it every day. I guess everyone
agrees that log analysis should be the job of AIs, and given current
technology there are just a few potential approaches:

1. Expert systems - a collection of empirical data and decision algorithms
compiled by developers. Most log analysis solutions (including ours)
implement this type of AI.
2. Hidden Markov models - since they are used in natural language and speech
processing, they might be applicable to log entries (which are, after all, a
kind of "natural speech").
3. Neural nets - once built, the neural net would be trained by experienced
teachers (log analysis gurus).
4. Genetic algorithms - the trick would be to (a) define the right
requirements (for example, find the smallest number of message types that
does not discard significant data) and (b) define the genetic codes for the
solution organisms. Maybe GAs are a bit far-fetched, but I wouldn't exclude
them.
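
To make option 1 concrete, here is a minimal sketch of a regex-less rule base
of the kind Anton describes in the quoted message: hand-written predicates
that guess what a token is. Everything in it (the names is_ipv4, is_number,
RULES and tokenize, and the choice of whitespace splitting) is my own
assumption for illustration, not taken from any existing product.

```python
# Illustrative sketch only: a tiny "expert system" whose rules are plain
# Python predicates instead of regular expressions. All names are hypothetical.

def is_ipv4(token):
    """Guess: four dot-separated groups of 1-3 digits, each at most 255."""
    parts = token.split(".")
    if len(parts) != 4:
        return False
    return all(
        p.isascii() and p.isdigit() and len(p) <= 3 and int(p) <= 255
        for p in parts
    )


def is_number(token):
    """Guess: a run of ASCII digits."""
    return token.isascii() and token.isdigit()


# Ordered rule base: the first rule that fires wins, so more specific
# guesses go first. Adding knowledge means adding a rule by hand.
RULES = [
    ("ip", is_ipv4),
    ("number", is_number),
]


def tokenize(message):
    """Split on whitespace and label each token with the first matching rule."""
    labeled = []
    for raw in message.split():
        token = raw.strip(",;:()[]")  # drop common surrounding punctuation
        if not token:
            continue
        label = "word"  # fallback when no rule fires
        for name, predicate in RULES:
            if predicate(token):
                label = name
                break
        labeled.append((label, token))
    return labeled


if __name__ == "__main__":
    print(tokenize("Failed password for root from 10.1.2.3 port 22"))
```

Note that "root" comes out as a plain word: a rule base like this has no way
to tell a username from a password unless someone writes a rule that looks at
the surrounding words, which is exactly the brute-force rule writing that
expert systems depend on.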

The problem is that most developers can only program some sort of expert
system and add rules to it by brute force. The other three methods (if they
are really applicable to log parsing) require (very) advanced mathematical
skills and expensive hardware - this is the realm of Ph.D.s and research
labs. The way I see it, we'll be stuck with "expert systems" for a while -
the market for log analysis software is not rich enough to justify the kind
of investment required to keep a couple of Ph.D.s on the payroll.

Anton mentioned Bayes, but personally I would see Bayesian logic used in
analyzing the results of the log analysis rather than in the actual parsing
of the log entries. The analyzer would continually learn patterns from the
daily traffic, so its ability to raise "real alarms" and discard false
positives would improve with every log analyzed.
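
A rough sketch of what I mean, assuming the parser has already turned each
log entry into a list of feature strings such as "event=login_failed". The
class name BayesTriage, the two labels "alarm" and "noise", and the feature
strings are all made up for the example; this is a textbook naive Bayes
classifier with add-one smoothing, not a description of any shipping
analyzer.

```python
import math
from collections import Counter


class BayesTriage:
    """Naive Bayes over already-parsed event features (hypothetical sketch)."""

    LABELS = ("alarm", "noise")

    def __init__(self):
        self.feature_counts = {label: Counter() for label in self.LABELS}
        self.event_counts = Counter()
        self.vocabulary = set()

    def learn(self, features, label):
        """Record one parsed event that an analyst has labeled."""
        self.feature_counts[label].update(features)
        self.event_counts[label] += 1
        self.vocabulary.update(features)

    def score(self, features, label):
        """Log of (smoothed prior * smoothed feature likelihoods)."""
        total_events = sum(self.event_counts.values())
        log_prob = math.log(
            (self.event_counts[label] + 1) / (total_events + len(self.LABELS))
        )
        total_features = sum(self.feature_counts[label].values())
        vocab_size = len(self.vocabulary) + 1  # one extra slot for unseen features
        for feature in features:
            count = self.feature_counts[label][feature]
            log_prob += math.log((count + 1) / (total_features + vocab_size))
        return log_prob

    def classify(self, features):
        """Return the label with the higher score."""
        return max(self.LABELS, key=lambda label: self.score(features, label))


if __name__ == "__main__":
    triage = BayesTriage()
    triage.learn(["event=login_failed", "user=root", "src=external"], "alarm")
    triage.learn(["event=login_failed", "user=admin", "src=external"], "alarm")
    triage.learn(["event=login_failed", "user=backup", "src=internal"], "noise")
    triage.learn(["event=session_opened", "user=backup", "src=internal"], "noise")
    print(triage.classify(["event=login_failed", "user=guest", "src=external"]))
```

Every call to learn() updates the counts, so each day's labeled traffic
shifts the scores a little. The parsing itself still has to come from
somewhere else.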

Regards,

Adrian Grigorof
www.firegen.com

----- Original Message ----- 
From: "Anton Chuvakin" <anton@private>
To: <LogAnalysis@private>
Sent: Sunday, December 04, 2005 18:58
Subject: [logs] regex-less parsing of messages

> All,
>
> It's time for me to come out of lurking again :-) Here is the thing:
> when people want to analyze logs, the first stage is often to tokenize
> (or "parse" as some say) the logs to some manageable format (XML
> anyone?) for analysis or RDBMS storage.
>
> However, if logs are very diverse and lack a format in the first
> place, the above becomes a mammoth task, since one has to write a lot
> of ugly regular expressions. In addition, if a message with a new
> format comes out of the woodwork, a new regex needs to be created. Or,
> a silly generic regex is used (such as the one that only tokenizes the
> date and the device name from a Unix syslog message).
>
> What are the possible ways around it? From what I know, none of the
> easy or fun ones. One might try to use clustering (such as 'slct') to
> try to identify the variable and stable parts of messages from a bulk
> of them, but that still does not make them tokenized. Or, one can try
> to create a "brute forcing parser" that will try to guess, for
> example, that a part of a message that contains from 1 to 3 digits in 4
> quads with dots is really an IP address. However, it will likely fail
> more often than not, and it is kinda hard :-) for it to tell a
> username from a password (both are strings). Or, one can do analysis
> without tokenizing the logs into a common format, such as with Bayes,
> by treating them as pretty much English text...
>
> So, any more ideas from the group on handling it?
>
> Best,
> --
> Anton Chuvakin, Ph.D., GCIA, GCIH, GCFA
>          http://www.chuvakin.org
>     http://www.securitywarrior.com
> _______________________________________________
> LogAnalysis mailing list
> LogAnalysis@private
> http://lists.shmoo.com/mailman/listinfo/loganalysis
>
>
