This is literally the million-dollar question of the log analysis industry, and as a log analyzer developer it's on my mind every day. I guess everyone agrees that log analysis should be the job of AIs, and given the current technologies there are just a few potential approaches:

1. Expert system - a collection of empirical data and decision algorithms compiled by developers. Most of the log analysis solutions (including ours) implement this type of AI.

2. Hidden Markov models - since they are used in natural language and speech processing, they might be applicable to log entries (they are, after all, a type of "natural speech").

3. Neural nets - once built, the neural net would be trained by experienced teachers (log analysis gurus).

4. Genetic algorithms - the trick would be to (a) define the right requirements (for example, determine the least number of message types without discarding significant data) and (b) define the genetic codes for the solution organisms. Maybe GAs are a bit far-fetched, but I wouldn't exclude them.

The problem is that most developers can only program some sort of expert system and add rules to it by brute force. The other three methods (if really applicable to log parsing) require (very) advanced mathematical skills and expensive hardware - this is the realm of Ph.D.s and research labs. The way I see it, we'll be stuck with "expert systems" for a while - the market for log analysis software is not rich enough to justify the kind of investment required to keep a couple of Ph.D.s on your payroll.

Anton mentioned Bayes, but personally I would see Bayesian logic used in analyzing the results of the log analysis rather than in the actual parsing of the log entries. The analyzer would continually learn patterns from the daily traffic, so the ability to raise "real alarms" and discard false positives would increase with every log analyzed (a rough sketch of what I mean follows after my signature).

Regards,

Adrian Grigorof
www.firegen.com
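To make the "continually learn patterns" part a bit more concrete, here is a very rough Python sketch - untested, all the names are invented by me, and it uses plain token-frequency counting rather than proper Bayesian inference - of an analyzer that learns statistics from the logs it has already seen and gives a high "surprise" score to entries that do not look like the daily traffic. It is not how our product works, just an illustration:

import math
import re
from collections import Counter

class LearningAnalyzer:
    """Learns token frequencies from past log entries and scores new ones.

    Entries made of tokens that were rarely (or never) seen before get a
    high "surprise" score and can be raised as alarms; routine entries
    score low and can be discarded as noise."""

    def __init__(self):
        self.token_counts = Counter()
        self.total_tokens = 0

    def _tokens(self, line):
        # Keep alphanumerics plus . _ - so IPs and hostnames stay whole;
        # lowercase so "Failed" and "failed" count as the same token.
        return [t.lower() for t in re.findall(r"[A-Za-z0-9._-]+", line)]

    def learn(self, line):
        toks = self._tokens(line)
        self.token_counts.update(toks)
        self.total_tokens += len(toks)

    def surprise(self, line):
        # Average negative log-probability of the tokens, with add-one
        # smoothing so a never-seen token does not produce log(0).
        toks = self._tokens(line)
        if not toks:
            return 0.0
        vocab = len(self.token_counts) + 1
        score = 0.0
        for t in toks:
            p = (self.token_counts[t] + 1) / (self.total_tokens + vocab)
            score -= math.log(p)
        return score / len(toks)

# Feed yesterday's traffic to learn(), then score today's entries.
analyzer = LearningAnalyzer()
analyzer.learn("sshd: Accepted password for bob from 10.0.0.5 port 51122")
analyzer.learn("sshd: Accepted password for alice from 10.0.0.7 port 40211")
print(analyzer.surprise("sshd: Accepted password for bob from 10.0.0.5 port 51122"))  # low
print(analyzer.surprise("kernel: Oops: unable to handle kernel paging request"))      # high

Every log fed to learn() improves the statistics, which is the point: the ability to tell a real alarm from routine noise should get better with every log analyzed.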
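On the "brute forcing parser" from Anton's original message (quoted below): a guessing tokenizer like the one he describes is easy enough to start - again just an untested sketch of mine, not anything we ship - but it shows exactly the weakness he points out: anything that is not an obvious IP, timestamp or number stays a plain string, so a username and a password look the same to it.

import re

# Ordered guesses: the first pattern that matches a token wins. These cover
# only the "easy" fields; everything else stays an opaque WORD.
GUESSES = [
    ("IPV4",      re.compile(r"^(\d{1,3}\.){3}\d{1,3}$")),  # also matches 999.999.999.999
    ("TIMESTAMP", re.compile(r"^\d{2}:\d{2}:\d{2}$")),
    ("NUMBER",    re.compile(r"^\d+$")),
]

def guess_tokens(line):
    """Split a raw log line on whitespace and label each token with a guess."""
    labeled = []
    for token in line.split():
        for label, pattern in GUESSES:
            if pattern.match(token):
                labeled.append((label, token))
                break
        else:
            labeled.append(("WORD", token))  # could be a username, a password, a host...
    return labeled

print(guess_tokens("Dec 04 18:58:01 fw1 deny tcp 192.168.1.10 -> 10.0.0.5 port 445"))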
----- Original Message -----
From: "Anton Chuvakin" <anton@private>
To: <LogAnalysis@private>
Sent: Sunday, December 04, 2005 18:58
Subject: [logs] regex-less parsing of messages


> All,
>
> It's time for me to come out of lurking again :-) Here is the thing:
> when people want to analyze logs, the first stage is often to tokenize
> (or "parse" as some say) the logs into some manageable format (XML
> anyone?) for analysis or RDBMS storage.
>
> However, if logs are very diverse and lack a format in the first
> place, the above becomes a mammoth task, since one has to write a lot
> of ugly regular expressions. In addition, if a message with a new
> format comes out of the woodwork, a new regex needs to be created. Or,
> a silly generic regex is used (such as the one that only tokenizes the
> date and the device name from a Unix syslog message).
>
> What are the possible ways around it? From what I know, none of the
> easy or fun ones. One might try to use clustering (such as 'slct') to
> try to identify the variable and stable parts of messages from a bulk
> of them, but that still does not make them tokenized. Or, one can try
> to create a "brute forcing parser" that will try to guess, for
> example, that a part of a message that contains from 1 to 3 digits in 4
> quads with dots is really an IP address. However, it will likely fail
> more often than not, and it is kinda hard :-) for it to tell a
> username from a password (both are strings). Or, one can do analysis
> without tokenizing the logs into a common format, such as with Bayes,
> by treating them as pretty much English text...
>
> So, any more ideas from the group on handling it?
>
> Best,
> --
> Anton Chuvakin, Ph.D., GCIA, GCIH, GCFA
> http://www.chuvakin.org
> http://www.securitywarrior.com

_______________________________________________
LogAnalysis mailing list
LogAnalysis@private
http://lists.shmoo.com/mailman/listinfo/loganalysis