[logs] Re: regex-less parsing of messages

ivan.arce@private

An old idea from a now-deprecated project at Core comes to mind.

The idea: represent and manipulate the log data visually, so that you
can identify patterns or derive 'principal components' using a GUI.
That is, you manipulate the *visual representation* of the log data,
not the log data itself.

The premise for the project was that the best pattern recognition device
available is the human brain, so one just needs to present information in
a manner that helps the human brain digest it.

In one of the prototypes that Core developed (Core Wisdom) there was a
tokenize-by-example feature where you would create rules from one sample
logline and they would be applied automatically to the entire log.
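
To make the tokenize-by-example idea concrete, here is a rough sketch in
Python. It is NOT how Core Wisdom implements it; the function names
(make_rule, apply_rule), the sample log lines, and the simplification of
splitting on whitespace are all my own assumptions for illustration. You
mark which tokens of one sample line are fields, and every unmarked token
becomes a literal that other lines must match.

```python
def make_rule(sample, fields):
    """Build a rule from one sample log line (hypothetical helper).

    fields maps a field name to the token in the sample that holds its
    value. Unmarked tokens are treated as literals.
    """
    tokens = sample.split()
    missing = [v for v in fields.values() if v not in tokens]
    if missing:
        raise ValueError("not whole tokens of the sample: %r" % missing)
    names = {value: name for name, value in fields.items()}
    # Each slot is (field_name, None) for a variable token,
    # or (None, literal_text) for a fixed token.
    return [(names[t], None) if t in names else (None, t) for t in tokens]


def apply_rule(rule, line):
    """Return the extracted fields, or None if the line does not fit."""
    tokens = line.split()
    if len(tokens) != len(rule):
        return None
    out = {}
    for (name, literal), tok in zip(rule, tokens):
        if name is None:
            if tok != literal:
                return None  # a fixed part of the message differs
        else:
            out[name] = tok
    return out


# Made-up sample line and log, for illustration only.
sample = "Failed password for root from 10.0.0.5 port 4242 ssh2"
rule = make_rule(sample, {"user": "root", "src_ip": "10.0.0.5", "port": "4242"})

log = [
    "Failed password for alice from 192.168.1.9 port 50122 ssh2",
    "Accepted password for bob from 10.0.0.7 port 2200 ssh2",
]
for line in log:
    print(apply_rule(rule, line))
```

The second line does not match because its first token differs from the
sample, so in practice you would build one rule per message type and try
them in turn.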

The prototypes and companion docs are here:
http://www.coresecurity.com/corelabs/projects/event_visualization_and_analysis.php

-ivan

Anton Chuvakin wrote:
> All,
> 
> It's time for me to come out of lurking again :-) Here is the thing:
> when people want to analyze logs, the first stage is often to tokenize
> (or "parse" as some say) the logs to some manageable format (XML
> anyone?) for analysis or RDBMS storage.
> 
> However, if logs are very diverse and lack a format in the first
> place, the above becomes a mammoth task, since one has to write a lot
> of ugly regular expressions. In addition, if a message with a new
> format comes out of the woodwork, a new regex needs to be created. Or,
> a silly generic regex is used (such as the one that only tokenizes the
> date and the device name from a Unix syslog message).
> 
> What are the possible ways around it? From what I know, none of the
> easy or fun ones. One might try to use clustering (such as 'slct') to
> try to identify the variable and stable parts of messages from a bulk
> of them, but that still does not make them tokenized. Or, one can try
> to create a "brute forcing parser" that will try to guess, for
> example, that a part of a message containing four dot-separated
> groups of 1 to 3 digits is really an IP address. However, it will
> likely fail more often than not, and it is kinda hard :-) for it to
> tell a username from a password (both are strings). Or, one can do
> analysis without tokenizing the logs into a common format, such as
> with Bayes, by treating them as pretty much English text...
> 
> So, any more ideas from the group on handling it?
> 
> Best,
> --
> Anton Chuvakin, Ph.D., GCIA, GCIH, GCFA
>          http://www.chuvakin.org
>     http://www.securitywarrior.com
> _______________________________________________
> LogAnalysis mailing list
> LogAnalysis@private
> http://lists.shmoo.com/mailman/listinfo/loganalysis
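
For reference, the "brute forcing parser" from the quoted message could
look roughly like the sketch below. The function names (guess_type,
guess_tokens) and the three type labels are hypothetical, picked only
for illustration; the IPv4 check relies on Python's standard ipaddress
module. It also shows the limitation Anton mentions: anything that is
not obviously an address or a number is just a "string".

```python
import ipaddress


def guess_type(token):
    """Guess a coarse type for one whitespace-separated token."""
    try:
        ipaddress.IPv4Address(token)
        return "ipv4"
    except ValueError:
        pass  # not a dotted-quad IPv4 address
    if token.isdigit():
        return "int"
    return "string"


def guess_tokens(line):
    """Pair every token of a log line with its guessed type."""
    return [(tok, guess_type(tok)) for tok in line.split()]


# Made-up message, for illustration only.
for tok, kind in guess_tokens("login root hunter2 from 10.0.0.5 port 4242"):
    print(tok, kind)
```

Here "root" and "hunter2" both come back as "string", so the guesser
alone cannot say which one is the username and which the password.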

-- 
---
To strive, to seek, to find, and not to yield.
- Alfred, Lord Tennyson, Ulysses, 1842

Ivan Arce
CTO
CORE SECURITY TECHNOLOGIES

46 Farnsworth Street
Boston, MA 02210
Ph: 617-399-6980
Fax: 617-399-6987
ivan.arce@private
www.coresecurity.com

PGP Fingerprint: C7A8 ED85 8D7B 9ADC 6836  B25D 207B E78E 2AD1 F65A
