RE: [logs] Re: Generic Log Message Parsing Tool

From: Dale.Drewat_private
Date: Wed Jun 05 2002 - 12:05:14 PDT


    For anyone who is interested, I've recently made a log policy engine (APE -
    Anomaly Policy Engine) available.  It's a fairly flexible, very robust log
    parsing agent that has a significant amount of rule hierarchy support and
    handles data at very high speeds, despite the fact that it's written in
    Perl.  :-)
     
    For those of you who are interested, you can find it at
    http://www.hackertracker.org/cst/cst.html under the "Intrusion Detection
    server" link.
     
    Dale
     
    -----Original Message-----
    From: Adam Sah [mailto:asahat_private] 
    Sent: Tuesday, June 04, 2002 6:33 PM
    To: loganalysisat_private
    Subject: Re: [logs] Re: Generic Log Message Parsing Tool 
     
    I don't know if this helps, but the Addamark LMS uses perl5 regular
    expressions to hack up the log into fields, then hits those fields with
    arbitrary expressions (SQL+perl).  We solve the performance problem by
    running the parse in parallel across a cluster of PCs-- this also provides
    linear scaling.  In practice, we've never had a problem parsing up
    somebody's log, including some crazy custom ones.
    Anyway, I've included a little writeup on our scheme/format below.  If it's
    helpful, feel free to steal the ideas-- our goal is to be compatible with
    whatever parsing format(s) become popular, and if they're based on us,
    that only makes our job easier ;-)
    adam 
    Adam Sah -- CTO, Addamark Technologies -- http://www.addamark.com/
    ..tear.along.dotted.line................................................ 
    The Addamark parsing script format is as follows: 
       ^...your regexp here...$ 
       name1:type,name2:type,name3:type,name4:type,... 
       ...your code here... 
    The regexp locates the individual fields in a given record; each match
    (paren-match) is given a "name" and forced into the given datatype as per
    the name:type line.  These parse fields are then made available to the
    code section, e.g. as variables.  In our case, the "code" is a SQL
    statement, in which you can embed Perl.  If you don't feel like writing a
    SQL engine (understandable!), you could jump straight into Perl.  For
    readability, we've added "...X" to the regexp language, which means
    "create a match out of anything up to the next X".
    For example, here's the Addamark script to parse an Apache weblog: 
    ^... ... ... \[...\] "... ... ..." ... ... "..." "..." ...$ 
    ClientIP:VARCHAR,unused1:VARCHAR,unused2:VARCHAR,tsStr:VARCHAR, 
       Method:VARCHAR, Url:VARCHAR, HttpVers:VARCHAR, RespCode:INT32, 
       RespSize:INT32, Referrer:VARCHAR, UserAgent:VARCHAR, RespTime:VARCHAR 
    SELECT  _strptime( tsStr, "%d/%b/%Y:%H:%M:%S %Z") as ts, 
            ClientIP, 
            _rev_dns(ClientIP) as ClientDNS, -- do a reverse DNS lookup, 
                      -- this too happens in parallel across the cluster 
            Method, 
            Url, 
            HttpVers, 
            RespCode, 
            RespSize, 
            Referrer, 
            UserAgent, 
            _int32(RespTime) as RespTime -- another way to parse strings to nums
    FROM stdin;
    You could replace the SELECT statement with some arbitrary Perl code, which 
       has some API for defining the output columns. 
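
    A minimal sketch of how the three-part script above might be interpreted,
    written in Python rather than Perl purely for illustration.  It assumes
    that "...X" expands to "capture everything that is not X" and that a
    trailing "..." captures the rest of the line; the helper names (translate,
    parse_schema, parse_record), the CASTS table, and the sample log line are
    hypothetical and not part of the Addamark product.

```python
import re

# Assumed mapping from the script's type names to Python casts;
# only the two types that appear in the Apache example are covered.
CASTS = {"VARCHAR": str, "INT32": int}

def translate(pattern):
    """Expand each '...' into a capture group that stops at the next literal."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("...", i):
            i += 3
            if i >= len(pattern) or pattern[i] == "$":
                out.append("(.*)")            # last field: take the rest of the line
            else:
                # The terminator may be backslash-escaped, as in "...\]".
                stop = pattern[i + 1] if pattern[i] == "\\" else pattern[i]
                out.append("([^" + re.escape(stop) + "]*)")
            # The terminator itself is copied by the next loop iteration.
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)

def parse_schema(spec):
    """Turn 'name:type,name:type,...' into a list of (name, cast) pairs."""
    fields = []
    for item in spec.split(","):
        name, type_name = item.strip().split(":")
        fields.append((name, CASTS[type_name]))
    return fields

def parse_record(line, regex, fields):
    """Return a dict of typed fields, or None if the line does not match."""
    m = regex.match(line)
    if m is None:
        return None
    return {name: cast(value) for (name, cast), value in zip(fields, m.groups())}

PATTERN = r'^... ... ... \[...\] "... ... ..." ... ... "..." "..." ...$'
SCHEMA = ("ClientIP:VARCHAR,unused1:VARCHAR,unused2:VARCHAR,tsStr:VARCHAR,"
          " Method:VARCHAR, Url:VARCHAR, HttpVers:VARCHAR, RespCode:INT32,"
          " RespSize:INT32, Referrer:VARCHAR, UserAgent:VARCHAR, RespTime:VARCHAR")

regex = re.compile(translate(PATTERN))
fields = parse_schema(SCHEMA)

# Made-up sample line in the shape the pattern expects.
line = ('192.0.2.7 - - [05/Jun/2002:12:05:14 -0700] "GET /index.html HTTP/1.0" '
        '200 2326 "http://example.com/" "Mozilla/4.0" 12')
record = parse_record(line, regex, fields)
print(record["ClientIP"], record["RespCode"], record["Url"])
```

    The resulting dict plays the role of the "parse fields" handed to the code
    section; a real log would also need a policy for values such as "-" in an
    INT32 column, which this sketch does not attempt.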
    notes: 
     - Multi-line records are relatively rare, so we handle them by
       pre-processing the log data so that the records are all on one line
       apiece.
     - Variant records are handled using regexp "union" ("|").  This doubles the
       number of parse fields, but that's easily re-unified in the code section.
       Perl5 regexps already handle binary data.
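
    The two notes above could be sketched roughly as follows, again in Python
    for illustration only.  The continuation-line convention (a line starting
    with whitespace belongs to the previous record), the two login message
    variants, and the function names are all assumptions invented for this
    example, not formats taken from any real log.

```python
import re

def join_continuations(lines):
    """Pre-process: fold lines that start with whitespace into the prior record."""
    record = None
    for raw in lines:
        line = raw.rstrip("\n")
        if record is not None and line[:1] in (" ", "\t"):
            record += " " + line.strip()
        else:
            if record is not None:
                yield record
            record = line
    if record is not None:
        yield record

# Two hypothetical variants joined with "|".  Each alternative brings its own
# capture groups, so two logical fields become four parse fields.
VARIANTS = re.compile(
    r"^(?:login user=(\S+) from=(\S+)"     # variant A fills groups 1 and 2
    r"|(\S+) logged in as (\S+))$"         # variant B fills groups 3 and 4
)

def unify(line):
    """Re-unify the doubled parse fields into one (user, host) record."""
    m = VARIANTS.match(line)
    if m is None:
        return None
    user_a, host_a, host_b, user_b = m.groups()
    return {
        "user": user_a if user_a is not None else user_b,
        "host": host_a if host_a is not None else host_b,
    }

raw_lines = [
    "login user=dale\n",
    "    from=10.0.0.5\n",              # continuation of the first record
    "10.0.0.9 logged in as adam\n",
]
records = [unify(line) for line in join_continuations(raw_lines)]
print(records)
```

    The re-unification step is the moral equivalent of a COALESCE over the
    paired columns in the SQL code section.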
     
    > On Tue, Jun 04, 2002 at 05:36:05PM -0400, Steve wrote: 
    > > I've been working on understanding the Perl module Parse::RecDescent 
    > > for just such a thing.  I suspect it would be possible to create a 
    > > stockpile of its "grammars" for many established log formats, and then 
    > > people would have an easier time modifying it for new formats. 
    >       That's a large part of what I've got so far; the problems 
    > I'm running into are scalability/performance ones--Parse::RecDescent is 
    > a beautiful beast, but not a very fast one at all.  Marcus was doing his 
    > version in C, which for performance reasons makes a lot of sense, so I 
    > was thinking that perhaps someone else might be able to pick up that 
    > path.  Perhaps a good way to start would be to ignore the implementation 
    > issues for now, and just start building a stockpile of grammars as you 
    > suggest; it should be relatively easy to convert a well-formed grammar to 
    > a lex/yacc syntax, yes?  (I don't know, as my C skills are less than 
    > stellar and I've never actually used lex/yacc.) 
    >       (I also started a conversation this morning with Damian 
    > Conway and Mark-Jason Dominus about a faster way to implement a parser 
    > in Perl, using iteration rather than recursion; it might be a long time 
    > before that pans out, but if it does, maybe Perl could remain a valid 
    > option as well.) 
    > 
    >       -- Sweth. 
    > 
    > -- 
    > Sweth Chandramouli      Idiopathic Systems Consulting 
    > svcat_private      http://www.idiopathic.net/
    > 
    > --------------------------------------------------------------------- 
    > To unsubscribe, e-mail: loganalysis-unsubscribeat_private 
    > For additional commands, e-mail: loganalysis-helpat_private 
    > 
     
     
    



    This archive was generated by hypermail 2b30 : Wed Jun 05 2002 - 12:09:38 PDT