Re: [logs] Re: Generic Log Message Parsing Tool

From: Adam Sah (asahat_private)
Date: Tue Jun 04 2002 - 17:33:02 PDT

    I don't know if this helps, but the Addamark LMS uses Perl 5 regular
       expressions to hack the log up into fields, then hits those fields with
       arbitrary expressions (SQL+Perl).  We solve the performance problem by
       running the parse in parallel across a cluster of PCs-- this also provides
       linear scaling.  In practice, we've never had a problem parsing
       somebody's log, including some crazy custom ones.
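    
    To make that concrete, here's a minimal Perl 5 sketch of the two-stage
       idea (generic illustration code, not the LMS itself; the sample record
       and field names are made up): a regexp hacks a record into fields, then
       arbitrary code runs over those fields.
    
       #!/usr/bin/perl
       use strict;
       use warnings;
    
       # a syslog-ish sample record (made up)
       my $line = 'Jun  4 17:33:02 myhost sshd[1234]: Accepted password for adam';
    
       # stage 1: the regexp hacks the record up into fields
       if ($line =~ /^(\w+\s+\d+ \d\d:\d\d:\d\d) (\S+) (\S+?)\[(\d+)\]: (.*)$/) {
           my ($ts, $host, $prog, $pid, $msg) = ($1, $2, $3, $4, $5);
    
           # stage 2: arbitrary expressions over those fields
           print "$host $prog: $msg\n" if $msg =~ /Accepted/;
       }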
    
    Anyway, I've included a little writeup on our scheme/format below.  If it's
       helpful, feel free to steal the ideas-- our goal is to be compatible with
       whatever parsing format(s) become popular, and if they're based on us,
       that only makes our job easier ;-)
    
    adam
    Adam Sah -- CTO, Addamark Technologies -- http://www.addamark.com/
    
    ..tear.along.dotted.line................................................
    
    The Addamark parsing script format is as follows:
    
       ^...your regexp here...$
       name1:type,name2:type,name3:type,name4:type,...
    
       ...your code here...
    
    The regexp locates the individual fields in a given record; each match
       (paren-match) is given a "name" and forced into the given datatype as per
       the name:type line.  These parsed fields are then made available to the
       code section, e.g. as variables.  In our case, the "code" is a SQL
       statement, in which you can embed Perl.  If you don't feel like writing a
       SQL engine (understandable!), you could jump straight into Perl.  For
       readability, we've added "...X" to the regexp language, which means
       "create a match out of anything up to the next X".
    
    For example, here's the Addamark script to parse an Apache weblog:
    
    ^... ... ... \[...\] "... ... ..." ... ... "..." "..." ...$
    ClientIP:VARCHAR,unused1:VARCHAR,unused2:VARCHAR,tsStr:VARCHAR,
       Method:VARCHAR, Url:VARCHAR, HttpVers:VARCHAR, RespCode:INT32,
       RespSize:INT32, Referrer:VARCHAR, UserAgent:VARCHAR, RespTime:VARCHAR
    SELECT  _strptime( tsStr, "%d/%b/%Y:%H:%M:%S %Z") as ts,
            ClientIP,
            _rev_dns(ClientIP) as ClientDNS, -- do a reverse DNS lookup,
                      -- this too happens in parallel across the cluster
            Method,
            Url,
            HttpVers,
            RespCode,
            RespSize,
            Referrer,
            UserAgent,
            _int32(RespTime) as RespTime -- another way to parse strings to nums
    FROM stdin;
    
    You could replace the SELECT statement with arbitrary Perl code, given
       some API for defining the output columns.
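    
    For instance, a rough all-Perl version of the Apache parse above could look
       like the sketch below; emit_row() is a made-up stand-in for that
       output-column API, and the Addamark built-ins _strptime() and _rev_dns()
       aren't reproduced here:
    
       #!/usr/bin/perl
       use strict;
       use warnings;
    
       my @names = qw(ClientIP unused1 unused2 tsStr Method Url HttpVers
                      RespCode RespSize Referrer UserAgent RespTime);
    
       sub emit_row {                       # made-up output-column API
           my %row = @_;
           print join("\t", @row{@names}), "\n";
       }
    
       while (my $line = <STDIN>) {
           chomp $line;
           my @f = $line =~ m{
               ^(\S+)\ (\S+)\ (\S+)\          # ClientIP, unused1, unused2
               \[([^\]]+)\]\                  # tsStr
               "(\S+)\ (\S+)\ (\S+)"\         # Method, Url, HttpVers
               (\S+)\ (\S+)\                  # RespCode, RespSize
               "([^"]*)"\ "([^"]*)"\          # Referrer, UserAgent
               (\S+)$                         # RespTime
           }x or next;
           my %row;
           @row{@names} = @f;
           $row{$_} += 0 for qw(RespCode RespSize);   # coerce to numbers, a la INT32
           emit_row(%row);
       }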
    
    notes:
     - Multi-line records are relatively rare, so we handle them by pre-processing
       the log data so that the records are all on one line apiece.  
    
     - Variant records are handled using regexp "union" ("|").  This doubles the
       number of parse fields, but that's easily re-unified in the code section
       (see the sketch below).
    
     - Perl5 regexps already handle binary data.
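    
    A tiny sketch of that variant-record trick (the two record formats here are
       made up): the "|" union doubles the captures, and the code section folds
       them back together.
    
       use strict;
       use warnings;
    
       # two record variants glued together with "|":
       #   variant A: "login <user> from <ip>"    variant B: "logout <user>"
       my $re = qr/^(?:login (\S+) from (\S+)|logout (\S+))$/;
    
       for my $rec ('login adam from 10.0.0.1', 'logout adam') {
           my ($userA, $ip, $userB) = $rec =~ $re or next;
    
           # re-unify the doubled parse fields
           my $user = defined $userA ? $userA : $userB;
           printf "user=%s ip=%s\n", $user, defined $ip ? $ip : '-';
       }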
    
    
    > On Tue, Jun 04, 2002 at 05:36:05PM -0400, Steve wrote:
    > > I've been working on understanding the Perl module Parse::RecDescent
    > > for just such a thing.  I suspect it would be possible to create a
    > > stockpile of its "grammars" for many established log formats, and then
    > > people would have an easier time modifying it for new formats.
    > 	That's a large part of what I've got so far; the problems
    > I'm running into are scalability/performance ones--Parse::RecDescent is
    > a beautiful beast, but not a very fast one at all.  Marcus was doing his
    > version in C, which for performance reasons makes a lot of sense, so I
    > was thinking that perhaps someone else might be able to pick up that
    > path.  Perhaps a good way to start would be to ignore the implementation
    > issues for now, and just start building a stockpile of grammars as you
    > suggest; it should be relatively easy to convert a well-formed grammar to
    > a lex/yacc syntax, yes?  (I don't know, as my C skills are less than
    > stellar and I've never actually used lex/yacc.)
    > 	(I also started a conversation this morning with Damian
    > Conway and Mark-Jason Dominus about a faster way to implement a parser
    > in Perl, using iteration rather than recursion; it might be a long time
    > before that pans out, but if it does, maybe Perl could remain a valid
    > option as well.)
    > 
    > 	-- Sweth.
    > 
    > -- 
    > Sweth Chandramouli      Idiopathic Systems Consulting
    > svcat_private      http://www.idiopathic.net/
    > 
    
    
    


