Re: [logs] Re: Generic Log Message Parsing Tool

From: Marcus J. Ranum (mjrat_private)
Date: Wed Jun 05 2002 - 06:08:44 PDT


    >I don't know if this helps, but the Addamark LMS uses perl5 regular
    >   expressions to hack up the log into fields
    
    My experience is that regexps are absolutely the wrong way to
    go about log parsing. Consider that regexps _are_ a parser (just
    a bad one!) and ask yourself what happens if, at the terminal
    nodes of a recursive descent parser, you have _another_ recursive
    descent parser. That just doesn't make any sense!!! The results
    you get by using regexps aren't predictable enough. To "fix" that
    you'll wind up doing match-scoring tricks (if you care) to
    differentiate between:
    "S[uU]: .*"
    and
    ".*"
    
    So that's a contrived example but it brings out a couple of
    important things I noticed about log parsing:
            - case doesn't matter: make everything case-insensitive
            - wildcards are useless: whitespace is everything
    
    The approach I was working on relied on correct matching of
    combinations of space and non-space. Regexps are really a pain
    in the butt if you want to match on whitespace. You need to use
    something like " *"... oops, wait, there could be tabs, so "[ \t]*"...
    and oops, you can't handle newlines right... Eeeew... Regexps are a good
    tool for simple searching - they're not a good tool for simple
    parsing. They're really not a good tool for complex parsing.
    
    So let's look at another approach and then I'll generalize it
    to a broader approach... First, assume you're using a matching
    language that looks like:
    %s = matches string
    %6s = matches 6 chars in a string
    %d = matches number (including negative numbers)
    %f = matches a floating point #
    %w = matches whitespace explicitly (e.g.: %5w = matches 5 spaces
            or tabs)
    (whitespace) matches whitespace - any quantity
    newlines can be embedded literally in the match string, which
    greatly helps readability...
    
    and everything else is a literal. Everything is case insensitive.
    If you think about implementing something like this in C you can
    guess it's about 3 pages of code - 2 if you get tricky. It'll run
    extremely fast.
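    A minimal Python sketch of such a matcher (my own illustration, not
    the C code described above; only %s, %d, literals, and whitespace
    are handled, and all names here are invented):

```python
def match_spec(spec, line):
    """Return the list of captured fields if `line` matches `spec`,
    else None.  Literals match case-insensitively; a space in the
    spec matches any run of whitespace."""
    fields = []
    i = j = 0  # i indexes spec, j indexes line
    while i < len(spec):
        c = spec[i]
        if c == '%' and spec[i + 1:i + 2] == 's':
            # %s: capture non-whitespace, stopping early at the
            # literal character that follows the directive (if any)
            stop = spec[i + 2].lower() if i + 2 < len(spec) else None
            start = j
            while (j < len(line) and not line[j].isspace()
                   and line[j].lower() != stop):
                j += 1
            if j == start:
                return None
            fields.append(line[start:j])
            i += 2
        elif c == '%' and spec[i + 1:i + 2] == 'd':
            # %d: capture an (optionally negative) integer
            start = j
            if j < len(line) and line[j] == '-':
                j += 1
            while j < len(line) and line[j].isdigit():
                j += 1
            if not line[start:j].lstrip('-').isdigit():
                return None
            fields.append(int(line[start:j]))
            i += 2
        elif c == ' ':
            # a spec space matches any run of spaces or tabs
            if j >= len(line) or not line[j].isspace():
                return None
            while j < len(line) and line[j].isspace():
                j += 1
            i += 1
        else:
            # everything else is a case-insensitive literal
            if j >= len(line) or line[j].lower() != c.lower():
                return None
            i += 1
            j += 1
    return fields if j == len(line) else None
```

    A mis-match anywhere returns None immediately, which is what makes
    trying 3 or 4 specs per message cheap.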
    
    Now you match using a spec like:
    "Su: BADSU %s (tty %s)"
    which won't match everything but if your matching routine is
    fast and efficient enough you can use 3 or 4 rules and it'll
    still cost less than using a regexp AND be more reliable. Let
    me explain reliability scoring (something regexps don't have) -
    treat each item in the match string as a unique item worth
    1 point when you match it. So, the pattern:
    "Su: BADSU %s (tty %s)"
    contains 19 items: "Su:" is 3 items, "Su: " is 4 (counting the
    whitespace), and so on.
    If you mis-match anything the score drops to zero. If you get to
    the end of the match string, the match string with the most
    SPECIFIC match wins. Consider:
    "%s: %s %s (tty %s)"
    That's 14 items - even though it'd also match our BADSU message
    it's not worth as much as the more specific match string. So it's
    less likely to get picked.
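    The scoring rule above is easy to mechanize. A hypothetical Python
    sketch counting items the same way (each literal character,
    whitespace included, is 1 point; each bare %-directive is 1 point;
    counted directives like %6s are not handled):

```python
def spec_score(spec):
    """Score a match spec: 1 point per literal character (whitespace
    included) and 1 point per %-directive, so a more specific spec
    scores higher."""
    score, i = 0, 0
    while i < len(spec):
        if spec[i] == '%' and spec[i + 1:i + 2] in ('s', 'd', 'f', 'w'):
            i += 2  # a directive counts as a single item
        else:
            i += 1  # every literal character is one item
        score += 1
    return score

def best_match(specs, line, matches):
    """Of the specs whose matcher accepted `line` (`matches` is any
    predicate taking (spec, line)), pick the most specific one."""
    winners = [s for s in specs if matches(s, line)]
    return max(winners, key=spec_score) if winners else None
```

    With this, "Su: BADSU %s (tty %s)" scores 19 and "%s: %s %s (tty %s)"
    scores 14, so the tie goes to the specific rule.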
    
    Remember - if you're doing "log parsing" (whatever that is) you
    do NOT WANT something that will accidentally grab BAD LOGIN messages
    and stuff them into BADSU fields because someone forgot to add
    a "[ \t]*" someplace. You want it to be readable and as specific
    as possible. That way if a message comes in that does not have a
    specific matching rule it'll fall out the bottom of the system
    and someone will know to write a rule for it. Otherwise you'll
    end your rules with a
    "^.*$"
    and you've got a very cool log parser that unfortunately parses
    garbage.
    
    Regexps force you to jump through hoops to match what you want.
    Regexps look like modem line noise, and it's harder to train a
            chimpanzee to write regexps than to use a simpler pattern-matching
            language.
    Regexps don't handle case insensitivity very well (depending on the
            version) which means your expressions gain additional complexity
            in order for you to accomplish something obvious that you need
            to do frequently (bad!).
    Regexps are not as portable as we'd like them to be - various versions
            crash, go non-linear, or lack features of other versions.
    Regexps' handling of newlines is graceless in the extreme.
    Regexps lack match scoring and rely only on the length of the match
            as the indicator (not the specificity of the template).
    
    In short, I think people turn to regexps because they mistakenly
    perceive them as "easier" than writing 2-3 pages of code to build
    an efficient matching language that suits the problem at hand. ;)
    I think it's also probably the case that a lot of people want to
    use regexps because perl offers a convenient (if slow, awkward, and
    overcomplex) way of prototyping something. That's true, but why
    write something that you know is going to have a shortened
    useful lifespan just because you know the tool isn't suited for
    the job? I never understood that logic. ;)
    
    Now let's look at advanced parsing techniques...
    
    If you're going to play with parsing, you _HAVE_ to read the
    Aho, Sethi, and Ullman book on compilers. The "Dragon" book.
    It's got some great explanations of recursive descent techniques
    and how to build parsers. Understanding what yacc does is key,
    because it's fast, flexible, and wonderful. ;) Regexps are
    basically mini recursive-descent parsers, FYI. Anyhow, what
    you want to be able to do (what I was working on before I had to
    quit...) is write a BNF notation for logs. It needn't be complex
    but consider something that looks like:
    
    datetime:
            "%d:%d:%d"
                    {
                            year=$1
                            month=$2
                            day=$3
                            hour="?"
                            min="?"
                    }
            "%d %d %d: %d %d"
                    {
                            year=$1
                            month=$2
                            day=$3
                            hour=$4
                            min=$5
                    }
    
    $datetime "badsu: %s (tty %s)"
    $datetime "sendmail blah blah"
    
    Ok, now what you've done is defined a node for "datetime" and
    made several higher level productions depend on it. Now you
    can specify either a tuned date-time format for your machine
    or have several and let the parser pick the one that fits the
    best. Access the fields in $datetime as: $datetime.hour or
    whatever.
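    One way to mechanize the $datetime reference (a hypothetical Python
    sketch; the production table and the qualified field names are my
    own simplification of the notation above, with the first alternative
    reduced to its three known fields):

```python
# Hypothetical table of named sub-productions: each name maps to a
# list of (alternative spec, field names) pairs.
productions = {
    "datetime": [
        ("%d:%d:%d", ["year", "month", "day"]),
        ("%d %d %d: %d %d", ["year", "month", "day", "hour", "min"]),
    ],
}

def expand(rule):
    """Expand every $name reference in `rule` into flat alternative
    specs, tracking qualified field names like datetime.year."""
    alts = [("", [])]
    for tok in rule.split(" "):
        new = []
        for spec, fields in alts:
            sep = " " if spec else ""
            if tok.startswith("$") and tok[1:] in productions:
                # fan out: one flat spec per alternative of the node
                for sub, sub_fields in productions[tok[1:]]:
                    new.append((spec + sep + sub,
                                fields + [tok[1:] + "." + f
                                          for f in sub_fields]))
            else:
                # a literal or %-directive token passes through as-is
                new.append((spec + sep + tok, fields))
        alts = new
    return alts
```

    Each flat spec can then be handed to the matcher, and the
    best-scoring alternative that fits the message wins.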
    
    Now, what's cool about this approach is that you're building
    your parse tree on the fly. The first thing (in the example
    above) after datetime is either a literal "sendmail" or "badsu:"
    so you can build a prefix unbalanced n-way decision tree that
    lets you match _anything_ against an arbitrary sized set of
    matching rules without ever having to check more than one character
    of mis-match. Fast? Oh, yeah. It'll take you more than 3 pages
    of code, but the data structures aren't hard and the
    value of being able to load as many rules as you like into the
    system without slowing it significantly makes it worth the effort.
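    A hypothetical Python sketch of that dispatch structure: rules are
    indexed in a character trie by their lowercased literal prefix
    (everything before the first %-directive), so a lookup walks the
    message once and stops at the first character that cannot extend
    any rule (names and representation are my own):

```python
def build_trie(rules):
    """Index specs by their literal prefix.  Nodes are nested dicts
    mapping a lowercased character to a child node; the rules whose
    prefix ends at a node are stored under the key None."""
    trie = {}
    for spec in rules:
        prefix = spec.split('%', 1)[0].lower()
        node = trie
        for ch in prefix:
            node = node.setdefault(ch, {})
        node.setdefault(None, []).append(spec)
    return trie

def candidates(trie, line):
    """Collect every rule whose literal prefix matches `line`,
    stopping at the first character that extends no rule."""
    found = []
    node = trie
    found.extend(node.get(None, []))
    for ch in line.lower():
        if ch not in node:
            break  # one character of mis-match and we're done
        node = node[ch]
        found.extend(node.get(None, []))
    return found
```

    Adding more rules only deepens or widens the trie; a lookup still
    costs one pass over the message prefix, which is why rule count
    barely affects speed.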
    
    I guess what I'm saying is, "please, guys, study the problem and
    think about it a bit before you just grab perl and start throwing
    regexps around."
    
    mjr.
    
    



    This archive was generated by hypermail 2b30 : Wed Jun 05 2002 - 08:10:18 PDT