Re: [logs] Re: Generic Log Message Parsing Tool

From: Adam Sah (asahat_private)
Date: Wed Jun 05 2002 - 07:54:29 PDT

  • Next message: Marcus J. Ranum: "Re: [logs] Re: Generic Log Message Parsing Tool"

    neat-- thanks for the detailed reply, and I agree!  Regexps are a huge pain,
       but they're the lowest common denominator, which is why v1.0 uses them...
       I'd *love* to see something better, for which we can add support in the
       LMS (or a converter tool), then run the sucker in parallel.
    
    one thing: LR(1) parsers are not a panacea-- yacc replaces the "scoring"
        problem with horrible shift-reduce conflicts between subtly conflicting
        rules.  Having "done it for dollars" writing commercial compilers, I can
        attest that lex+yacc is not so "wonderful" in practice.  Maybe logs won't
        exhibit these problems...
    
    adam
    
    
    > >I don't know if this helps, but the Addamark LMS uses perl5 regular
    > >   expressions to hack up the log into fields
    > 
    > My experience is that regexps are absolutely the wrong way to
    > go about log parsing. Consider that regexps _are_ a parser (just
    > a bad one!) and ask yourself what happens if, at the terminal
    > nodes of a recursive descent parser, you have _another_ recursive
    > descent parser. That just doesn't make any sense!!! The results
    > you get by using regexps aren't predictable enough. To "fix" that
    > you'll wind up doing match-scoring tricks (if you care) to
    > differentiate between:
    > "S[uU]: .*"
    > and
    > ".*"
    > 
    > So that's a contrived example but it brings out a couple of
    > important things I noticed about log parsing:
    >         - case doesn't matter: make everything case-insensitive
    >         - wildcards are useless: whitespace is everything
    > 
    > The approach I was working on relied on correct matching of
    > combinations of space and non-space. Regexps are really a pain
    > in the butt if you want to match on whitespace. You need to use
    > something like: " *" oops wait there could be "[ \t]*" and oops
    > you can't handle newlines right... Eeeew...  Regexps are a good
    > tool for simple searching - they're not a good tool for simple
    > parsing. They're really not a good tool for complex parsing.
    > 
    > So let's look at another approach and then I'll generalize it
    > to a broader approach... First, assume you're using a matching
    > language that looks like:
    > %s = matches string
    > %6s = matches 6 chars in a string
    > %d = matches number (including negative numbers)
    > %f = matches a floating point #
    > %w = matches whitespace explicitly (e.g.: %5w = matches 5 spaces
    >         or tabs)
    > (whitespace) matches whitespace - any quantity
    > newlines can be literally embedded in the match string, which
    > greatly helps readability...
    > 
    > and everything else is a literal. Everything is case insensitive.
    > If you think about implementing something like this in C you can
    > guess it's about 3 pages of code - 2 if you get tricky. It'll run
    > extremely fast.
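[A minimal sketch of such a matcher, in Python rather than the C the text imagines. The function name, the rule that %s stops at the next literal in the pattern, and the subset of specifiers supported are my assumptions, not part of the original design:]

```python
def match(pattern, line):
    """Match a line against a scanf-like pattern, case-insensitively.
    Supports %s (a run of non-space characters, stopping at the next
    literal in the pattern) and %d (an optionally negative integer);
    a space in the pattern matches any run of whitespace.
    Returns the captured fields, or None on a mismatch."""
    fields = []
    i = j = 0
    while i < len(pattern):
        c = pattern[i]
        if c == '%' and i + 1 < len(pattern):
            kind = pattern[i + 1]
            start = j
            if kind == 's':
                stop = pattern[i + 2] if i + 2 < len(pattern) else None
                while (j < len(line) and not line[j].isspace()
                       and (stop is None or line[j].lower() != stop.lower())):
                    j += 1
                if j == start:
                    return None          # %s must consume something
            elif kind == 'd':
                if j < len(line) and line[j] == '-':
                    j += 1
                digits = j
                while j < len(line) and line[j].isdigit():
                    j += 1
                if j == digits:
                    return None          # %d needs at least one digit
            else:
                return None              # specifier not in this sketch
            fields.append(line[start:j])
            i += 2
        elif c.isspace():
            if j >= len(line) or not line[j].isspace():
                return None
            while j < len(line) and line[j].isspace():
                j += 1
            while i < len(pattern) and pattern[i].isspace():
                i += 1
        else:
            if j >= len(line) or line[j].lower() != c.lower():
                return None
            i += 1
            j += 1
    return fields if j == len(line) else None
```

[Note the whole thing is one linear scan with no backtracking, which is where the speed claim comes from.]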
    > 
    > Now you match using a spec like:
    > "Su: BADSU %s (tty %s)"
    > which won't match everything but if your matching routine is
    > fast and efficient enough you can use 3 or 4 rules and it'll
    > still cost less than using a regexp AND be more reliable. Let
    > me explain reliability scoring (something regexps don't have) -
    > treat each item in the match string as a unique item worth
    > 1 point when you match it. So, the pattern:
    > "Su: BADSU %s (tty %s)"
    > contains 19 items. "Su:" is 3. "Su: " is 4 (the whitespace) etc.
    > If you mis-match anything the score drops to zero. If you get to
    > the end of the match string, the match string with the most
    > SPECIFIC match wins. Consider:
    > "%s: %s %s (tty %s)"
    > That's 14 items - even though it'd also match our BADSU message
    > it's not worth as much as the more specific match string. So it's
    > less likely to get picked.
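[The item counts in the text can be checked mechanically. A tiny Python helper (the name `score` is mine) that counts each literal character as one item and each %-specifier as one item reproduces them:]

```python
def score(pattern):
    """Specificity score: one point per literal character (spaces
    included), one point per %-specifier -- which is two characters
    ('%' plus a letter) but counts as a single item."""
    items = 0
    i = 0
    while i < len(pattern):
        if pattern[i] == '%' and i + 1 < len(pattern):
            i += 2                  # a %-specifier is one item
        else:
            i += 1                  # a literal character is one item
        items += 1
    return items
```

[This gives 19 for the BADSU pattern and 14 for the all-wildcard one, matching the figures quoted above, and 3 and 4 for "Su:" and "Su: " respectively.]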
    > 
    > Remember - if you're doing "log parsing" (whatever that is) you
    > do NOT WANT something that will accidentally grab BAD LOGIN messages
    > and stuff them into BADSU fields because someone forgot to add
    > a "[ \t]*" someplace. You want it to be readable and as specific
    > as possible. That way if a message comes in that does not have a
    > specific matching rule it'll fall out the bottom of the system
    > and someone will know to write a rule for it. Otherwise you'll
    > end your rules with a
    > "^.*$"
    > and you've got a very cool log parser that unfortunately parses
    > garbage.
    > 
    > Regexps force you to jump through hoops to match what you want.
    > Regexps look like modem line noise and it's harder to train a
    >         chimpanzee to write regexps than a simpler pattern matching
    >         language.
    > Regexps don't handle case insensitivity very well (depending on the
    >         version) which means your expressions gain additional complexity
    >         in order for you to accomplish something obvious that you need
    >         to do frequently (bad!).
    > Regexps are not as portable as we'd like them to be - various versions
    >         crash, go non-linear, or lack features of other versions.
    > Regexps' handling of newlines is graceless in the extreme.
    > Regexps lack match scoring and rely only on the length of the match
    >         as the indicator (not the match of the template).
    > 
    > In short, I think people turn to regexps because they mistakenly
    > perceive them as "easier" than writing 2-3 pages of code to build
    > an efficient matching language that suits the problem at hand. ;)
    > I think it's also probably the case that a lot of people want to
    > use regexps because perl offers a convenient (if slow, awkward, and
    > overcomplex) way of prototyping something. That's true, but why
    > write something that you know is going to have a shortened
    > useful lifespan just because you know the tool isn't suited for
    > the job? I never understood that logic. ;)
    > 
    > Now let's look at advanced parsing techniques...
    > 
    > If you're going to play with parsing, you _HAVE_ to read the
    > Aho, Sethi, Ullman book on compilers. The "Dragon" book.
    > It's got some great explanations of recursive descent techniques
    > and how to build parsers. Understanding what yacc does is key,
    > because it's fast, flexible, and wonderful. ;) Regexps are
    > basically mini recursive-descent parsers, FYI. Anyhow, what
    > you want to be able to do (what I was working on before I had to
    > quit...) is write a BNF notation for logs. It needn't be complex
    > but consider something that looks like:
    > 
    > datetime:
    >         "%d:%d:%d"
    >                 {
    >                         year=$1
    >                         month=$2
    >                         day=$3
    >                         hour="?"
    >                         min="?"
    >                 }
    >         "%d %d %d: %d %d"
    >                 {
    >                         year=$1
    >                         month=$2
    >                         day=$3
    >                         hour=$4
    >                         min=$5
    >                 }
    > 
    > $datetime "badsu: %s (tty %s)"
    > $datetime "sendmail blah blah"
    > 
    > Ok, now what you've done is defined a node for "datetime" and
    > made several higher level productions depend on it. Now you
    > can specify either a tuned date-time format for your machine
    > or have several and let the parser pick the one that fits the
    > best. Access the fields in $datetime as: $datetime.hour or
    > whatever.
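[One way to sketch the production mechanism in Python. The `expand` helper and its data layout are my invention; a real implementation would also carry along the field-binding actions shown in the grammar above, not just the flat patterns:]

```python
# Named productions: each name maps to its alternative patterns.
productions = {
    "datetime": ["%d:%d:%d", "%d %d %d: %d %d"],
}

# Higher-level rules that reference a production by $name.
rules = [
    "$datetime badsu: %s (tty %s)",
    "$datetime sendmail blah blah",
]

def expand(rule, productions):
    """Expand every $name reference into each alternative of that
    production, yielding flat scanf-style patterns to try in turn;
    the best-scoring one that matches wins."""
    out = [rule]
    for name, alts in productions.items():
        ref = "$" + name
        expanded = []
        for r in out:
            if ref in r:
                expanded.extend(r.replace(ref, alt) for alt in alts)
            else:
                expanded.append(r)
        out = expanded
    return out
```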
    > 
    > Now, what's cool about this approach is that you're building
    > your parse tree on the fly. The first thing (in the example
    > above) after datetime is either a literal "sendmail" or "badsu:"
    > so you can build a prefix unbalanced n-way decision tree that
    > lets you match _anything_ against an arbitrary sized set of
    > matching rules without ever having to check more than on character
    > of mis-match. Fast? Oh, yeah. It'll take you more than 3 pages
    > of code, though, but the data structures aren't hard and the
    > value of being able to load as many rules as you like into the
    > system without slowing it significantly makes it worth the effort.
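[A sketch of that unbalanced n-way decision tree in Python, as a dict-of-dicts; the function names and the None-terminal convention are mine:]

```python
def build_tree(rules):
    """rules: (literal_prefix, rule_id) pairs. Builds an unbalanced
    n-way decision tree keyed on successive lowercased characters."""
    root = {}
    for prefix, rule_id in rules:
        node = root
        for ch in prefix.lower():
            node = node.setdefault(ch, {})
        node[None] = rule_id        # None marks a complete prefix
    return root

def candidates(tree, line):
    """Walk the line one character at a time, collecting every rule
    whose literal prefix matches; a miss bails out at the first
    mismatching character, however many rules are loaded."""
    found, node = [], tree
    for ch in line.lower():
        if None in node:
            found.append(node[None])
        if ch not in node:
            return found
        node = node[ch]
    if None in node:
        found.append(node[None])
    return found
```

[The point of the structure is that lookup cost depends on the length of the matched prefix, not on the number of rules, so the rule set can grow freely.]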
    > 
    > I guess what I'm saying is, "please, guys, study the problem and
    > think about it a bit before you just grab perl and start throwing
    > regexps around."
    > 
    > mjr.
    > 
    
    
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: loganalysis-unsubscribeat_private
    For additional commands, e-mail: loganalysis-helpat_private
    



    This archive was generated by hypermail 2b30 : Wed Jun 05 2002 - 08:09:28 PDT