neat-- thanks for the detailed reply, and I agree! Regexps are a huge
pain, but they're the lowest common denominator, which is why v1.0 uses
them... I'd *love* to see something better, for which we can add support
in the LMS (or a converter tool), then run the sucker in parallel.

One thing: LR(1) parsers are not a panacea-- yacc replaces the "scoring"
problem with horrible shift-reduce conflicts between subtly conflicting
rules. Having "done it for dollars" writing commercial compilers, I can
attest that lex+yacc is not so "wonderful" in practice. Maybe logs won't
exhibit these problems...

adam

> > I don't know if this helps, but the Addamark LMS uses perl5 regular
> > expressions to hack up the log into fields
>
> My experience is that regexps are absolutely the wrong way to go
> about log parsing. Consider that regexps _are_ a parser (just a bad
> one!) and ask yourself what happens if, at the terminal nodes of a
> recursive descent parser, you have _another_ recursive descent
> parser. That just doesn't make any sense! The results you get by
> using regexps aren't predictable enough. To "fix" that you'll wind up
> doing match-scoring tricks (if you care) to differentiate between:
>         "S[uU]: .*"
> and
>         ".*"
>
> That's a contrived example, but it brings out a couple of important
> things I noticed about log parsing:
> - case doesn't matter: make everything case-insensitive
> - wildcards are useless: whitespace is everything
>
> The approach I was working on relied on correct matching of
> combinations of space and non-space. Regexps are really a pain in the
> butt if you want to match on whitespace. You need to use something
> like " *"-- oops, wait, it could be "[ \t]*", and oops, you can't
> handle newlines right... Eeeew... Regexps are a good tool for simple
> searching. They're not a good tool for simple parsing, and they're
> really not a good tool for complex parsing.
>
> So let's look at another approach, and then I'll generalize it to a
> broader one... First, assume you're using a matching language that
> looks like:
>
>         %s  = matches a string (a run of non-whitespace)
>         %6s = matches 6 chars in a string
>         %d  = matches a number (including negative numbers)
>         %f  = matches a floating point number
>         %w  = matches whitespace explicitly (e.g.: %5w matches 5
>               spaces or tabs)
>         (whitespace) matches whitespace - any quantity
>
> Newlines can be embedded literally in the match string, which greatly
> helps readability. Everything else is a literal, and everything is
> case-insensitive. If you think about implementing something like this
> in C, you can guess it's about 3 pages of code - 2 if you get tricky.
> It'll run extremely fast.
>
> Now you match using a spec like:
>         "Su: BADSU %s (tty %s)"
> which won't match everything, but if your matching routine is fast
> and efficient enough you can use 3 or 4 rules and it'll still cost
> less than using a regexp AND be more reliable. Let me explain
> reliability scoring (something regexps don't have): treat each item
> in the match string as a unique item worth 1 point when you match it.
> So the pattern:
>         "Su: BADSU %s (tty %s)"
> contains 19 items. "Su:" is 3; "Su: " is 4 (the whitespace counts);
> etc. If you mis-match anything, the score drops to zero. If you get
> to the end of the match string, the match string with the most
> SPECIFIC match wins. Consider:
>         "%s: %s %s (tty %s)"
> That's 14 items - even though it'd also match our BADSU message, it's
> not worth as much as the more specific match string, so it's less
> likely to get picked.
>
> Remember - if you're doing "log parsing" (whatever that is) you do
> NOT WANT something that will accidentally grab BAD LOGIN messages and
> stuff them into BADSU fields because someone forgot to add a "[ \t]*"
> someplace. You want it to be readable and as specific as possible.
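A minimal sketch of such a matcher with the item-per-point scoring described above -- names are mine and only a subset of the directives (%s, %d, literal whitespace) is handled; as one extra assumption, %s here stops at the next pattern literal so that "(tty %s)" can still close its paren:

```c
#include <ctype.h>

/* Hypothetical sketch, not code from the post. Literals match
 * case-insensitively; every matched literal character, whitespace
 * run, or %-directive is worth 1 point; any mismatch scores 0.
 * So "Su: BADSU %s (tty %s)" scores 19 and the vaguer
 * "%s: %s %s (tty %s)" scores 14, as in the text above. */
int match_score(const char *pat, const char *line)
{
    int score = 0;

    while (*pat) {
        if (pat[0] == '%' && pat[1] == 's') {
            /* %s: eat non-whitespace, stopping early at the next
             * literal in the pattern (if any) */
            char delim = (pat[2] && pat[2] != '%' && pat[2] != ' ')
                             ? pat[2] : '\0';
            const char *start = line;
            while (*line && !isspace((unsigned char)*line) &&
                   tolower((unsigned char)*line) !=
                   tolower((unsigned char)delim))
                line++;
            if (line == start)
                return 0;             /* %s must consume something */
            score++;
            pat += 2;
        } else if (pat[0] == '%' && pat[1] == 'd') {
            /* %d: optionally signed run of digits */
            const char *digs;
            if (*line == '-')
                line++;
            digs = line;
            while (isdigit((unsigned char)*line))
                line++;
            if (line == digs)
                return 0;             /* need at least one digit */
            score++;
            pat += 2;
        } else if (*pat == ' ') {
            /* pattern whitespace matches any run of spaces/tabs */
            if (*line != ' ' && *line != '\t')
                return 0;
            while (*line == ' ' || *line == '\t')
                line++;
            score++;
            pat++;
        } else {
            /* case-insensitive literal */
            if (tolower((unsigned char)*pat) !=
                tolower((unsigned char)*line))
                return 0;
            score++;
            pat++;
            line++;
        }
    }
    return *line == '\0' ? score : 0; /* must consume the whole line */
}
```

Run every rule over a line, keep the highest non-zero score, and the most specific rule wins -- exactly the tie-break the two patterns above need.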
> That way, if a message comes in that does not have a specific
> matching rule, it'll fall out the bottom of the system and someone
> will know to write a rule for it. Otherwise you'll end your rules
> with a:
>         "^.*$"
> and you've got a very cool log parser that unfortunately parses
> garbage.
>
> Regexps force you to jump through hoops to match what you want.
> Regexps look like modem line noise, and it's harder to train a
> chimpanzee to write regexps than a simpler pattern-matching language.
> Regexps don't handle case insensitivity very well (depending on the
> version), which means your expressions gain additional complexity in
> order for you to accomplish something obvious that you need to do
> frequently (bad!).
> Regexps are not as portable as we'd like them to be - various
> versions crash, go non-linear, or lack features of other versions.
> Regexps' handling of newlines is graceless in the extreme.
> Regexps lack match scoring and rely only on the length of the match
> as the indicator (not the fit of the template).
>
> In short, I think people turn to regexps because they mistakenly
> perceive them as "easier" than writing 2-3 pages of code to build an
> efficient matching language that suits the problem at hand. ;)
> I think it's also probably the case that a lot of people want to use
> regexps because perl offers a convenient (if slow, awkward, and
> overcomplex) way of prototyping something. That's true, but why write
> something you know is going to have a shortened useful lifespan, just
> because you know the tool isn't suited for the job? I never
> understood that logic. ;)
>
> Now let's look at advanced parsing techniques...
>
> If you're going to play with parsing, you _HAVE_ to read the Aho,
> Sethi, and Ullman book on compilers - the "Dragon" book. It's got
> some great explanations of recursive descent techniques and how to
> build parsers. Understanding what yacc does is key, because it's
> fast, flexible, and wonderful.
> ;) Regexps are basically mini recursive-descent parsers, FYI.
> Anyhow, what you want to be able to do (what I was working on before
> I had to quit...) is write a BNF notation for logs. It needn't be
> complex, but consider something that looks like:
>
>         datetime:
>                 "%d:%d:%d"
>                 {
>                         year=$1
>                         month=$2
>                         day=$3
>                         hour="?"
>                         min="?"
>                 }
>                 "%d %d %d: %d %d"
>                 {
>                         year=$1
>                         month=$2
>                         day=$3
>                         hour=$4
>                         min=$5
>                 }
>
>         $datetime "badsu: %s (tty %s)"
>         $datetime "sendmail blah blah"
>
> OK, now what you've done is define a node for "datetime" and make
> several higher-level productions depend on it. Now you can specify
> either a tuned date-time format for your machine, or have several and
> let the parser pick the one that fits best. Access the fields in
> $datetime as $datetime.hour, or whatever.
>
> Now, what's cool about this approach is that you're building your
> parse tree on the fly. The first thing (in the example above) after
> datetime is either a literal "badsu:" or "sendmail", so you can build
> a prefix unbalanced n-way decision tree that lets you match
> _anything_ against an arbitrarily sized set of matching rules without
> ever having to check more than one character of mis-match. Fast? Oh,
> yeah. It'll take you more than 3 pages of code, though, but the data
> structures aren't hard, and the value of being able to load as many
> rules as you like into the system without slowing it significantly
> makes it worth the effort.
>
> I guess what I'm saying is, "please, guys, study the problem and
> think about it a bit before you just grab perl and start throwing
> regexps around."
>
> mjr.

---------------------------------------------------------------------
To unsubscribe, e-mail: loganalysis-unsubscribeat_private
For additional commands, e-mail: loganalysis-helpat_private
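The prefix unbalanced n-way decision tree described above can be sketched roughly as follows -- a hypothetical draft, not code from the thread, keying each rule on the literal text before its first % directive:

```c
#include <ctype.h>
#include <stdlib.h>

/* Node of the unbalanced n-way prefix tree: one branch per
 * (lowercased) byte, so dispatch stops after at most one character
 * of mismatch on any path. */
struct node {
    struct node *kids[256];  /* next node, indexed by lowercased byte */
    int rule;                /* rule id if a prefix ends here, else -1 */
};

static struct node *new_node(void)
{
    struct node *n = calloc(1, sizeof *n);
    if (!n)
        abort();
    n->rule = -1;
    return n;
}

/* Register pattern `id`, keyed on its literal prefix (the text
 * before the first % directive). */
static void trie_add(struct node *root, const char *pat, int id)
{
    struct node *n = root;

    for (; *pat && *pat != '%'; pat++) {
        unsigned char c = tolower((unsigned char)*pat);
        if (!n->kids[c])
            n->kids[c] = new_node();
        n = n->kids[c];
    }
    n->rule = id;
}

/* Walk the line once, remembering the deepest (most specific) rule
 * prefix seen; stop at the first byte with no branch. Returns -1
 * when no rule matches, so the line "falls out the bottom". */
static int trie_dispatch(const struct node *root, const char *line)
{
    const struct node *n = root;
    int best = root->rule;

    for (; *line && n; line++) {
        n = n->kids[tolower((unsigned char)*line)];
        if (n && n->rule != -1)
            best = n->rule;
    }
    return best;
}
```

The tree only prunes: in a full parser you'd then run the candidate rule's complete match (scoring included) to confirm, but loading thousands of rules no longer slows the per-line cost.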
This archive was generated by hypermail 2b30 : Wed Jun 05 2002 - 08:09:28 PDT