Re: [logs] Re: Generic Log Message Parsing Tool

From: Marcus J. Ranum (mjrat_private)
Date: Wed Jun 05 2002 - 06:08:44 PDT


    >I don't know if this helps, but the Addamark LMS uses perl5 regular
    >   expressions to hack up the log into fields
    
    My experience is that regexps are absolutely the wrong way to
    go about log parsing. Consider that regexps _are_ a parser (just
    a bad one!) and ask yourself what happens if, at the terminal
    nodes of a recursive descent parser, you have _another_ recursive
    descent parser. That just doesn't make any sense!!! The results
    you get by using regexps aren't predictable enough. To "fix" that
    you'll wind up doing match-scoring tricks (if you care) to
    differentiate between:
    "S[uU]: .*"
    and
    ".*"
    
    So that's a contrived example but it brings out a couple of
    important things I noticed about log parsing:
            - case doesn't matter: make everything case-insensitive
            - wildcards are useless: whitespace is everything
    
    The approach I was working on relied on correct matching of
    combinations of space and non-space. Regexps are really a pain
    in the butt if you want to match on whitespace. You need to use
    something like " *"... oops, wait, there could be tabs, so "[ \t]*"...
    and oops, you can't handle newlines right... Eeeew... Regexps are a good
    tool for simple searching - they're not a good tool for simple
    parsing. They're really not a good tool for complex parsing.
    
    So let's look at another approach and then I'll generalize it
    to a broader approach... First, assume you're using a matching
    language that looks like:
    %s = matches string
    %6s = matches 6 chars in a string
    %d = matches number (including negative numbers)
    %f = matches a floating point #
    %w = matches whitespace explicitly (e.g.: %5w = matches 5 spaces
            or tabs)
    (whitespace) matches whitespace - any quantity
    newlines can be embedded literally in the match string, which
    greatly helps readability...
    
    and everything else is a literal. Everything is case insensitive.
    If you think about implementing something like this in C you can
    guess it's about 3 pages of code - 2 if you get tricky. It'll run
    extremely fast.
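    A minimal Python sketch of such a matcher (my own illustration, not
    the C code described above; only %s, %d, literals, and whitespace
    are handled, and all names here are invented):

```python
def match_spec(spec, line):
    """Return the list of captured fields if `line` matches `spec`,
    else None.  Literals match case-insensitively; a space in the
    spec matches any run of whitespace."""
    fields = []
    i = j = 0  # i indexes spec, j indexes line
    while i < len(spec):
        c = spec[i]
        if c == '%' and spec[i + 1:i + 2] == 's':
            # %s: capture non-whitespace, stopping early at the
            # literal character that follows the directive (if any)
            stop = spec[i + 2].lower() if i + 2 < len(spec) else None
            start = j
            while (j < len(line) and not line[j].isspace()
                   and line[j].lower() != stop):
                j += 1
            if j == start:
                return None
            fields.append(line[start:j])
            i += 2
        elif c == '%' and spec[i + 1:i + 2] == 'd':
            # %d: capture an (optionally negative) integer
            start = j
            if j < len(line) and line[j] == '-':
                j += 1
            while j < len(line) and line[j].isdigit():
                j += 1
            if not line[start:j].lstrip('-').isdigit():
                return None
            fields.append(int(line[start:j]))
            i += 2
        elif c == ' ':
            # a spec space matches any run of spaces or tabs
            if j >= len(line) or not line[j].isspace():
                return None
            while j < len(line) and line[j].isspace():
                j += 1
            i += 1
        else:
            # everything else is a case-insensitive literal
            if j >= len(line) or line[j].lower() != c.lower():
                return None
            i += 1
            j += 1
    return fields if j == len(line) else None
```

    A mis-match anywhere returns None immediately, which is what makes
    trying 3 or 4 specs per message cheap.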
    
    Now you match using a spec like:
    "Su: BADSU %s (tty %s)"
    which won't match everything but if your matching routine is
    fast and efficient enough you can use 3 or 4 rules and it'll
    still cost less than using a regexp AND be more reliable. Let
    me explain reliability scoring (something regexps don't have) -
    treat each item in the match string as a unique item worth
    1 point when you match it. So, the pattern:
    "Su: BADSU %s (tty %s)"
    contains 19 items: "Su:" is 3 items, "Su: " is 4 (counting the
    whitespace), and so on.
    If you mis-match anything the score drops to zero. If you get to
    the end of the match string, the match string with the most
    SPECIFIC match wins. Consider:
    "%s: %s %s (tty %s)"
    That's 14 items - even though it'd also match our BADSU message
    it's not worth as much as the more specific match string. So it's
    less likely to get picked.
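    The scoring rule above is easy to mechanize. A hypothetical Python
    sketch counting items the same way (each literal character,
    whitespace included, is 1 point; each bare %-directive is 1 point;
    counted directives like %6s are not handled):

```python
def spec_score(spec):
    """Score a match spec: 1 point per literal character (whitespace
    included) and 1 point per %-directive, so a more specific spec
    scores higher."""
    score, i = 0, 0
    while i < len(spec):
        if spec[i] == '%' and spec[i + 1:i + 2] in ('s', 'd', 'f', 'w'):
            i += 2  # a directive counts as a single item
        else:
            i += 1  # every literal character is one item
        score += 1
    return score

def best_match(specs, line, matches):
    """Of the specs whose matcher accepted `line` (`matches` is any
    predicate taking (spec, line)), pick the most specific one."""
    winners = [s for s in specs if matches(s, line)]
    return max(winners, key=spec_score) if winners else None
```

    With this, "Su: BADSU %s (tty %s)" scores 19 and "%s: %s %s (tty %s)"
    scores 14, so the tie goes to the specific rule.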
    
    Remember - if you're doing "log parsing" (whatever that is) you
    do NOT WANT something that will accidentally grab BAD LOGIN messages
    and stuff them into BADSU fields because someone forgot to add
    a "[ \t]*" someplace. You want it to be readable and as specific
    as possible. That way if a message comes in that does not have a
    specific matching rule it'll fall out the bottom of the system
    and someone will know to write a rule for it. Otherwise you'll
    end your rules with a
    "^.*$"
    and you've got a very cool log parser that unfortunately parses
    garbage.
    
    Regexps force you to jump through hoops to match what you want.
    Regexps look like modem line noise, and it's harder to train a
            chimpanzee to write regexps than to use a simpler pattern-matching
            language.
    Regexps don't handle case insensitivity very well (depending on the
            version) which means your expressions gain additional complexity
            in order for you to accomplish something obvious that you need
            to do frequently (bad!).
    Regexps are not as portable as we'd like them to be - various versions
            crash, go non-linear, or lack features of other versions.
    Regexps' handling of newlines is graceless in the extreme.
    Regexps lack match scoring and rely only on the length of the match
            as the indicator (not the specificity of the template).
    
    In short, I think people turn to regexps because they mistakenly
    perceive them as "easier" than writing 2-3 pages of code to build
    an efficient matching language that suits the problem at hand. ;)
    I think it's also probably the case that a lot of people want to
    use regexps because perl offers a convenient (if slow, awkward, and
    overcomplex) way of prototyping something. That's true, but why
    write something that you know is going to have a shortened
    useful lifespan just because you know the tool isn't suited for
    the job? I never understood that logic. ;)
    
    Now let's look at advanced parsing techniques...
    
    If you're going to play with parsing, you _HAVE_ to read the
    Aho, Sethi, and Ullman book on compilers. The "Dragon" book.
    It's got some great explanations of recursive descent techniques
    and how to build parsers. Understanding what yacc does is key,
    because it's fast, flexible, and wonderful. ;) Regexps are
    basically mini recursive-descent parsers, FYI. Anyhow, what
    you want to be able to do (what I was working on before I had to
    quit...) is write a BNF notation for logs. It needn't be complex
    but consider something that looks like:
    
    datetime:
            "%d:%d:%d"
                    {
                            year=$1
                            month=$2
                            day=$3
                            hour="?"
                            min="?"
                    }
            "%d %d %d: %d %d"
                    {
                            year=$1
                            month=$2
                            day=$3
                            hour=$4
                            min=$5
                    }
    
    $datetime "badsu: %s (tty %s)"
    $datetime "sendmail blah blah"
    
    Ok, now what you've done is defined a node for "datetime" and
    made several higher level productions depend on it. Now you
    can specify either a tuned date-time format for your machine
    or have several and let the parser pick the one that fits the
    best. Access the fields in $datetime as: $datetime.hour or
    whatever.
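    One way to mechanize the $datetime reference (a hypothetical Python
    sketch; the production table and the qualified field names are my
    own simplification of the notation above, with the first alternative
    reduced to its three known fields):

```python
# Hypothetical table of named sub-productions: each name maps to a
# list of (alternative spec, field names) pairs.
productions = {
    "datetime": [
        ("%d:%d:%d", ["year", "month", "day"]),
        ("%d %d %d: %d %d", ["year", "month", "day", "hour", "min"]),
    ],
}

def expand(rule):
    """Expand every $name reference in `rule` into flat alternative
    specs, tracking qualified field names like datetime.year."""
    alts = [("", [])]
    for tok in rule.split(" "):
        new = []
        for spec, fields in alts:
            sep = " " if spec else ""
            if tok.startswith("$") and tok[1:] in productions:
                # fan out: one flat spec per alternative of the node
                for sub, sub_fields in productions[tok[1:]]:
                    new.append((spec + sep + sub,
                                fields + [tok[1:] + "." + f
                                          for f in sub_fields]))
            else:
                # a literal or %-directive token passes through as-is
                new.append((spec + sep + tok, fields))
        alts = new
    return alts
```

    Each flat spec can then be handed to the matcher, and the
    best-scoring alternative that fits the message wins.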
    
    Now, what's cool about this approach is that you're building
    your parse tree on the fly. The first thing (in the example
    above) after datetime is either a literal "sendmail" or "badsu:"
    so you can build a prefix unbalanced n-way decision tree that
    lets you match _anything_ against an arbitrary sized set of
    matching rules without ever having to check more than one character
    of mis-match. Fast? Oh, yeah. It'll take you more than 3 pages
    of code, but the data structures aren't hard and the
    value of being able to load as many rules as you like into the
    system without slowing it significantly makes it worth the effort.
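    A hypothetical Python sketch of that dispatch structure: rules are
    indexed in a character trie by their lowercased literal prefix
    (everything before the first %-directive), so a lookup walks the
    message once and stops at the first character that cannot extend
    any rule (names and representation are my own):

```python
def build_trie(rules):
    """Index specs by their literal prefix.  Nodes are nested dicts
    mapping a lowercased character to a child node; the rules whose
    prefix ends at a node are stored under the key None."""
    trie = {}
    for spec in rules:
        prefix = spec.split('%', 1)[0].lower()
        node = trie
        for ch in prefix:
            node = node.setdefault(ch, {})
        node.setdefault(None, []).append(spec)
    return trie

def candidates(trie, line):
    """Collect every rule whose literal prefix matches `line`,
    stopping at the first character that extends no rule."""
    found = []
    node = trie
    found.extend(node.get(None, []))
    for ch in line.lower():
        if ch not in node:
            break  # one character of mis-match and we're done
        node = node[ch]
        found.extend(node.get(None, []))
    return found
```

    Adding more rules only deepens or widens the trie; a lookup still
    costs one pass over the message prefix, which is why rule count
    barely affects speed.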
    
    I guess what I'm saying is, "please, guys, study the problem and
    think about it a bit before you just grab perl and start throwing
    regexps around."
    
    mjr.
    
    



    This archive was generated by hypermail 2b30 : Wed Jun 05 2002 - 08:10:18 PDT