Re: [logs] Re: Generic Log Message Parsing Tool

From: Sweth Chandramouli (loganalysisat_private)
Date: Wed Jun 05 2002 - 10:42:49 PDT


    	[This is getting increasingly off-topic, except inasmuch
    as it will probably affect the final outcome of any generic log parsing
    effort; this will be my last post to the list on this subthread, then.]
    
    On Wed, Jun 05, 2002 at 09:08:44AM -0400, Marcus J. Ranum wrote:
    > >I don't know if this helps, but the Addamark LMS uses perl5 regular
    > >   expressions to hack up the log into fields
    > 
    > My experience is that regexps are absolutely the wrong way to
    > go about log parsing.
    	Agreed, to a certain extent.  Actual regexes (that is,
    ones that are "regular", in the set theory sense that gave them their
    name) _can't_ parse log messages, and while extended regexes like perl
    provides _can_ do it, few enough people really understand how they work
    that, beyond a certain level of complexity, they invariably get them
    wrong in one way or another.  That said, I don't have the same religious
    objections to regexes that you seem to, at least not with a robust [1]
    regex implementation (like Perl's, or the Java ORO library).
    
    > The approach I was working on relied on correct matching of
    > combinations of space and non-space. Regexps are really a pain
    > in the butt if you want to match on whitespace. You need to use
    > something like: " *" oops wait there could be "[ \t]*" and oops
    > you can't handle newlines right... Eeeew...  Regexps are a good
    > tool for simple searching - they're not a good tool for simple
    > parsing.
    	Here's where we disagree most.  For simple parsing, I
    think there's little better than a well-understood regex engine.  The
    two I mentioned earlier are stellar for things like you are describing,
    with macros like "\s" to match whitespace (and a flag to allow that to
    include newlines if dealing with a multiline pattern space), and it's
    trivial to set case-insensitivity for either an entire regex
    ("/your_regex_here/i") or a small portion of it
    ("/your_(?i:regex)_here/").
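
    A minimal sketch of the whitespace and case-insensitivity points,
    using Python's re module as a stand-in for the Perl and ORO engines
    (an assumption; the \s and (?i:...) syntax shown is shared with Perl).
    The log line is hypothetical:

```python
import re

# Hypothetical log line mixing spaces and tabs between fields.
line = "Jun  5 10:42:49\thost1   SSHD[123]: Accepted password"

# \s+ matches any run of whitespace (spaces, tabs, newlines),
# so no "[ \t]*" juggling is needed to split the leading fields.
fields = re.split(r"\s+", line, maxsplit=4)

# Case-insensitive for the whole pattern, like /sshd/i in Perl.
whole = re.search(r"sshd", line, re.IGNORECASE)

# Case-insensitive only for the "sshd" portion, like (?i:...) in Perl;
# "Accepted" must still match exactly as written.
scoped = re.search(r"(?i:sshd)\[\d+\]: Accepted", line)
```

    Here fields holds the month, day, time, and host, followed by the
    unsplit remainder of the message.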
    
    > They're really not a good tool for complex parsing.
    	This I'd agree with, but only because my definition of
    complex parsing is probably more complex than that of the average bear.
    I'd definitely include the parsing of an entire arbitrary log message in
    that definition, but I wouldn't include the parsing of a simple component
    of a known log message.  (I understand your argument that putting a
    recursive parser at the bottom of a recursive parse tree is rather
    painful, but when the pattern being matched is simple (for some
    definition of simple that I won't provide but will cop out and say that
    people will, like pornography, know when they see it), the issues become
    nonexistent.)
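
    As an illustration of a "simple component of a known log message",
    here is a sketch that parses only a classic syslog-style timestamp.
    The format, the TIMESTAMP name, and parse_timestamp are assumptions
    made for the example, and Python's re stands in for the Perl engine:

```python
import re

# Assumed format: the classic syslog "Mon dd hh:mm:ss" timestamp.
TIMESTAMP = re.compile(
    r"(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+"
    r"(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})"
)

def parse_timestamp(text):
    """Return the timestamp fields at the start of text, or None."""
    m = TIMESTAMP.match(text)
    return m.groupdict() if m else None
```

    The regex handles one small, well-understood field; the rest of the
    message would be left to whatever larger parser sits above it.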
    
    > still cost less than using a regexp AND be more reliable. Let
    > me explain reliability scoring (something regexps don't have) -
    	Scoring is a huge plus, but I think it's orthogonal to
    what I'm proposing as a first step; there's nothing to say that a
    person implementing a particular log parser couldn't, given the log
    message grammar repository that I'm now proposing, choose to provide
    scoring hooks for how well a node matches a particular portion of the
    message being parsed.  (Again, things like this are trivial with the
    Parse::RecDescent module in Perl, and while not necessarily trivial,
    they _are_ feasible with a pure regex implementation as well.  (Let me
    reiterate that I am NOT advocating a pure regex implementation.
    I've seen attempts at that, and they make my stomach churn.))
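
    One way a scoring hook might look, sketched in Python rather than
    with Parse::RecDescent.  The templates, node names, and the scoring
    rule (fraction of nodes matched in order) are all hypothetical and
    are not the scheme described in the quoted message:

```python
import re

# Hypothetical templates: each is an ordered list of (node name, pattern).
TEMPLATES = {
    "sshd_accept": [
        ("daemon", r"sshd\[\d+\]:\s+"),
        ("verb", r"Accepted\s+"),
        ("method", r"(password|publickey)\s+"),
        ("user", r"for\s+\S+"),
    ],
    "sshd_fail": [
        ("daemon", r"sshd\[\d+\]:\s+"),
        ("verb", r"Failed\s+"),
        ("method", r"(password|publickey)\s+"),
        ("user", r"for\s+\S+"),
    ],
}

def score(nodes, message):
    """Fraction of nodes that match consecutively from the start of message."""
    pos = 0
    matched = 0
    for _name, pattern in nodes:
        m = re.compile(pattern).match(message, pos)
        if m is None:
            break  # stop at the first node that fails to match
        matched += 1
        pos = m.end()
    return matched / len(nodes)

def best_template(message):
    """Return (template name, score) for the highest-scoring template."""
    scored = ((name, score(nodes, message)) for name, nodes in TEMPLATES.items())
    return max(scored, key=lambda pair: pair[1])
```

    A message that matches only the leading nodes of a template still
    gets a partial score, which is the information a plain match/no-match
    regex throws away.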
    
    
    > Regexps force you to jump through hoops to match what you want.
    	Any language does that; it all depends on what hoops you
    are accustomed to jumping through.
    
    > Regexps look like modem line noise and it's harder to train a
    >         chimpanzee to write regexps than a simpler pattern matching
    >         language.
    	This, sadly, is true.  It's possible to have more readable
    regexes in perl using its m//x syntax (which I won't go into; man
    perlre if you are interested), but complex regexes are very definitely
    reader-unfriendly.  Again, I would only advocate using regexes to parse
    small portions of log messages.
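
    For a flavor of what the extended syntax buys, here is a sketch using
    Python's re.VERBOSE flag, which plays the same role as Perl's /x
    (whitespace and comments inside the pattern are ignored).  The
    SYSLOG_TAG pattern and the sample line are assumptions for the example:

```python
import re

# Whitespace and "#" comments inside the pattern are ignored,
# so each piece of the regex can be laid out and explained.
SYSLOG_TAG = re.compile(r"""
    (?P<program> [A-Za-z_][\w.-]* )   # program name, e.g. sshd
    \[ (?P<pid> \d+ ) \]              # process id in square brackets
    :                                 # literal colon ends the tag
""", re.VERBOSE)

m = SYSLOG_TAG.search("Jun  5 10:42:49 host1 sshd[123]: Accepted password")
```

    The compact equivalent is ([A-Za-z_][\w.-]*)\[(\d+)\]: which does
    the same work but explains none of it.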
    
    > Regexps don't handle case insensitivity very well (depending on the
    >         version) which means your expressions gain additional complexity
    >         in order for you to accomplish something obvious that you need
    >         to do frequently (bad!).
    	Most implementations people would use nowadays would handle
    this fine.
    
    > Regexps are not as portable as we'd like them to be - various versions
    >         crash, go non-linear, or lack features of other versions.
    	Again, this is true of most languages; it's possible to
    write a regex that won't fail to match before the heat death of the
    universe, but it's also possible to write exponential growth functions
    in other languages.  It all depends on how well you understand the tool
    in question.  (Sadly, I'd say that most people who implement "parsers"
    grok neither regexes nor parse trees well enough to do either well.)
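
    A small demonstration of the "heat death" case on a backtracking
    engine, here Python's re (an assumption; Perl's engine is of the same
    backtracking family).  The input is kept deliberately short so the
    slow pattern still finishes quickly:

```python
import re
import time

# Nested quantifiers: on a failing input the backtracking engine tries
# exponentially many ways to split the run of "a"s between the groups.
SLOW = re.compile(r"^(a+)+$")
# Same language, one quantifier: fails after a single linear scan.
FAST = re.compile(r"^a+$")

def time_match(pattern, text):
    """Return (matched?, seconds taken) for one match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start

# Each extra "a" roughly doubles the time SLOW needs to give up.
text = "a" * 18 + "b"
slow_result = time_match(SLOW, text)
fast_result = time_match(FAST, text)
```

    Both patterns accept exactly the same strings; only the way the
    pattern is written decides whether failure is cheap or ruinous.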
    
    > Regexps handling of newlines is graceless in the extreme.
    	Again, this is implementation specific.
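
    As one concrete implementation, this is how Python's re module
    handles newlines; re.DOTALL and re.MULTILINE correspond to Perl's /s
    and /m flags.  The multi-line record is hypothetical:

```python
import re

# Hypothetical multi-line log record with a continuation line.
record = "ERROR disk full\n  on /var\nINFO ok"

# By default "." stops at a newline.
default_dot = re.search(r"ERROR.*", record).group(0)

# re.DOTALL (Perl's /s) lets "." cross newlines.
dotall = re.search(r"ERROR.*/var", record, re.DOTALL).group(0)

# re.MULTILINE (Perl's /m) lets "^" match at the start of every line.
levels = re.findall(r"^([A-Z]+)\b", record, re.MULTILINE)

# In this engine \s matches a newline regardless of flags.
joined = re.sub(r"\s+", " ", record)
```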
    
    > Regexps lack match scoring and rely only on the length of the match
    >         as the indicator (not the match of the template).
    	Ditto; it's not trivial, but it's very doable with the
    regex engines I've mentioned.
    
    > In short, I think people turn to regexps because they mistakenly
    > perceive them as "easier" than writing 2-3 pages of code to build
    > an efficient matching language that suits the problem at hand. ;)
    	Agreed.  And for small enough problems, the regexes probably
    suffice.
    
    > I think it's also probably the case that a lot of people want to
    > use regexps because perl offers a convenient (if slow, awkward, and
    > overcomplex) way of prototyping something. That's true, but why
    > write something that you know is going to have a shortened
    > useful lifespan just because you know the tool isn't suited for
    > the job? I never understood that logic. ;)
    	Because while you were writing those 2-3 pages of code,
    I was writing the grammar that is the real point of the exercise, so
    that I can now turn to you and say "here's a field-tested grammar in a
    well-documented form; please plug it in to your more robust engine.".  :)
    
    > Now, what's cool about this approach is that you're building
    > your parse tree on the fly.
    	And this is the other advantage of an interpreted language
    for prototyping.
    
    > I guess what I'm saying is, "please, guys, study the problem and
    > think about it a bit before you just grab perl and start throwing
    > regexps around."
    	:)  That's my response to anyone proposing any new code
    for any problem.
    
    	-- Sweth.
    
    [1] Note that I'm not saying that the perl regex engine is pretty;
    just that it is robust.  Any engine that will snip pieces out of its own
    op tree at runtime scares me, but since I know why and how it's doing
    that, I am comfortable using it.
    
    -- 
    Sweth Chandramouli      Idiopathic Systems Consulting
    svcat_private      http://www.idiopathic.net/
    
    


