Re: [logs] Re: Generic Log Message Parsing Tool

From: Sweth Chandramouli (loganalysisat_private)
Date: Wed Jun 05 2002 - 10:42:49 PDT

Next message: yehuda: "RE: [logs] Re: Generic Log Message Parsing Tool"

Previous message: Sweth Chandramouli: "Re: [logs] Re: Generic Log Message Parsing Tool"
In reply to: Marcus J. Ranum: "Re: [logs] Re: Generic Log Message Parsing Tool"
Next in thread: Tina Bird: "Re: [logs] Re: Generic Log Message Parsing Tool"
Next in thread: Marcus J. Ranum: "Re: [logs] Re: Generic Log Message Parsing Tool"
Reply: Tina Bird: "Re: [logs] Re: Generic Log Message Parsing Tool"
Reply: Marcus J. Ranum: "Re: [logs] Re: Generic Log Message Parsing Tool"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

	[This is getting increasingly off-topic, except inasmuch
as it will probably affect the final outcome of any generic log parsing
effort; this will be my last post to the list on this subthread, then.]

On Wed, Jun 05, 2002 at 09:08:44AM -0400, Marcus J. Ranum wrote:
> >I don't know if this helps, but the Addamark LMS uses perl5 regular
> >   expressions to hack up the log into fields
> 
> My experience is that regexps are absolutely the wrong way to
> go about log parsing.
	Agreed, to a certain extent.  Actual regexes (that is,
ones that are "regular", in the set theory sense that gave them their
name) _can't_ parse log messages, and while extended regexes like the
ones perl provides _can_ do it, few enough people really understand how
they work that, beyond a certain level of complexity, they invariably
get them wrong in one way or another.  That said, I don't have the same
religious objections to regexes that you seem to, at least not with a
robust [1] regex implementation (like Perl's, or the Java ORO library).

> The approach I was working on relied on correct matching of
> combinations of space and non-space. Regexps are really a pain
> in the butt if you want to match on whitespace. You need to use
> something like: " *" oops wait there could be "[ \t]*" and oops
> you can't handle newlines right... Eeeew...  Regexps are a good
> tool for simple searching - they're not a good tool for simple
> parsing.
	Here's where we disagree most.  For simple parsing, I
think there's little better than a well-understood regex engine.  The
two I mentioned earlier are stellar for things like the ones you are
describing, with macros like "\s" to match whitespace (and a flag to
allow that to include newlines if dealing with a multiline pattern
space), and it's trivial to set case-insensitivity for either an entire
regex ("/your_regex_here/i") or a small portion of it
("/your_(?i:regex)_here/").
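
As a minimal sketch of the two features just mentioned, here is a
whitespace-tolerant match with case insensitivity scoped to one part of
the pattern.  It uses Python's re module as a stand-in, on the
assumption that its "\s" and "(?i:...)" syntax behaves like the
Perl-style syntax discussed here; the sample log line and the field
layout are invented for illustration.

```python
import re

# Hypothetical sample line (not from any real log): note the double
# space after the month and the tab before the host name.
line = "Jun  5 10:42:49\tmyhost SSHD[123]: Accepted password for sweth"

# \s+ absorbs any run of spaces or tabs between fields; (?i:...) makes
# only the program name case-insensitive, leaving the rest as written.
pattern = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+([\d:]{8})\s+(\S+)\s+((?i:sshd))\[(\d+)\]:\s+(.*)$"
)

m = pattern.match(line)
fields = m.groups() if m else None
print(fields)
```

The same pattern accepts "sshd", "SSHD", or "Sshd" as the program name
without a global flag changing how the other fields match.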

> They're really not a good tool for complex parsing.
	This I'd agree with, but only because my definition of
complex parsing is probably more complex than that of the average bear.
I'd definitely include the parsing of an entire arbitrary log message in
that definition, but I wouldn't include the parsing of a simple component
of a known log message.  (I understand your argument that putting a
recursive parser at the bottom of a recursive parse tree is rather
painful, but when the pattern being matched is simple (for some
definition of simple that I won't provide but will cop out and say that
people will, like pornography, know when they see it), the issues become
nonexistent.)

> still cost less than using a regexp AND be more reliable. Let
> me explain reliability scoring (something regexps don't have) -
	Scoring is a huge plus, but I think it's orthogonal to
what I'm proposing as a first step; there's nothing to say that a
person implementing a particular log parser couldn't, given the log
message grammar repository that I'm now proposing, choose to provide
scoring hooks for how well a node matches a particular portion of the
message being parsed.  (Again, things like this are trivial with the
Parse::RecDescent module in Perl, and while not necessarily trivial,
they _are_ feasible with a pure regex implementation as well.  (Let me
reiterate that I am NOT advocating a pure regex implementation.
I've seen attempts at that, and they make my stomach churn.))
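
One way such a scoring hook might look, as a rough sketch only: each
component of a known message format gets its own small regex, and the
score is simply the fraction of components that matched in order.  This
is written in Python rather than with Parse::RecDescent, and the
component list, the function name parse_with_score, and the scoring
rule are all hypothetical choices made for illustration.

```python
import re

# Hypothetical component grammar: one small regex per named field.
COMPONENTS = [
    ("timestamp", re.compile(r"\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}")),
    ("host", re.compile(r"[\w.-]+")),
    ("tag", re.compile(r"\w+\[\d+\]:")),
]
WHITESPACE = re.compile(r"\s*")


def parse_with_score(message):
    """Match components left to right; return (fields, score in [0, 1])."""
    fields, pos, matched = {}, 0, 0
    for name, regex in COMPONENTS:
        # Skip any whitespace separating this component from the last.
        pos = WHITESPACE.match(message, pos).end()
        m = regex.match(message, pos)
        if m is None:
            break  # stop at the first component that does not fit
        fields[name] = m.group(0)
        pos = m.end()
        matched += 1
    # Whatever was not consumed is kept as free text.
    fields["rest"] = message[pos:].strip()
    return fields, matched / len(COMPONENTS)


print(parse_with_score("Jun  5 10:42:49 myhost sshd[123]: Accepted password"))
print(parse_with_score("Jun  5 10:42:49 myhost kernel: oops"))
```

A real engine would presumably weight components differently and try
alternative grammars; the point is only that per-component matching
leaves an obvious place to hang a score.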


> Regexps force you to jump through hoops to match what you want.
	Any language does that; it all depends on what hoops you
are accustomed to jumping through.

> Regexps look like modem line noise and it's harder to train a
>         chimpanzee to write regexps than a simpler pattern matching
>         language.
	This, sadly, is true.  It's possible to have more readable
regexes in perl using its m//x syntax (which I won't go into; man
perlre if you are interested), but complex regexes are very definitely
reader-unfriendly.  Again, I would only advocate using regexes to parse
small portions of log messages.
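
For a flavour of what the extended syntax buys, here is a small sketch
using Python's re.VERBOSE flag, which I am assuming is close enough to
Perl's /x for illustration: whitespace in the pattern is ignored and
"#" starts a comment.  The pattern name IPV4_PORT and the sample text
are made up.

```python
import re

# The same idea as a one-line pattern, but laid out with comments.
IPV4_PORT = re.compile(r"""
    (?P<ip>                        # dotted-quad address
        \d{1,3} (?: \. \d{1,3} ){3}
    )
    :                              # literal separator
    (?P<port> \d{1,5} )            # port number
""", re.VERBOSE)

m = IPV4_PORT.search("connection from 192.0.2.10:5022 refused")
if m:
    print(m.group("ip"), m.group("port"))
```

Note that this only checks the shape of the address; it does not reject
octets above 255.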

> Regexps don't handle case insensitivity very well (depending on the
>         version) which means your expressions gain additional complexity
>         in order for you to accomplish something obvious that you need
>         to do frequently (bad!).
	Most implementations people would use nowadays would handle
this fine.

> Regexps are not as portable as we'd like them to be - various versions
>         crash, go non-linear, or lack features of other versions.
	Again, this is true of most languages; it's possible to
write a regex that won't fail to match before the heat death of the
universe, but it's also possible to write exponential growth functions
in other languages.  It all depends on how well you understand the tool
in question.  (Sadly, I'd say that most people who implement "parsers"
grok neither regexes nor parse trees well enough to do either well.)
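
A small sketch of the kind of regex in question, again in Python on the
assumption that its backtracking engine behaves like the ones discussed
here.  The nested quantifier lets a run of "a"s be divided between the
inner and outer "+" in exponentially many ways, and a failing match has
to try them all.  The input is kept deliberately short so that the
example returns quickly; the variable names are illustrative.

```python
import re

# Nested quantifiers: many different ways to split the same run of "a"s.
ambiguous = re.compile(r"^(a+)+$")
# Accepts exactly the same strings, with only one way to match each.
unambiguous = re.compile(r"^a+$")

# Hypothetical input: a short run of "a"s followed by a character that
# forces the match to fail.
subject = "a" * 16 + "!"

slow_result = ambiguous.match(subject)    # fails after heavy backtracking
fast_result = unambiguous.match(subject)  # fails after a single pass
print(slow_result, fast_result)
```

Both patterns give the same answer; in a backtracking engine the first
one roughly doubles its work for each extra "a" in the input.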

> Regexps handling of newlines is graceless in the extreme.
	Again, this is implementation specific.

> Regexps lack match scoring and rely only on the length of the match
>         as the indicator (not the match of the template).
	Ditto; it's not trivial, but it's very doable with the
regex engines I've mentioned.

> In short, I think people turn to regexps because they mistakenly
> perceive them as "easier" than writing 2-3 pages of code to build
> an efficient matching language that suits the problem at hand. ;)
	Agreed.  And for small enough problems, the regexes probably
suffice.

> I think it's also probably the case that a lot of people want to
> use regexps because perl offers a convenient (if slow, awkward, and
> overcomplex) way of prototyping something. That's true, but why
> write something that you know is going to have a shortened
> useful lifespan just because you know the tool isn't suited for
> the job? I never understood that logic. ;)
	Because while you were writing those 2-3 pages of code,
I was writing the grammar that is the real point of the exercise, so
that I can now turn to you and say "here's a field-tested grammar in a
well-documented form; please plug it in to your more robust engine.".  :)

> Now, what's cool about this approach is that you're building
> your parse tree on the fly.
	And this is the other advantage of an interpreted language
for prototyping.

> I guess what I'm saying is, "please, guys, study the problem and
> think about it a bit before you just grab perl and start throwing
> regexps around."
	:)  That's my response to anyone proposing any new code
for any problem.

	-- Sweth.

[1] Note that I'm not saying that the perl regex engine is pretty;
just that it is robust.  Any engine that will snip pieces out of its own
op tree at runtime scares me, but since I know why and how it's doing
that, I am comfortable using it.

-- 
Sweth Chandramouli      Idiopathic Systems Consulting
svcat_private      http://www.idiopathic.net/


This archive was generated by hypermail 2b30 : Wed Jun 05 2002 - 10:56:46 PDT