[This is getting increasingly off-topic, except inasmuch as it will probably affect the final outcome of any generic log parsing effort; this will be my last post to the list on this subthread, then.]

On Wed, Jun 05, 2002 at 09:08:44AM -0400, Marcus J. Ranum wrote:
> >I don't know if this helps, but the Addamark LMS uses perl5 regular
> > expressions to hack up the log into fields
>
> My experience is that regexps are absolutely the wrong way to
> go about log parsing.

Agreed, to a certain extent. Actual regexes (that is, ones that are "regular", in the set theory sense that gave them their name) _can't_ parse log messages, and while extended regexes like perl provides _can_ do it, few enough people really understand how they work that, beyond a certain level of complexity, they invariably get them wrong in one way or another. That said, I don't have the same religious objections to regexes that you seem to; with a robust [1] regex implementation (like Perl's, or the Java ORO library).

> The approach I was working on relied on correct matching of
> combinations of space and non-space. Regexps are really a pain
> in the butt if you want to match on whitespace. You need to use
> something like: " *" oops wait there could be "[ \t]*" and oops
> you can't handle newlines right... Eeeew... Regexps are a good
> tool for simple searching - they're not a good tool for simple
> parsing.

Here's where we disagree most. For simple parsing, I think there's little better than a well-understood regex engine. The two I mentioned earlier are stellar for things like you are describing, with macros like "\s" to match whitespace (and a flag to allow that to include newlines if dealing with a multiline pattern space), and it's trivial to set case-insensitivity for either an entire regex ("/your_regex_here/i") or a small portion of it ("/your_(?i:regex)_here/").

> They're really not a good tool for complex parsing.
This I'd agree with, but only because my definition of complex parsing is probably more complex than that of the average bear. I'd definitely include the parsing of an entire arbitrary log message in that definition, but I wouldn't include the parsing of a simple component of a known log message. (I understand your argument that putting a recursive parser at the bottom of a recursive parse tree is rather painful, but when the pattern being matched is simple (for some definition of simple that I won't provide but will cop out and say that people will, like pornography, know when they see it), the issues become nonexistent.)

> still cost less than using a regexp AND be more reliable. Let
> me explain reliability scoring (something regexps don't have) -

Scoring is a huge plus, but I think it's orthogonal to what I'm proposing as a first step; there's nothing to say that a person implementing a particular log parser couldn't, given the log message grammar repository that I'm now proposing, choose to provide scoring hooks for how well a node matches a particular portion of the message being parsed. (Again, things like this are trivial with the Parse::RecDescent module in Perl, and while not necessarily trivial, they _are_ feasible with a pure regex implementation as well. (Let me again reiterate that I am NOT advocating a pure regex implementation. I've seen attempts at that, and they make my stomach churn.))

> Regexps force you to jump through hoops to match what you want.

Any language does that; it all depends on what hoops you are accustomed to jumping through.

> Regexps look like modem line noise and it's harder to train a
> chimpanzee to write regexps than a simpler pattern matching
> language.

This, sadly, is true. It's possible to have more readable regexes in perl using its m//x syntax (which I won't go into; man perlre if you are interested), but complex regexes are very definitely reader-unfriendly.
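To make the readability point a bit more concrete, here is a minimal sketch of a commented regex for one small, known component of a log message. It is written in Python rather than Perl purely for illustration (that choice is an assumption, not something from this thread): Python's re.VERBOSE flag plays roughly the role of Perl's /x, and the pattern uses the \s and (?i:...) constructs mentioned earlier. The sshd "program[pid]:" layout and the names TAG and parse_tag are hypothetical, not taken from any real grammar.

```python
import re

# Hypothetical pattern for one small, known component of a syslog-style
# line: a "program[pid]: message" tag. The layout is assumed for illustration.
TAG = re.compile(
    r"""
    ^\s*                     # optional leading whitespace (spaces or tabs)
    (?P<program>(?i:sshd))   # program name, case-insensitive for this group only
    \[(?P<pid>\d+)\]         # process id inside square brackets
    :\s+                     # colon, then at least one whitespace character
    (?P<message>.*\S)        # rest of the message, trailing whitespace excluded
    \s*$                     # optional trailing whitespace, then end of text
    """,
    re.VERBOSE,
)


def parse_tag(text):
    """Return the named fields as a dict, or None if the text does not match."""
    match = TAG.match(text)
    return match.groupdict() if match else None


print(parse_tag("SSHD[4242]:\tAccepted password for alice"))
```

The point of the sketch is only that a small pattern with inline comments stays legible; it says nothing about whether a whole arbitrary log message should be handled this way.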
Again, I would only advocate using regexes to parse small portions of log messages.

> Regexps don't handle case insensitivity very well (depending on the
> version) which means your expressions gain additional complexity
> in order for you to accomplish something obvious that you need
> to do frequently (bad!).

Most implementations people would use nowadays would handle this fine.

> Regexps are not as portable as we'd like them to be - various versions
> crash, go non-linear, or lack features of other versions.

Again, this is true of most languages; it's possible to write a regex that won't fail to match before the heat death of the universe, but it's also possible to write exponential growth functions in other languages. It all depends on how well you understand the tool in question. (Sadly, I'd say that most people who implement "parsers" grok neither regexes nor parse trees well enough to do either well.)

> Regexps handling of newlines is graceless in the extreme.

Again, this is implementation specific.

> Regexps lack match scoring and rely only on the length of the match
> as the indicator (not the match of the template).

Ditto; it's not trivial, but it's very doable with the regex engines I've mentioned.

> In short, I think people turn to regexps because they mistakenly
> perceive them as "easier" than writing 2-3 pages of code to build
> an efficient matching language that suits the problem at hand. ;)

Agreed. And for small enough problems, the regexes probably suffice.

> I think it's also probably the case that a lot of people want to
> use regexps because perl offers a convenient (if slow, awkward, and
> overcomplex) way of prototyping something. That's true, but why
> write something that you know is going to have a shortened
> useful lifespan just because you know the tool isn't suited for
> the job? I never understood that logic.
;) Because while you were writing those 2-3 pages of code, I was writing the grammar that is the real point of the exercise, so that I can now turn to you and say "here's a field-tested grammar in a well-documented form; please plug it in to your more robust engine." :)

> Now, what's cool about this approach is that you're building
> your parse tree on the fly.

And this is the other advantage of an interpreted language for prototyping.

> I guess what I'm saying is, "please, guys, study the problem and
> think about it a bit before you just grab perl and start throwing
> regexps around." :)

That's my response to anyone proposing any new code for any problem.

-- Sweth.

[1] Note that I'm not saying that the perl regex engine is pretty; just that it is robust. Any engine that will snip pieces out of its own op tree at runtime scares me, but since I know why and how it's doing that, I am comfortable using it.

--
Sweth Chandramouli
Idiopathic Systems Consulting
svcat_private
http://www.idiopathic.net/