neat-- thanks for the detailed reply, and I agree! Regexps are a huge
pain, but they're the lowest common denominator, which is why v1.0 uses
them... I'd *love* to see something better, for which we can add support
in the LMS (or a converter tool), then run the sucker in parallel.

One thing: LR(1) parsers are not a panacea-- yacc replaces the "scoring"
problem with horrible shift-reduce conflicts between subtly conflicting
rules. Having "done it for dollars" writing commercial compilers, I can
attest that lex+yacc is not so "wonderful" in practice. Maybe logs won't
exhibit these problems...

adam

> > I don't know if this helps, but the Addamark LMS uses perl5 regular
> > expressions to hack up the log into fields
>
> My experience is that regexps are absolutely the wrong way to go
> about log parsing. Consider that regexps _are_ a parser (just a bad
> one!) and ask yourself what happens if, at the terminal nodes of a
> recursive descent parser, you have _another_ recursive descent
> parser. That just doesn't make any sense! The results you get by
> using regexps aren't predictable enough. To "fix" that you'll wind up
> doing match-scoring tricks (if you care) to differentiate between:
>         "S[uU]: .*"
> and
>         ".*"
>
> That's a contrived example, but it brings out a couple of important
> things I noticed about log parsing:
> - case doesn't matter: make everything case-insensitive
> - wildcards are useless: whitespace is everything
>
> The approach I was working on relied on correct matching of
> combinations of space and non-space. Regexps are really a pain in the
> butt if you want to match on whitespace. You need to use something
> like " *"-- oops, wait, it could be "[ \t]*", and oops, you can't
> handle newlines right... Eeeew... Regexps are a good tool for simple
> searching. They're not a good tool for simple parsing, and they're
> really not a good tool for complex parsing.
>
> So let's look at another approach, and then I'll generalize it to a
> broader one... First, assume you're using a matching language that
> looks like:
>
>         %s  = matches a string (a run of non-whitespace)
>         %6s = matches 6 chars in a string
>         %d  = matches a number (including negative numbers)
>         %f  = matches a floating point number
>         %w  = matches whitespace explicitly (e.g.: %5w matches 5
>               spaces or tabs)
>         (whitespace) matches whitespace - any quantity
>
> Newlines can be embedded literally in the match string, which greatly
> helps readability. Everything else is a literal, and everything is
> case-insensitive. If you think about implementing something like this
> in C, you can guess it's about 3 pages of code - 2 if you get tricky.
> It'll run extremely fast.
>
> Now you match using a spec like:
>         "Su: BADSU %s (tty %s)"
> which won't match everything, but if your matching routine is fast
> and efficient enough you can use 3 or 4 rules and it'll still cost
> less than using a regexp AND be more reliable. Let me explain
> reliability scoring (something regexps don't have): treat each item
> in the match string as a unique item worth 1 point when you match it.
> So the pattern:
>         "Su: BADSU %s (tty %s)"
> contains 19 items. "Su:" is 3; "Su: " is 4 (the whitespace counts);
> etc. If you mis-match anything, the score drops to zero. If you get
> to the end of the match string, the match string with the most
> SPECIFIC match wins. Consider:
>         "%s: %s %s (tty %s)"
> That's 14 items - even though it'd also match our BADSU message, it's
> not worth as much as the more specific match string, so it's less
> likely to get picked.
>
> Remember - if you're doing "log parsing" (whatever that is) you do
> NOT WANT something that will accidentally grab BAD LOGIN messages and
> stuff them into BADSU fields because someone forgot to add a "[ \t]*"
> someplace. You want it to be readable and as specific as possible.
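A minimal sketch of such a matcher with the item-per-point scoring described above -- names are mine and only a subset of the directives (%s, %d, literal whitespace) is handled; as one extra assumption, %s here stops at the next pattern literal so that "(tty %s)" can still close its paren:

```c
#include <ctype.h>

/* Hypothetical sketch, not code from the post. Literals match
 * case-insensitively; every matched literal character, whitespace
 * run, or %-directive is worth 1 point; any mismatch scores 0.
 * So "Su: BADSU %s (tty %s)" scores 19 and the vaguer
 * "%s: %s %s (tty %s)" scores 14, as in the text above. */
int match_score(const char *pat, const char *line)
{
    int score = 0;

    while (*pat) {
        if (pat[0] == '%' && pat[1] == 's') {
            /* %s: eat non-whitespace, stopping early at the next
             * literal in the pattern (if any) */
            char delim = (pat[2] && pat[2] != '%' && pat[2] != ' ')
                             ? pat[2] : '\0';
            const char *start = line;
            while (*line && !isspace((unsigned char)*line) &&
                   tolower((unsigned char)*line) !=
                   tolower((unsigned char)delim))
                line++;
            if (line == start)
                return 0;             /* %s must consume something */
            score++;
            pat += 2;
        } else if (pat[0] == '%' && pat[1] == 'd') {
            /* %d: optionally signed run of digits */
            const char *digs;
            if (*line == '-')
                line++;
            digs = line;
            while (isdigit((unsigned char)*line))
                line++;
            if (line == digs)
                return 0;             /* need at least one digit */
            score++;
            pat += 2;
        } else if (*pat == ' ') {
            /* pattern whitespace matches any run of spaces/tabs */
            if (*line != ' ' && *line != '\t')
                return 0;
            while (*line == ' ' || *line == '\t')
                line++;
            score++;
            pat++;
        } else {
            /* case-insensitive literal */
            if (tolower((unsigned char)*pat) !=
                tolower((unsigned char)*line))
                return 0;
            score++;
            pat++;
            line++;
        }
    }
    return *line == '\0' ? score : 0; /* must consume the whole line */
}
```

Run every rule over a line, keep the highest non-zero score, and the most specific rule wins -- exactly the tie-break the two patterns above need.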
> That way, if a message comes in that does not have a specific
> matching rule, it'll fall out the bottom of the system and someone
> will know to write a rule for it. Otherwise you'll end your rules
> with a:
>         "^.*$"
> and you've got a very cool log parser that unfortunately parses
> garbage.
>
> Regexps force you to jump through hoops to match what you want.
> Regexps look like modem line noise, and it's harder to train a
> chimpanzee to write regexps than a simpler pattern-matching language.
> Regexps don't handle case insensitivity very well (depending on the
> version), which means your expressions gain additional complexity in
> order for you to accomplish something obvious that you need to do
> frequently (bad!).
> Regexps are not as portable as we'd like them to be - various
> versions crash, go non-linear, or lack features of other versions.
> Regexps' handling of newlines is graceless in the extreme.
> Regexps lack match scoring and rely only on the length of the match
> as the indicator (not the fit of the template).
>
> In short, I think people turn to regexps because they mistakenly
> perceive them as "easier" than writing 2-3 pages of code to build an
> efficient matching language that suits the problem at hand. ;)
> I think it's also probably the case that a lot of people want to use
> regexps because perl offers a convenient (if slow, awkward, and
> overcomplex) way of prototyping something. That's true, but why write
> something you know is going to have a shortened useful lifespan, just
> because you know the tool isn't suited for the job? I never
> understood that logic. ;)
>
> Now let's look at advanced parsing techniques...
>
> If you're going to play with parsing, you _HAVE_ to read the Aho,
> Sethi, and Ullman book on compilers - the "Dragon" book. It's got
> some great explanations of recursive descent techniques and how to
> build parsers. Understanding what yacc does is key, because it's
> fast, flexible, and wonderful.
> ;) Regexps are basically mini recursive-descent parsers, FYI.
> Anyhow, what you want to be able to do (what I was working on before
> I had to quit...) is write a BNF notation for logs. It needn't be
> complex, but consider something that looks like:
>
>         datetime:
>                 "%d:%d:%d"
>                 {
>                         year=$1
>                         month=$2
>                         day=$3
>                         hour="?"
>                         min="?"
>                 }
>                 "%d %d %d: %d %d"
>                 {
>                         year=$1
>                         month=$2
>                         day=$3
>                         hour=$4
>                         min=$5
>                 }
>
>         $datetime "badsu: %s (tty %s)"
>         $datetime "sendmail blah blah"
>
> OK, now what you've done is define a node for "datetime" and make
> several higher-level productions depend on it. Now you can specify
> either a tuned date-time format for your machine, or have several and
> let the parser pick the one that fits best. Access the fields in
> $datetime as $datetime.hour, or whatever.
>
> Now, what's cool about this approach is that you're building your
> parse tree on the fly. The first thing (in the example above) after
> datetime is either a literal "badsu:" or "sendmail", so you can build
> a prefix unbalanced n-way decision tree that lets you match
> _anything_ against an arbitrarily sized set of matching rules without
> ever having to check more than one character of mis-match. Fast? Oh,
> yeah. It'll take you more than 3 pages of code, though, but the data
> structures aren't hard, and the value of being able to load as many
> rules as you like into the system without slowing it significantly
> makes it worth the effort.
>
> I guess what I'm saying is, "please, guys, study the problem and
> think about it a bit before you just grab perl and start throwing
> regexps around."
>
> mjr.

---------------------------------------------------------------------
To unsubscribe, e-mail: loganalysis-unsubscribeat_private
For additional commands, e-mail: loganalysis-helpat_private
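The prefix unbalanced n-way decision tree described above can be sketched roughly as follows -- a hypothetical draft, not code from the thread, keying each rule on the literal text before its first % directive:

```c
#include <ctype.h>
#include <stdlib.h>

/* Node of the unbalanced n-way prefix tree: one branch per
 * (lowercased) byte, so dispatch stops after at most one character
 * of mismatch on any path. */
struct node {
    struct node *kids[256];  /* next node, indexed by lowercased byte */
    int rule;                /* rule id if a prefix ends here, else -1 */
};

static struct node *new_node(void)
{
    struct node *n = calloc(1, sizeof *n);
    if (!n)
        abort();
    n->rule = -1;
    return n;
}

/* Register pattern `id`, keyed on its literal prefix (the text
 * before the first % directive). */
static void trie_add(struct node *root, const char *pat, int id)
{
    struct node *n = root;

    for (; *pat && *pat != '%'; pat++) {
        unsigned char c = tolower((unsigned char)*pat);
        if (!n->kids[c])
            n->kids[c] = new_node();
        n = n->kids[c];
    }
    n->rule = id;
}

/* Walk the line once, remembering the deepest (most specific) rule
 * prefix seen; stop at the first byte with no branch. Returns -1
 * when no rule matches, so the line "falls out the bottom". */
static int trie_dispatch(const struct node *root, const char *line)
{
    const struct node *n = root;
    int best = root->rule;

    for (; *line && n; line++) {
        n = n->kids[tolower((unsigned char)*line)];
        if (n && n->rule != -1)
            best = n->rule;
    }
    return best;
}
```

The tree only prunes: in a full parser you'd then run the candidate rule's complete match (scoring included) to confirm, but loading thousands of rules no longer slows the per-line cost.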
This archive was generated by hypermail 2b30 : Wed Jun 05 2002 - 08:09:28 PDT