> I don't know if this helps, but the Addamark LMS uses perl5 regular
> expressions to hack up the log into fields

My experience is that regexps are absolutely the wrong way to go about
log parsing. Consider that regexps _are_ a parser (just a bad one!) and
ask yourself what happens if, at the terminal nodes of a recursive
descent parser, you have _another_ recursive descent parser. That just
doesn't make any sense!!!

The results you get by using regexps aren't predictable enough. To "fix"
that you'll wind up doing match-scoring tricks (if you care) to
differentiate between:

	"S[uU]: .*"
and
	".*"

That's a contrived example, but it brings out a couple of important
things I noticed about log parsing:

- case doesn't matter: make everything case-insensitive
- wildcards are useless: whitespace is everything

The approach I was working on relied on correct matching of combinations
of space and non-space. Regexps are a real pain in the butt if you want
to match on whitespace. You need something like " *" -- oops, wait,
there could be tabs, so make that "[ \t]*" -- and oops, you can't handle
newlines right... Eeeew...

Regexps are a good tool for simple searching. They're not a good tool
for simple parsing, and they're really not a good tool for complex
parsing. So let's look at another approach, and then I'll generalize it
to a broader one.

First, assume you're using a matching language that looks like:

	%s           = matches a string
	%6s          = matches 6 chars of a string
	%d           = matches a number (including negative numbers)
	%f           = matches a floating point #
	%w           = matches whitespace explicitly
	               (e.g.: %5w = matches 5 spaces or tabs)
	(whitespace) = matches whitespace - any quantity

Newlines can be embedded literally in the match string, which greatly
helps readability. Everything else is a literal, and everything is
case-insensitive.

If you think about implementing something like this in C you can guess
it's about 3 pages of code - 2 if you get tricky. It'll run extremely
fast.
Now you match using a spec like:

	"Su: BADSU %s (tty %s)"

which won't match everything, but if your matching routine is fast and
efficient enough you can use 3 or 4 rules and it'll still cost less than
using a regexp AND be more reliable.

Let me explain reliability scoring (something regexps don't have): treat
each item in the match string as a unique item worth 1 point when you
match it. So the pattern:

	"Su: BADSU %s (tty %s)"

contains 19 items. "Su:" is 3. "Su: " is 4 (the whitespace), etc. If you
mis-match anything, the score drops to zero. If you get to the end of
the match string, the match string with the most SPECIFIC match wins.
Consider:

	"%s: %s %s (tty %s)"

That's 14 items. Even though it'd also match our BADSU message, it's not
worth as much as the more specific match string, so it's less likely to
get picked.

Remember - if you're doing "log parsing" (whatever that is) you do NOT
WANT something that will accidentally grab BAD LOGIN messages and stuff
them into BADSU fields because someone forgot to add a "[ \t]*"
someplace. You want it to be readable and as specific as possible. That
way, if a message comes in that does not have a specific matching rule,
it'll fall out the bottom of the system and someone will know to write a
rule for it. Otherwise you'll end your rules with a "^.*$" and you've
got a very cool log parser that unfortunately parses garbage.

Regexps force you to jump through hoops to match what you want. Regexps
look like modem line noise, and it's harder to train a chimpanzee to
write regexps than a simpler pattern matching language. Regexps don't
handle case insensitivity very well (depending on the version), which
means your expressions gain additional complexity in order to accomplish
something obvious that you need to do frequently (bad!). Regexps are not
as portable as we'd like them to be - various versions crash, go
non-linear, or lack features of other versions. Regexps' handling of
newlines is graceless in the extreme.
Regexps lack match scoring and rely only on the length of the match as
the indicator (not how well the template fit).

In short, I think people turn to regexps because they mistakenly
perceive them as "easier" than writing 2-3 pages of code to build an
efficient matching language that suits the problem at hand. ;) It's also
probably the case that a lot of people want to use regexps because perl
offers a convenient (if slow, awkward, and overcomplex) way of
prototyping something. That's true, but why write something that you
know is going to have a shortened useful lifespan just because you know
the tool isn't suited for the job? I never understood that logic. ;)

Now let's look at advanced parsing techniques. If you're going to play
with parsing, you _HAVE_ to read the Aho, Sethi, and Ullman book on
compilers - the "Dragon" book. It's got some great explanations of
recursive descent techniques and how to build parsers. Understanding
what yacc does is key, because it's fast, flexible, and wonderful. ;)
Regexps are basically mini recursive-descent parsers, FYI.

Anyhow, what you want to be able to do (what I was working on before I
had to quit...) is write a BNF notation for logs. It needn't be complex,
but consider something that looks like:

	datetime:
		"%d:%d:%d"
			{ year=$1 month=$2 day=$3 hour="?" min="?" }
		"%d %d %d: %d %d"
			{ year=$1 month=$2 day=$3 hour=$4 min=$5 }

	$datetime "badsu: %s (tty %s)"
	$datetime "sendmail blah blah"

What you've done is define a node for "datetime" and made several
higher-level productions depend on it. Now you can specify either a
tuned date-time format for your machine, or have several and let the
parser pick the one that fits best. Access the fields in $datetime as
$datetime.hour or whatever.

What's cool about this approach is that you're building your parse tree
on the fly.
The first thing (in the example above) after the datetime is either a
literal "badsu:" or "sendmail", so you can build a prefix unbalanced
n-way decision tree that lets you match _anything_ against an
arbitrarily sized set of matching rules without ever having to check
more than one character of mis-match. Fast? Oh, yeah. It'll take you
more than 3 pages of code, though, but the data structures aren't hard,
and the value of being able to load as many rules as you like into the
system without slowing it significantly makes it worth the effort.

I guess what I'm saying is, "please, guys, study the problem and think
about it a bit before you just grab perl and start throwing regexps
around."

mjr.

---------------------------------------------------------------------
To unsubscribe, e-mail: loganalysis-unsubscribeat_private
For additional commands, e-mail: loganalysis-helpat_private