At http://www.ranum.com/logging/logging-data-map.html I've posted the first version of a token glossary and format that Paul Robertson and I developed for the now-defunct Fargo project. I no longer have any examples of parsed-out records produced by Fargo, so it's hard to illustrate them. On the other hand, the data map is pretty straightforward and quite usable as is. If you had one logging system recording data in accordance with this map, you could trivially translate it to another.

The approach Fargo took to tokenizing was to identify known elements from the glossary and break them out into a "pseudo-XML" - something XML-like enough that an XML parser would probably work fine on it, but simple enough to get the job done efficiently (sketched below).

Probably the most important thing you'll notice about the layout is that we didn't feel it was possible to tightly specify everything. In fact, we concluded that it's a BAD IDEA to tightly specify everything. So we came up with buckets into which a variety of things can be stored. Take SRCDEV - the source device identifier - for example: it might be a host name, an IP address, a MAC address, or even a physical device in kernel space ("wd0c"). The bucket is still useful, because you can correlate on SRCDEV and sort/search without having to know the specific type of data it happens to hold.

One important side effect of this design decision is that the fields are UN-TYPED: the parser treats everything as a string and nothing more. That turns out to be valuable too, because you can lexically sort SRCDEV and "wd0c" will come out at the bottom while all the IP addresses cluster by network range. Treating everything as strings has some very good properties in that regard. The downside is that if one place logs SRCDEV=10.10.10.111 and another logs SRCDEV=iorek.ranum.com, you have the same value in two different representations. The only conclusion we reached there was that if you cared, you could write a pre-processor that walked SRCDEV and tried to re-parse anything that looked like a MAC address against an ARP table, or a host name against a DNS lookup (also sketched below). The intent was to get the data as close as possible to the correct "bucket" and let people who want to pre- or post-process it more thoroughly do so.

By bucketing stuff loosely you can do fun queries like "search where SRCDEV = TARGDEV" and it'll do the "right thing" whether the sources are hard disks or IP addresses - and values won't tend to "jump across" types, since the format for hardware devices is usually not lexically close to an IP address: "wd0" != "10.10.10.111". We also figured this would be useful and fun for close-matching/fuzzy-matching routines - try to see if "10.10.10" is within 5% of "10.10.10.111", etc.
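To make the pieces above concrete, here are a few sketches. They're hypothetical Python written for this note - none of it is Fargo code. First, the tokenizing: a two-entry stand-in glossary (the real one is in the data map at the URL above) and a pass that breaks known elements out into pseudo-XML:

    import re

    # A tiny stand-in for the real token glossary; these two patterns
    # are illustrative, not Fargo's actual rules.
    GLOSSARY = [
        ("SRCIP",  re.compile(r"\bfrom\s+(\d{1,3}(?:\.\d{1,3}){3})")),
        ("SRCDEV", re.compile(r"\bon\s+(\S+)")),
    ]

    def tokenize(rawmsg):
        """Break known elements out of a raw log line into pseudo-XML."""
        out = []
        for token, pattern in GLOSSARY:
            m = pattern.search(rawmsg)
            if m:
                out.append("<%s>%s</%s>" % (token, m.group(1), token))
        return " ".join(out)

    print(tokenize("failed login from 10.10.10.111 on wd0c"))
    # -> <SRCIP>10.10.10.111</SRCIP> <SRCDEV>wd0c</SRCDEV>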
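Next, the untyped-fields point: since every value is just a string, lexical sorting and equality queries fall out for free. The values below are made up:

    # Everything is a string, so an ordinary lexical sort clusters IP
    # addresses by network range and pushes device names to the bottom.
    srcdevs = ["wd0c", "10.10.10.111", "10.20.1.1", "10.10.10.5", "iorek.ranum.com"]
    print(sorted(srcdevs))
    # ['10.10.10.111', '10.10.10.5', '10.20.1.1', 'iorek.ranum.com', 'wd0c']

    # A "where SRCDEV = TARGDEV" query is just string equality, and it
    # does the right thing whether the values are disks or IP addresses:
    record = {"SRCDEV": "wd0c", "TARGDEV": "wd0c"}
    print(record["SRCDEV"] == record["TARGDEV"])    # True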
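And the pre-processor idea - here assuming a crude "looks like a hostname" test and the system resolver; the ARP-table half is left as a comment:

    import re, socket

    IP_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")

    def normalize_srcdev(value):
        """Best-effort pass to pull SRCDEV toward one representation.

        If the value looks like a host name, try DNS and substitute the
        address; on failure (or for kernel devices like "wd0c") leave
        the string alone.  A real pre-processor would also re-parse
        anything that looked like a MAC address against an ARP table.
        """
        if IP_RE.match(value):
            return value                  # already an address
        if "." in value:                  # crude "looks like a hostname" test
            try:
                return socket.gethostbyname(value)
            except socket.error:
                pass
        return value

    # normalize_srcdev("iorek.ranum.com") -> its address, if DNS resolves it
    # normalize_srcdev("wd0c")            -> "wd0c", untouched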
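Finally, fuzzy matching. difflib's similarity ratio is my assumption here, not anything Fargo specified, but it gives the flavor:

    import difflib

    def closeness(a, b):
        """Similarity of two field values, as strings, from 0.0 to 1.0."""
        return difflib.SequenceMatcher(None, a, b).ratio()

    print(closeness("10.10.10", "10.10.10.111"))    # 0.8
    print(closeness("wd0c", "10.10.10.111"))        # much lower

A "within 5%" test would then just be closeness(a, b) >= 0.95.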
A few assumptions are hidden in the layout:

1) We recognize that XML adds considerable markup to the logs and would increase their size. It is assumed that compression is being applied to the logs, but we leave that as an exercise for the reader. (Fargo handled compression as an offline process.) Compression should address - and then some! - the text bloat caused by XML, as well as the duplication of some elements caused by tokenizing. (See the first sketch after this list.)

2) We used normalized dates (ISO 8601) - doing this almost guarantees that "original date" timestamps need to be kept in their own field in case a mapping fails. (See the second sketch after this list.)

3) For forensics/evidentiary purposes Fargo kept a complete copy of the ORIGINAL log message, untouched, in a field called RAWMSG - optionally - which is one reason why compression was considered a "must".

4) We used Snort's priority rating scheme, because it seems pretty decent.
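A sketch of the compression assumption in items 1 and 3: zlib here stands in for whatever offline compressor you'd actually use, and the log lines are fabricated:

    import zlib

    # Log text is repetitive, so compressing the (optional) RAWMSG copies
    # offline - as Fargo did - wins back the XML markup overhead and more.
    rawmsgs = "\n".join(
        "Aug 21 15:00:%02d host sshd[42]: failed login from 10.10.10.111" % i
        for i in range(60)
    ).encode()
    packed = zlib.compress(rawmsgs, 9)
    print("%d bytes -> %d bytes" % (len(rawmsgs), len(packed)))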
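And the date normalization from item 2 - a sketch assuming a small, hypothetical list of known input formats (not a list Fargo actually used); whenever the mapping fails, the original stamp survives in its own field:

    from datetime import datetime

    KNOWN_FORMATS = ["%b %d %H:%M:%S %Y", "%Y/%m/%d %H:%M:%S"]

    def normalize_date(original):
        """Return (ISO 8601 date or None, original).  The original stamp
        is always kept in its own field in case the mapping fails."""
        for fmt in KNOWN_FORMATS:
            try:
                return datetime.strptime(original, fmt).isoformat(), original
            except ValueError:
                pass
        return None, original

    print(normalize_date("Aug 21 15:00:24 2002"))
    # ('2002-08-21T15:00:24', 'Aug 21 15:00:24 2002')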
mjr.
---
Marcus J. Ranum                         http://www.ranum.com
Computer and Communications Security    mjrat_private