At http://www.ranum.com/logging/logging-data-map.html I've posted the first version of a token glossary and format that Paul Robertson and I developed for the now-defunct Fargo project. I no longer have any examples of parsed-out records produced by Fargo, so it's hard to illustrate them. On the other hand, the data map is pretty straightforward and quite usable as is. If you had one logging system recording data in accordance with this map, you could trivially translate it to another.

The approach Fargo took to tokenizing was to identify known elements from the glossary and break them out into a "pseudo-XML" - something XML-like enough that an XML parser would probably work fine on it, but simple enough to get the job done efficiently (sketched below).

Probably the most important thing you'll notice about the layout is that we didn't feel it was possible to tightly specify everything. In fact, we concluded that it's a BAD IDEA to tightly specify everything. So we came up with buckets into which a variety of things can be stored. Take SRCDEV - the source device identifier - for example: it might be a host name, an IP address, a MAC address, or even a physical device in kernel space ("wd0c"). The bucket is still useful, because you can correlate on SRCDEV and sort/search without having to know the specific type of data it happens to hold.

One important side effect of this design decision is that the fields are UN-TYPED: the parser treats everything as a string and nothing more. That turns out to be valuable too, because you can lexically sort SRCDEV and "wd0c" will come out at the bottom while all the IP addresses cluster by network range. Treating everything as strings has some very good properties in that regard. The downside is that if one place logs SRCDEV=10.10.10.111 and another logs SRCDEV=iorek.ranum.com, you have the same value in two different representations. The only conclusion we reached there was that if you cared, you could write a pre-processor that walked SRCDEV and tried to re-parse anything that looked like a MAC address against an ARP table, or a host name against a DNS lookup (also sketched below). The intent was to get the data as close as possible to the correct "bucket" and let people who want to pre- or post-process it more thoroughly do so.

By bucketing stuff loosely you can do fun queries like "search where SRCDEV = TARGDEV" and it'll do the "right thing" whether the sources are hard disks or IP addresses - and values won't tend to "jump across" types, since the format for hardware devices is usually not lexically close to an IP address: "wd0" != "10.10.10.111". We also figured this would be useful and fun for close-matching/fuzzy-matching routines - try to see if "10.10.10" is within 5% of "10.10.10.111", etc.
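To make the pieces above concrete, here are a few sketches. They're hypothetical Python written for this note - none of it is Fargo code. First, the tokenizing: a two-entry stand-in glossary (the real one is in the data map at the URL above) and a pass that breaks known elements out into pseudo-XML:

    import re

    # A tiny stand-in for the real token glossary; these two patterns
    # are illustrative, not Fargo's actual rules.
    GLOSSARY = [
        ("SRCIP",  re.compile(r"\bfrom\s+(\d{1,3}(?:\.\d{1,3}){3})")),
        ("SRCDEV", re.compile(r"\bon\s+(\S+)")),
    ]

    def tokenize(rawmsg):
        """Break known elements out of a raw log line into pseudo-XML."""
        out = []
        for token, pattern in GLOSSARY:
            m = pattern.search(rawmsg)
            if m:
                out.append("<%s>%s</%s>" % (token, m.group(1), token))
        return " ".join(out)

    print(tokenize("failed login from 10.10.10.111 on wd0c"))
    # -> <SRCIP>10.10.10.111</SRCIP> <SRCDEV>wd0c</SRCDEV>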
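Next, the untyped-fields point: since every value is just a string, lexical sorting and equality queries fall out for free. The values below are made up:

    # Everything is a string, so an ordinary lexical sort clusters IP
    # addresses by network range and pushes device names to the bottom.
    srcdevs = ["wd0c", "10.10.10.111", "10.20.1.1", "10.10.10.5", "iorek.ranum.com"]
    print(sorted(srcdevs))
    # ['10.10.10.111', '10.10.10.5', '10.20.1.1', 'iorek.ranum.com', 'wd0c']

    # A "where SRCDEV = TARGDEV" query is just string equality, and it
    # does the right thing whether the values are disks or IP addresses:
    record = {"SRCDEV": "wd0c", "TARGDEV": "wd0c"}
    print(record["SRCDEV"] == record["TARGDEV"])    # True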
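And the pre-processor idea - here assuming a crude "looks like a hostname" test and the system resolver; the ARP-table half is left as a comment:

    import re, socket

    IP_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")

    def normalize_srcdev(value):
        """Best-effort pass to pull SRCDEV toward one representation.

        If the value looks like a host name, try DNS and substitute the
        address; on failure (or for kernel devices like "wd0c") leave
        the string alone.  A real pre-processor would also re-parse
        anything that looked like a MAC address against an ARP table.
        """
        if IP_RE.match(value):
            return value                  # already an address
        if "." in value:                  # crude "looks like a hostname" test
            try:
                return socket.gethostbyname(value)
            except socket.error:
                pass
        return value

    # normalize_srcdev("iorek.ranum.com") -> its address, if DNS resolves it
    # normalize_srcdev("wd0c")            -> "wd0c", untouched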
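Finally, fuzzy matching. difflib's similarity ratio is my assumption here, not anything Fargo specified, but it gives the flavor:

    import difflib

    def closeness(a, b):
        """Similarity of two field values, as strings, from 0.0 to 1.0."""
        return difflib.SequenceMatcher(None, a, b).ratio()

    print(closeness("10.10.10", "10.10.10.111"))    # 0.8
    print(closeness("wd0c", "10.10.10.111"))        # much lower

A "within 5%" test would then just be closeness(a, b) >= 0.95.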
A few assumptions are hidden in the layout:

1) We recognize that XML adds considerable markup to the logs and would increase their size. It is assumed that compression is being applied to the logs, but we leave that as an exercise for the reader. (Fargo handled compression as an offline process.) Compression should address - and then some! - the text bloat caused by XML, as well as the duplication of some elements caused by tokenizing. (See the first sketch after this list.)

2) We used normalized dates (ISO 8601) - doing this almost guarantees that "original date" timestamps need to be kept in their own field in case a mapping fails. (See the second sketch after this list.)

3) For forensics/evidentiary purposes Fargo kept a complete copy of the ORIGINAL log message, untouched, in a field called RAWMSG - optionally - which is one reason why compression was considered a "must".

4) We used Snort's priority rating scheme, because it seems pretty decent.
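A sketch of the compression assumption in items 1 and 3: zlib here stands in for whatever offline compressor you'd actually use, and the log lines are fabricated:

    import zlib

    # Log text is repetitive, so compressing the (optional) RAWMSG copies
    # offline - as Fargo did - wins back the XML markup overhead and more.
    rawmsgs = "\n".join(
        "Aug 21 15:00:%02d host sshd[42]: failed login from 10.10.10.111" % i
        for i in range(60)
    ).encode()
    packed = zlib.compress(rawmsgs, 9)
    print("%d bytes -> %d bytes" % (len(rawmsgs), len(packed)))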
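And the date normalization from item 2 - a sketch assuming a small, hypothetical list of known input formats (not a list Fargo actually used); whenever the mapping fails, the original stamp survives in its own field:

    from datetime import datetime

    KNOWN_FORMATS = ["%b %d %H:%M:%S %Y", "%Y/%m/%d %H:%M:%S"]

    def normalize_date(original):
        """Return (ISO 8601 date or None, original).  The original stamp
        is always kept in its own field in case the mapping fails."""
        for fmt in KNOWN_FORMATS:
            try:
                return datetime.strptime(original, fmt).isoformat(), original
            except ValueError:
                pass
        return None, original

    print(normalize_date("Aug 21 15:00:24 2002"))
    # ('2002-08-21T15:00:24', 'Aug 21 15:00:24 2002')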
mjr.
---
Marcus J. Ranum                         http://www.ranum.com
Computer and Communications Security    mjrat_private