[logs] tokens and layouts...

From: Marcus J. Ranum (mjrat_private)
Date: Wed Aug 21 2002 - 07:19:12 PDT


    If you go to
    http://www.ranum.com/logging/logging-data-map.html
    I've posted the first version of a token glossary and format
    that Paul Robertson and I developed for the now-defunct Fargo project.
    I no longer have any examples of parsed-out records produced by
    Fargo, so it's hard to illustrate them. On the other hand, the
    data map is pretty straightforward and quite usable as is. If
    you had one logging system recording data in accordance with this
    map, you'd be able to trivially translate it to another.
    
    The approach Fargo was taking for tokenizing was to identify
    known elements from the glossary and break them out into a
    "pseudo-XML" - something XML-like enough that an XML parser
    would probably work fine on it, but simple enough to get the
    job done efficiently.
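    
    Since I can't show real Fargo output, here's a rough sketch in
    Python of what I mean. SRCDEV and RAWMSG are tokens from the data
    map, but the regex, the input line, and the LOGREC wrapper tag
    are just made up for illustration:
    
    import re
    
    # Sketch of the Fargo-style tokenizing idea: pull known glossary
    # elements out of a raw syslog-ish line and wrap them in pseudo-XML.
    # (Escaping of <, >, & is left out to keep the sketch short.)
    IP_RE = re.compile(r'\b(\d{1,3}(?:\.\d{1,3}){3})\b')
    
    def tokenize(raw_line):
        tags = []
        m = IP_RE.search(raw_line)
        if m:
            # Whatever identifies the source lands in the SRCDEV bucket
            # as a plain string -- no typing, no validation.
            tags.append('<SRCDEV>%s</SRCDEV>' % m.group(1))
        # Keep the untouched original around (optional in Fargo).
        tags.append('<RAWMSG>%s</RAWMSG>' % raw_line)
        return '<LOGREC>%s</LOGREC>' % ''.join(tags)
    
    print(tokenize('Aug 21 07:19:12 iorek sshd[123]: connect from 10.10.10.111'))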
    
    Probably the most important thing you'll notice about the
    layout is that we didn't feel it was possible to tightly
    specify everything. In fact, we concluded that it's a BAD IDEA
    to tightly specify everything. So we came up with buckets into
    which a variety of things can be stored. Take for example
    SRCDEV - source device identifier: it might be any of a
    host name, an IP address, a MAC address, or even a physical
    device in kernel space ("wd0c") - but it's useful because
    then you can still correlate on SRCDEV and sort/search without
    having to know the specific type of data it happens to be.
    One important side-effect of this design decision is that
    the fields are UN-TYPED, so the parser usually treats
    everything as a string and nothing more. That is also valuable
    because now you can lexically sort SRCDEV: "wd0c" will come
    out at the bottom and all the IP addresses will cluster by
    network range. Treating everything as strings has some very
    good properties in that regard. Though if in one place you log
    SRCDEV=10.10.10.111
    and in another
    SRCDEV=iorek.ranum.com
    you have the same value in there with two different representations.
    The only conclusion we came to there was that if you cared, you
    could write a pre-processor that walked SRCDEV and tried to
    re-parse anything that looked like a MAC address against an
    ARP table, or a host name against a DNS lookup. The intent here
    was to get the data as close as possible to the correct "bucket"
    and let people who want to post-process or pre-process it even
    more thoroughly do so.
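    
    Something along these lines is all I mean - the function name is
    made up, and the ARP table is one you'd have collected yourself:
    
    import re
    import socket
    
    MAC_RE = re.compile(r'^(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}$')
    IP_RE  = re.compile(r'^\d{1,3}(?:\.\d{1,3}){3}$')
    
    def normalize_srcdev(value, arp_table):
        # Try to re-parse a SRCDEV string into an IP address.  arp_table
        # is a {mac: ip} dict; host names go through an ordinary DNS
        # lookup.  Anything we can't map stays exactly as it was.
        if IP_RE.match(value):
            return value                          # already an IP
        if MAC_RE.match(value):
            return arp_table.get(value.lower(), value)
        try:
            return socket.gethostbyname(value)    # looks like a host name
        except socket.error:
            return value                          # "wd0c" etc. stays put
    
    Run that over the SRCDEV column before correlating and the two
    representations above collapse into one.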
    
    By bucketing stuff loosely you can do fun queries like
    search where SRCDEV = TARGDEV
    and it'll do the "right thing" whether the sources are hard
    disks or IP addresses - and they won't tend to "jump across"
    types, since the format for disk devices is usually not
    lexically close to an IP address: "wd0" != "10.10.10.111".
    We also figured this would be very useful and fun for
    close-matching/fuzzy-matching routines - try to see if
    "10.10.10" is within 5% of "10.10.10.111", etc.
    
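    For instance, plain string similarity gets you surprisingly far
    (difflib here, but any edit-distance routine would do, and the
    cutoff is whatever you like):
    
    from difflib import SequenceMatcher
    
    def similarity(a, b):
        # Similarity of two un-typed field values as plain strings;
        # 1.0 means identical.
        return SequenceMatcher(None, a, b).ratio()
    
    print(similarity("10.10.10", "10.10.10.111"))   # ~0.80 -- same neighborhood
    print(similarity("wd0",      "10.10.10.111"))   # ~0.13 -- clearly unrelated
    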
    A few assumptions that are hidden in the layout:
    1) We recognize that XML adds considerable markup to the logs and
    	would increase their size. It is assumed that compression is
    	being applied to the logs, but we leave that as an exercise
    	for the reader. (Fargo handled compression as an offline
    	process.) Compression should address (and then some!) the
    	text bloat caused by XML as well as the duplication of
    	some elements caused by tokenizing.
    2) We used normalized dates (ISO 8601) - doing this almost
    	guarantees that "original date" timestamps need to be
    	kept in their own field in case a mapping fails (there's
    	a sketch of this right after the list).
    3) For forensics/evidentiary purposes Fargo optionally kept a
    	complete copy of the ORIGINAL log message, untouched, in a
    	field called RAWMSG, which is one reason why compression
    	was considered a "must".
    4) We used Snort's priority rating scheme because it seems pretty
    	decent.
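    
    To illustrate point 2 - the DATE/ORIGDATE field names here are
    made up for the sketch, not part of the data map:
    
    from datetime import datetime
    
    def normalize_date(original, year=2002):
        # Map a syslog-style date to ISO 8601.  Syslog dates
        # ("Aug 21 07:19:12") carry no year or time zone, which is
        # exactly why the untouched original has to be kept in its
        # own field in case the mapping is wrong or fails outright.
        try:
            d = datetime.strptime(original, '%b %d %H:%M:%S').replace(year=year)
            return {'DATE': d.strftime('%Y-%m-%dT%H:%M:%S'), 'ORIGDATE': original}
        except ValueError:
            return {'ORIGDATE': original}   # mapping failed; keep what we got
    
    # normalize_date('Aug 21 07:19:12')
    #   -> {'DATE': '2002-08-21T07:19:12', 'ORIGDATE': 'Aug 21 07:19:12'}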
    
    
    mjr.
    ---
    Marcus J. Ranum				http://www.ranum.com
    Computer and Communications Security	mjrat_private
    
    _______________________________________________
    LogAnalysis mailing list
    LogAnalysisat_private
    https://lists.shmoo.com/mailman/listinfo/loganalysis
    


