Re: Re[2]: [logs] Logging: World Domination

From: Kyle R. Hofmann (krhat_private)
Date: Fri Aug 23 2002 - 16:29:08 PDT

    On Fri, 23 Aug 2002 00:29:22 -0700, Chris Adams wrote:
    > On Thursday, August 22, 2002, at 09:39 , Kyle R. Hofmann wrote:
    > >> - support for hierarchical data
    > >
    > > But the second one isn't.  We need a distinction between data that are
    > > required for every message, like timestamps, and data that are specific 
    > > to the type of message.  The ability to nest is unimportant, and that is 
    > > why XML is overkill.
    > 
    > Nesting is definitely less of a priority but I think it would be useful 
    > for having standard tags which could be used in different events to 
    > refer to things like IP packets or login requests:
    
    [ ... ]
    
    > The idea is just that it would be nice for analysis packages to be able 
    > to recognize "standard" chunks and use them for basic reporting even if 
    > they don't have domain-specific knowledge about the event. A monitor 
    > could tell which services a given user is trying to access even if it 
    > doesn't know any of the service-specific details.
    
    I agree, that would be nice.  But it occurred to me last night that we've
    all really been unclear on one thing, which is where we use our schemes.
    Let me go on a long digression while I explain what I think we're trying
    to get to.
    
    A log message is generated when a program reaches a statement like
    
    	syslog(LOG_INFO, "Hello, world!");
    
    The log message is sent to the log socket /dev/log.  syslogd (or its
    replacement; I'll just say syslogd) reads a message from /dev/log or a network
    socket.  It determines the facility and priority of the message and then
    forwards it to its destination, such as a remote syslog server, a file on the
    local host, or a database.
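
    As a minimal sketch of the "determines the facility and priority" step,
    assume the message arrives with the traditional "<PRI>" prefix, where
    PRI = facility * 8 + severity.  The helper name parse_pri is made up for
    illustration and is not taken from any real syslogd.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split a leading "<PRI>" into facility and severity.
 * Returns a pointer to the text after the prefix, or NULL if the message
 * does not start with a well-formed prefix. */
const char *parse_pri(const char *msg, int *facility, int *severity)
{
    int pri = 0;
    int ndigits = 0;
    const char *p;

    assert(msg != NULL && facility != NULL && severity != NULL);

    if (msg[0] != '<')
        return NULL;

    /* Read at most three decimal digits. */
    p = msg + 1;
    while (*p >= '0' && *p <= '9' && ndigits < 3) {
        pri = pri * 10 + (*p - '0');
        p++;
        ndigits++;
    }

    /* 191 = 23 * 8 + 7, the largest value with a facility of 0..23. */
    if (ndigits == 0 || *p != '>' || pri > 191)
        return NULL;

    *facility = pri / 8;    /* upper bits: which subsystem logged it */
    *severity = pri % 8;    /* lower three bits: how urgent it is */
    return p + 1;
}
```

    With the usual numbering, a LOG_INFO message logged under the "user"
    facility carries the prefix <14>, which this helper splits into
    facility 1 and severity 6.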
    
    This gives us eight interfaces:
    
                                                     write(2) +--------+
    +-------+ syslog(3) +--------+ read(2)             .----> |/var/log| \    +----+
    |Program| --------> |/dev/log| ----.              /       +--------+  `-->|Post|
    +-------+           +--------+      \> +-------+ / (varies) +--------+  .>+----+
                                           |syslogd| ---------> |database| /
             +--------------+           /> +-------+ \          +--------+
             |network socket| ---------'              \        +--------------+
             +--------------+ recvfrom(2)              `-----> |network socket|
                                                     sendto(2) +--------------+
    
    (where Post is a post processor; I ran out of horizontal room)
    
    The first interface is between the program and /dev/log.  This is a library
    call.  Since we're proposing to replace today's free-form log messages with
    something more structured, either this call will have to be replaced with
    one that provides more structure, or we will have to rely on programmers to
    use the right structure every time.  As I see it, our options are:
    
    0) Leave syslog(3) unchanged and give up.
    
    1) Leave syslog(3) unchanged and add a reformatter between /dev/log and
       syslogd.
    
    2) Leave syslog(3) unchanged and trust the programmer to use only "new-style"
       log messages.
    
    3) Replace syslog(3), but do not significantly change the interface; trust the
       programmer to use only "new-style" log messages on the new library call.
    
    4) Replace syslog(3) with
    
    int newsyslog(struct logmsg *logmsg, void *logdata);
    
       where struct logmsg has fields like facility, priority, and a type for
       logdata.  logmsg->type would determine what sort of struct * to cast
       logdata to.  This ensures that the data will contain exactly the structures
       that we want it to, but is inextensible: Adding a new format requires
       recompiling the syslog daemon and has portability problems.
    
    5) Replace syslog(3) with
    
    int newsyslog(unsigned int logfieldcount, struct field *logfields,
                  unsigned int varfieldcount, struct field *varfields);
    
       where logfields and varfields are arrays of structures and logfieldcount
       and varfieldcount give the number of fields in logfields and
       varfields.  struct field would be
    
    struct field {
    	unsigned char *name;
    	unsigned char *value;
    	unsigned int name_length;
    	unsigned int value_length;
    };
    
       This does not ensure that we have all the data we need or want, but is
       very extensible.  (Note that doing without struct field is a bad idea.  We
       can't safely assume that there will never be any \0's in name and value.)
    
    6) Replace syslog(3) with
    
    int newsyslog(xml_t xml_logmsg);
    
       where xml_logmsg is some representation of a log message as XML.  This
       requires developing a set of XML library routines to manipulate log
       messages and using them in every program that wants to log.
    
    Of these, my favorite is (5), though I think the names logfieldcount and
    varfieldcount are too long.  (1) is a good stopgap.
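
    A rough sketch of how a caller might use option (5).  The newsyslog body
    here is only a stand-in that checks its arguments; text_field,
    log_login_failure, and the field names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct field {
    unsigned char *name;
    unsigned char *value;
    unsigned int name_length;
    unsigned int value_length;
};

/* Hypothetical convenience helper: build a field from two C strings.
 * Binary values would fill in the pointers and lengths by hand. */
static struct field text_field(const char *name, const char *value)
{
    struct field f;

    assert(name != NULL && value != NULL);
    f.name = (unsigned char *)name;
    f.value = (unsigned char *)value;
    f.name_length = (unsigned int)strlen(name);
    f.value_length = (unsigned int)strlen(value);
    return f;
}

/* Stand-in for the proposed call.  A real one would hand the fields to the
 * log daemon; this one only validates them and returns the number of
 * fields it would send, or -1 on a malformed call. */
int newsyslog(unsigned int logfieldcount, struct field *logfields,
              unsigned int varfieldcount, struct field *varfields)
{
    if ((logfieldcount > 0 && logfields == NULL) ||
        (varfieldcount > 0 && varfields == NULL))
        return -1;

    for (unsigned int i = 0; i < logfieldcount; i++)
        if (logfields[i].name == NULL || logfields[i].name_length == 0)
            return -1;
    for (unsigned int i = 0; i < varfieldcount; i++)
        if (varfields[i].name == NULL || varfields[i].name_length == 0)
            return -1;

    return (int)(logfieldcount + varfieldcount);
}

/* Example caller: fields every message carries go in the first array,
 * fields specific to this kind of message go in the second. */
int log_login_failure(const char *user, const char *ip)
{
    struct field common[] = {
        text_field("facility", "auth"),
        text_field("priority", "warning"),
    };
    struct field specific[] = {
        text_field("event", "login_failure"),
        text_field("user", user),
        text_field("src_ip", ip),
    };

    return newsyslog(2, common, 3, specific);
}
```

    Keeping the two arrays separate is what preserves the distinction
    between data required for every message and data specific to one type.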
    
    The second interface is between /dev/log and syslogd.  syslogd needs to read
    the message, parse the priority and facility, and forward it, so consequently
    syslogd needs to be able to read whatever format is used to write to /dev/log.
    Clearly parsing is easiest if every message is in a regular format, and in
    the long run we'd prefer to use that format directly instead of passing
    through an intermediate format.  Thus it seems to me that syslogd should read
    exactly one format, a textual representation of (4), (5), or (6) above, and
    syslogd should depend upon a reformatter for messages not in that format.
    It is possible for syslogd to depend on a reformatter at its output end, but
    it is possible that there will be no reformatter or that the protocol used
    for output makes use of a reformatter very hard (such as when writing to a
    database).
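
    One possible textual representation of (5), sketched under assumptions:
    each field is written as name="value", a backslash protects '"' and '\',
    and any byte outside printable ASCII (including \0) is written as \xHH.
    The escaping rules and function names are illustrative choices, not an
    agreed format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append one byte, always keeping room for the terminating NUL. */
static int put_byte(char *out, size_t outsize, size_t *pos, char c)
{
    if (*pos + 1 >= outsize)
        return -1;
    out[(*pos)++] = c;
    return 0;
}

/* Append len bytes of s.  '"' and '\\' get a backslash; bytes outside
 * printable ASCII become \xHH.  In names, ' ' and '=' are also written
 * as \xHH so that they cannot be mistaken for separators. */
static int put_escaped(char *out, size_t outsize, size_t *pos,
                       const unsigned char *s, unsigned int len, int in_name)
{
    static const char hex[] = "0123456789abcdef";

    for (unsigned int i = 0; i < len; i++) {
        unsigned char c = s[i];
        int as_hex = c < 0x20 || c > 0x7e ||
                     (in_name && (c == ' ' || c == '='));

        if (as_hex) {
            if (put_byte(out, outsize, pos, '\\') ||
                put_byte(out, outsize, pos, 'x') ||
                put_byte(out, outsize, pos, hex[c >> 4]) ||
                put_byte(out, outsize, pos, hex[c & 0x0f]))
                return -1;
        } else {
            if ((c == '"' || c == '\\') &&
                put_byte(out, outsize, pos, '\\'))
                return -1;
            if (put_byte(out, outsize, pos, (char)c))
                return -1;
        }
    }
    return 0;
}

/* Write name="value" into out.  Returns the length written (not counting
 * the NUL), or -1 if out is too small. */
int format_field(char *out, size_t outsize,
                 const unsigned char *name, unsigned int name_length,
                 const unsigned char *value, unsigned int value_length)
{
    size_t pos = 0;

    assert(out != NULL && name != NULL && value != NULL);

    if (put_escaped(out, outsize, &pos, name, name_length, 1) ||
        put_byte(out, outsize, &pos, '=') ||
        put_byte(out, outsize, &pos, '"') ||
        put_escaped(out, outsize, &pos, value, value_length, 0) ||
        put_byte(out, outsize, &pos, '"'))
        return -1;

    out[pos] = '\0';
    return (int)pos;
}

/* Convenience wrapper for ordinary C strings. */
int format_text_field(char *out, size_t outsize,
                      const char *name, const char *value)
{
    return format_field(out, outsize,
                        (const unsigned char *)name,
                        (unsigned int)strlen(name),
                        (const unsigned char *)value,
                        (unsigned int)strlen(value));
}
```

    Because lengths are passed explicitly, a value containing \0 survives
    the trip as the four characters \x00 instead of truncating the message.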
    
    The third interface is between a remote host sending a message and syslogd.
    This should obviously be TCP instead of the traditional UDP.  To minimize the
    amount of code, this should use the same format as syslogd does for /dev/log,
    and syslogd should parse it in the same way.  There is the possibility that
    the remote host is using an old syslog daemon which is unaware of the new
    format.  In this case, the simplest solution is for the new syslogd to forward
    the message to its local reformatter and parse the reformatted message.
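
    A deliberately naive sketch of that fallback, assuming a new-style
    message always begins with an identifier followed by =" and that an
    old-style message can simply be wrapped in a single msg field.  Both
    the detection heuristic and the field name msg are assumptions; a real
    reformatter would need to know far more about individual programs'
    messages.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Heuristic: does the message start with  identifier="  ? */
int looks_new_style(const char *msg)
{
    const char *p = msg;

    assert(msg != NULL);
    if (!isalpha((unsigned char)*p) && *p != '_')
        return 0;
    while (isalnum((unsigned char)*p) || *p == '_' || *p == '.')
        p++;
    return p[0] == '=' && p[1] == '"';
}

/* Copy a new-style message unchanged; wrap anything else as msg="...".
 * Returns the output length, or -1 if out is too small. */
int to_new_style(const char *msg, char *out, size_t outsize)
{
    size_t len;
    size_t pos;

    assert(msg != NULL && out != NULL);
    len = strlen(msg);

    if (looks_new_style(msg)) {
        if (len + 1 > outsize)
            return -1;
        memcpy(out, msg, len + 1);
        return (int)len;
    }

    /* Worst case: msg=" (5 bytes), every byte escaped, closing quote, NUL. */
    if (5 + 2 * len + 2 > outsize)
        return -1;

    memcpy(out, "msg=\"", 5);
    pos = 5;
    for (size_t i = 0; i < len; i++) {
        if (msg[i] == '"' || msg[i] == '\\')
            out[pos++] = '\\';
        out[pos++] = msg[i];
    }
    out[pos++] = '"';
    out[pos] = '\0';
    return (int)pos;
}
```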
    
    The fourth interface is between syslogd and a local flat file.  The simplest
    solution is for syslogd to write the same textual representation that it uses
    for reading from /dev/log and from network sockets.  This has the advantage
    of making it very easy to later reread all the local flat files, pipe them
    through syslogd again, and enter them all into a database.
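
    Rereading such a file could look roughly like the following, assuming
    one message per line, fields separated by spaces, and only \" and \\ as
    escapes inside values.  The fixed buffer sizes and all names are
    arbitrary choices made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FIELD_NAME_MAX  32
#define FIELD_VALUE_MAX 128

struct parsed_field {
    char name[FIELD_NAME_MAX];
    char value[FIELD_VALUE_MAX];
};

/* Parse a line of name="value" pairs into out.  Returns the number of
 * fields found, or -1 on a syntax error or if something does not fit. */
int parse_line(const char *line, struct parsed_field *out, int max_fields)
{
    const char *p = line;
    int count = 0;

    assert(line != NULL && out != NULL);

    for (;;) {
        size_t n = 0;
        size_t v = 0;

        while (*p == ' ')
            p++;
        if (*p == '\0' || *p == '\n')
            return count;
        if (count == max_fields)
            return -1;

        /* Name: everything up to the '='. */
        while (*p != '\0' && *p != '=' && *p != ' ') {
            if (n + 1 >= FIELD_NAME_MAX)
                return -1;
            out[count].name[n++] = *p++;
        }
        out[count].name[n] = '\0';
        if (n == 0 || p[0] != '=' || p[1] != '"')
            return -1;
        p += 2;

        /* Value: everything up to the closing unescaped quote. */
        while (*p != '"') {
            if (*p == '\\' && (p[1] == '"' || p[1] == '\\'))
                p++;                      /* drop the escaping backslash */
            if (*p == '\0' || v + 1 >= FIELD_VALUE_MAX)
                return -1;
            out[count].value[v++] = *p++;
        }
        out[count].value[v] = '\0';
        p++;                              /* skip the closing quote */
        count++;
    }
}

/* Look up a field by name; returns NULL if it is absent. */
const char *field_value(const struct parsed_field *fields, int count,
                        const char *name)
{
    for (int i = 0; i < count; i++)
        if (strcmp(fields[i].name, name) == 0)
            return fields[i].value;
    return NULL;
}
```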
    
    The fifth interface is between syslogd and a database of some sort.  The
    database probably imposes more restrictions on the format of the stored data
    than syslogd does since we haven't settled on a standard log format.
    Consequently the type of database syslogd should be able to log to imposes
    at least one of:
    
    1) syslogd can only accept data formatted in a manner that corresponds to
       a format that the database stores easily,
    2) syslogd must heavily reformat data that does not correspond to a format
       that the database stores easily, or
    3) the data stored by syslogd might not be easily searched and indexed by
       the database.
    
    Since a database is designed to aid in searching and indexing, we never want
    compromise (3).  That leaves (1) and (2), and (1) is clearly far faster and
    has much less risk of accidental data corruption or loss.
    
    Thus if we decide that we want to use a database like, say, Oracle, we've
    limited our choices for the first interface to (4), (5), or a reformatted and
    restricted (6).  (4) has the advantage of being so structured that syslogd's
    work is easy, but it is still inextensible.  (5) introduces the possibility
    that the programmer may specify a field which does not correspond to any
    column in the database.  (6) has the same disadvantage as (5) and still
    requires an XML library in every logging program.
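
    The risk with (5) can be made concrete with a small check.  The column
    list below is a made-up table layout, and what to do with an unmapped
    field (reject the message, drop the field, or spill it into a catch-all
    column) is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical columns of a log table; the names are illustrative only. */
static const char *const known_columns[] = {
    "timestamp", "host", "facility", "priority", "program", "message"
};

/* Return the index of the column matching a field name, or -1 if the
 * field has no column.  The name need not be NUL-terminated. */
int column_index(const char *name, size_t name_length)
{
    size_t ncols = sizeof known_columns / sizeof known_columns[0];

    assert(name != NULL);
    for (size_t i = 0; i < ncols; i++) {
        if (strlen(known_columns[i]) == name_length &&
            memcmp(known_columns[i], name, name_length) == 0)
            return (int)i;
    }
    return -1;
}

/* Count how many of the given field names have no matching column. */
int count_unmapped(const char *const *names, size_t count)
{
    int unmapped = 0;

    assert(names != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        if (column_index(names[i], strlen(names[i])) < 0)
            unmapped++;
    return unmapped;
}
```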
    
    If we decide, on the other hand, to depend on XML entirely and to store
    everything as XML using XML-specific tools, we've limited our choices to a
    reformatted (4), a reformatted (5), or (6).  In this case (6) is by far the
    best option.
    
    The sixth interface is between syslogd and a remote syslog host.  If syslogd
    speaks the new format, and that format is unintelligible to the remote host,
    our logs might get dropped on the floor.  Thus syslogd requires one of:
    
    0) Incompatibility,
    1) Careful configuration and the ability to speak a little bit of the old
       format,
    2) A new format which is backwards compatible with the old, or
    3) A new format which is always sent and received over the network in a
       format which is backwards compatible with the old, but which is different
       the rest of the time.
    
    (2) feels like it might be needlessly limiting, and (3) is ugly.  I like (1).
    
    The seventh and eighth interfaces are from flat files and databases to a post
    processor, respectively.  For a flat file, this is the format of the messages
    on disk, and for a database, this is the structure of the database.  Making
    these fast and easy to parse is not, I think, a problem for field="value" or
    for XML.
    
    Let me pop off the stack and return to what I first said.  So, where exactly
    do you want to use XML, and where exactly do I want to use field="value"?  I
    want to use interface (5) for syslog(3), field="value" for all the other
    interfaces mentioned above, and Oracle or something like it as a database.
    I have the impression that you want to use one of interfaces (2), (3), or (6)
    for syslog(3), XML for all the other interfaces mentioned above, and something
    XML-specific as a database.  Is this correct?
    
    > Another use for nesting is basically a namespace issue. I'd prefer to 
    > have, say, a standard IDS message and all of the vendor-specific stuff 
    > in a <$vendor> subtag rather than a bunch of vendor-prefixed tag names.
    
    So would I, but that's what standards like IDMEF are for.  It is impossible
    to force every vendor to report what you want the way you want it; the most
    you can do is ask, and if they don't agree, you'll have to reformat their log
    messages.  And furthermore this is true of any formatting scheme, including
    field="value".
    
    This is an area where XML seems to have an advantage because of the DTD, but
    I'm not so sure that it's as much of an advantage as it might look like,
    because to allow truly flexible logging, you must let the vendor define his
    own DTD.  This is necessary, in fact, if you're implementing a new protocol
    or service of some sort.  And that lets the vendor get away with the same
    vendor-prefixed tags that you'd prefer he not have.
    
    Moreover, from a formal perspective, the right way to let vendors have their
    own DTDs is to nest them within messages, for example, <!DOCTYPE LOGMSG> ...
    <!DOCTYPE ESMTP> ...  </DOCTYPE> ... </DOCTYPE>.  But [XML], sections 2.1 and
    2.8, has:
    
    [1]     document  ::=    prolog element Misc*
    [22]    prolog    ::=    XMLDecl? Misc* (doctypedecl Misc*)?
    
    which implies that you'd have to make the nested documents some sort of
    literal data and start up new instances of your XML parser for each nesting.
    The easiest way to do this seems to be to start a <![CDATA[ section each time.
    If you have multiple nested CDATA sections, the terminating ]]> of each inner
    CDATA section would have to be quoted, first as ]]&gt;, then as ]]&amp;gt;,
    then as ]]&amp;amp;gt;, and so on, which is ugly.  I'm not an XML expert,
    though, so hopefully there's a better way.
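
    One commonly used alternative, offered as a sketch and not as the way
    any particular tool does it: a parser does not expand entity references
    inside a CDATA section ([XML], 2.7), so each embedded ]]> is usually
    split by closing the section after "]]" and reopening it before ">".
    The function name cdata_wrap is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append a NUL-terminated string to out, keeping out NUL-terminated. */
static int append(char *out, size_t outsize, size_t *pos, const char *s)
{
    size_t n = strlen(s);

    if (*pos + n + 1 > outsize)
        return -1;
    memcpy(out + *pos, s, n + 1);     /* copies the NUL as well */
    *pos += n;
    return 0;
}

/* Wrap in as a CDATA section, splitting every embedded "]]>" so that it
 * cannot terminate the section early.  Returns the output length, or -1
 * if out is too small. */
int cdata_wrap(const char *in, char *out, size_t outsize)
{
    static const char cdata_open[]  = "<![CDATA[";
    static const char cdata_close[] = "]]>";
    static const char cdata_split[] = "]]]]><![CDATA[>";
    size_t pos = 0;

    assert(in != NULL && out != NULL);

    if (append(out, outsize, &pos, cdata_open) != 0)
        return -1;

    while (*in != '\0') {
        if (strncmp(in, cdata_close, 3) == 0) {
            /* "]]" ends one section, ">" starts the next one. */
            if (append(out, outsize, &pos, cdata_split) != 0)
                return -1;
            in += 3;
        } else {
            char one[2] = { *in, '\0' };

            if (append(out, outsize, &pos, one) != 0)
                return -1;
            in++;
        }
    }

    if (append(out, outsize, &pos, cdata_close) != 0)
        return -1;
    return (int)pos;
}
```

    Wrapping an already wrapped document applies the same rule again, so
    the quoting does not change form with depth, although the output still
    grows with every level.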
    
    > The other area where nesting feels more natural is dumping more complex 
    > data - things like RPC calls or SSL negotiation:
    
    Yes, and that brings up a really good question: How many log messages are
    complex enough to make XML useful, and how many are not?  Something like an
    NTP time reset is too simple to need XML, but an SSL negotiation is complex.
    Especially pertinent, I think, is the fact that the SSL negotiation may
    involve an arbitrarily long chain of certificates, which XML would handle
    easily, while field="value" would not.
    
    I'd be willing to admit that field="value" is the wrong choice if there are
    a lot of possibly useful log messages that are too complex for it to handle.
    Unfortunately, I don't think that's likely.  I like the idea of being able
    to use standard databases, and I'm wary of the ability of XML to handle huge
    amounts of data efficiently, especially for post-processing.
    
    > > And furthermore, we'd prefer to avoid "Message delivered
    > > successfully" because that's a freeform string, so ideally all the tags
    > > would be empty.
    > 
    > Presumably we'd define a DTD which would make any tags which could be 
    > empty on successful transactions optional.
    
    Not optional, because then successful transactions would never be recorded,
    which we might not want.  Better would be to make them empty, e.g., <TAG/>
    ([XML], 3.1).
    
    -- 
    Kyle R. Hofmann <krhat_private>