Re: Re[2]: [logs] Logging: World Domination

krhat_private

On Fri, 23 Aug 2002 00:29:22 -0700, Chris Adams wrote:
> On Thursday, August 22, 2002, at 09:39 , Kyle R. Hofmann wrote:
> >> - support for hierarchal data
> >
> > But the second one isn't.  We need a distinction between data that are
> > required for every message, like timestamps, and data that are specific 
> > to the type of message.  The ability to nest is unimportant, and that is 
> > why XML is overkill.
> 
> Nesting is definitely less of a priority but I think it would be useful 
> for having standard tags which could be used in different events to 
> refer to things like IP packets or login requests:

[ ... ]

> The idea is just that it would be nice for analysis packages to be able 
> to recognize "standard" chunks and use them for basic reporting even if 
> they don't have domain-specific knowledge about the event. A monitor 
> could tell which services a given user is trying to access even if it 
> doesn't know any of the service-specific details.

I agree, that would be nice.  But it occurred to me last night that we've
all really been unclear on one thing, which is where we use our schemes.
Let me go on a long digression while I explain what I think we're trying
to get to.

A log message is generated when a program reaches a statement like

	syslog(LOG_INFO, "Hello, world!");

The log message is sent to the log socket /dev/log.  syslogd (or its
replacement; I'll just say syslogd) reads a message from /dev/log or a network
socket.  It determines the facility and priority of the message and then
forwards it to its destination, such as a remote syslog server, a file on the
local host, or a database.

This gives us eight interfaces:

                                                 write(2) +--------+
+-------+ syslog(3) +--------+ read(2)             .----> |/var/log| \    +----+
|Program| --------> |/dev/log| ----.              /       +--------+  `-->|Post|
+-------+           +--------+      \> +-------+ / (varies) +--------+  .>+----+
                                       |syslogd| ---------> |database| /
         +--------------+           /> +-------+ \          +--------+
         |network socket| ---------'              \        +--------------+
         +--------------+ recvfrom(2)              `-----> |network socket|
                                                 sendto(2) +--------------+

(where Post is a post processor; I ran out of horizontal room)

The first interface is between the program and /dev/log.  This is a library
call.  Since we're proposing to change the standard format for log messages,
either this call will have to be replaced with one that provides more
structure, or we will have to rely on programmers to use the right structure
every time.  As I see it, our options are:

0) Leave syslog(3) unchanged and give up.

1) Leave syslog(3) unchanged and add a reformatter between /dev/log and
   syslogd.

2) Leave syslog(3) unchanged and trust the programmer to use only "new-style"
   log messages.

3) Replace syslog(3), but do not significantly change the interface; trust the
   programmer to use only "new-style" log messages on the new library call.

4) Replace syslog(3) with

int newsyslog(struct logmsg *logmsg, void *logdata);

   where struct logmsg has fields like facility, priority, and a type for
   logdata.  logmsg->type would determine what sort of struct * to cast
   logdata to.  This ensures that the data will contain exactly the structures
   that we want it to, but is inextensible: Adding a new format requires
   recompiling the syslog daemon and has portability problems.

5) Replace syslog(3) with

int newsyslog(unsigned int logfieldcount, struct field *logfields,
              unsigned int varfieldcount, struct field *varfields);

   where logfields and varfields are arrays of structures and logfieldcount
   and varfieldcount give the number of fields in logfields and varfields.
   struct field would be

struct field {
	unsigned char *name;
	unsigned char *value;
	unsigned int name_length;
	unsigned int value_length;
};

   This does not ensure that we have all the data we need or want, but is
   very extensible.  (Note that doing without struct field is a bad idea.  We
   can't safely assume that there will never be any \0's in name and value.)

6) Replace syslog(3) with

int newsyslog(xml_t xml_logmsg);

   where xml_logmsg is some representation of a log message as XML.  This
   requires developing a set of XML library routines to manipulate log
   messages and using them in every program that wants to log.

Of these, my favorite is (5), though I think the names logfieldcount and
varfieldcount are too long.  (1) is a good stopgap.
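
A minimal sketch of what a library behind option (5) might do with its
arguments: flatten an array of struct field into one line of field="value"
text.  The names format_fields and put_escaped, and the choice to write
awkward bytes as \xHH, are assumptions made for illustration; nothing in this
thread fixes a wire format.  The explicit lengths are what let an embedded \0
survive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* struct field as proposed in option (5). */
struct field {
	unsigned char *name;
	unsigned char *value;
	unsigned int name_length;
	unsigned int value_length;
};

/* Hypothetical helper: copy len bytes of s into out at offset pos, writing
 * '"', '\\' and non-printable bytes (including \0) as \xHH.  In a name,
 * ' ' and '=' are escaped too.  Returns the new offset, or -1 if out is
 * too small. */
static long put_escaped(char *out, size_t outsz, long pos,
                        const unsigned char *s, unsigned int len, int in_name)
{
	unsigned int i;

	for (i = 0; i < len; i++) {
		unsigned char c = s[i];
		int special = c == '"' || c == '\\' || c < 0x20 || c > 0x7e ||
		              (in_name && (c == ' ' || c == '='));

		if (special) {
			if ((size_t)pos + 5 > outsz)
				return -1;
			snprintf(out + pos, 5, "\\x%02x", (unsigned int)c);
			pos += 4;
		} else {
			if ((size_t)pos + 2 > outsz)
				return -1;
			out[pos++] = (char)c;
		}
	}
	return pos;
}

/* Hypothetical: render count fields as  name="value" name="value" ...
 * Returns the length of the text, or -1 if it does not fit in out. */
long format_fields(char *out, size_t outsz,
                   unsigned int count, const struct field *fields)
{
	long pos = 0;
	unsigned int i;

	assert(out != NULL);
	if (outsz == 0)
		return -1;
	for (i = 0; i < count; i++) {
		if (i > 0) {
			if ((size_t)pos + 2 > outsz)
				return -1;
			out[pos++] = ' ';
		}
		pos = put_escaped(out, outsz, pos, fields[i].name,
		                  fields[i].name_length, 1);
		if (pos < 0 || (size_t)pos + 3 > outsz)
			return -1;
		out[pos++] = '=';
		out[pos++] = '"';
		pos = put_escaped(out, outsz, pos, fields[i].value,
		                  fields[i].value_length, 0);
		if (pos < 0 || (size_t)pos + 2 > outsz)
			return -1;
		out[pos++] = '"';
	}
	out[pos] = '\0';
	return pos;
}
```

A real newsyslog() would then write the resulting line to /dev/log; that part
is left out here because it depends on the option chosen for the second
interface.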

The second interface is between /dev/log and syslogd.  syslogd needs to read
the message, parse the priority and facility, and forward it, so syslogd
needs to be able to read whatever format is used to write to /dev/log.
Clearly parsing is easiest if every message is in a regular format, and in
the long run we'd prefer to use that format directly instead of passing
through an intermediate format.  Thus it seems to me that syslogd should read
exactly one format, a textual representation of (4), (5), or (6) above, and
syslogd should depend upon a reformatter for messages not in that format.
It is possible for syslogd to depend on a reformatter at its output end
instead, but there may be no reformatter available, or the protocol used for
output may make a reformatter very hard to use (such as when writing to a
database).
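
To show how little parsing a single regular format needs, here is a sketch of
a reader for one possible textual form: space-separated name="value" pairs in
which awkward bytes are written as \xHH.  That encoding, and the names
next_field, read_until, and hexval, are assumptions for illustration only;
the thread has not settled on a format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval(int c)
{
	if (c >= '0' && c <= '9') return c - '0';
	if (c >= 'a' && c <= 'f') return c - 'a' + 10;
	if (c >= 'A' && c <= 'F') return c - 'A' + 10;
	return -1;
}

/* Decode bytes up to (not including) the stop character, undoing \xHH
 * escapes.  On success returns 0 and leaves *pp on the stop character. */
static int read_until(const char **pp, char stop,
                      unsigned char *out, size_t outsz, size_t *len)
{
	const char *p = *pp;
	size_t n = 0;

	while (*p != stop) {
		int c;

		if (*p == '\0')
			return -1;              /* ran off the end of the line */
		if (*p == '\\') {
			int hi, lo;

			if (p[1] != 'x')
				return -1;
			if ((hi = hexval(p[2])) < 0)
				return -1;
			if ((lo = hexval(p[3])) < 0)
				return -1;
			c = hi * 16 + lo;
			p += 4;
		} else {
			c = (unsigned char)*p++;
		}
		if (n >= outsz)
			return -1;              /* caller's buffer is too small */
		out[n++] = (unsigned char)c;
	}
	*len = n;
	*pp = p;
	return 0;
}

/* Hypothetical: read the next name="value" pair starting at *cursor.
 * Returns 1 if a field was read, 0 at end of input, -1 if malformed. */
int next_field(const char **cursor,
               unsigned char *name, size_t namesz, size_t *name_len,
               unsigned char *value, size_t valuesz, size_t *value_len)
{
	const char *p;

	assert(cursor != NULL && *cursor != NULL);
	p = *cursor;
	while (*p == ' ')
		p++;
	if (*p == '\0') {
		*cursor = p;
		return 0;
	}
	if (read_until(&p, '=', name, namesz, name_len) < 0 || *name_len == 0)
		return -1;
	if (p[1] != '"')
		return -1;
	p += 2;                                 /* skip the =" */
	if (read_until(&p, '"', value, valuesz, value_len) < 0)
		return -1;
	*cursor = p + 1;                        /* skip the closing quote */
	return 1;
}
```

Lengths are returned separately from the buffers so that a decoded \0 inside
a name or value is not mistaken for the end of the data.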

The third interface is between a remote host sending a message and syslogd.
This should obviously be TCP instead of the traditional UDP.  To minimize the
amount of code, this should use the same format as syslogd does for /dev/log,
and syslogd should parse it in the same way.  There is the possibility that
the remote host is using an old syslog daemon which is unaware of the new
format.  In this case, the simplest solution is for the new syslogd to forward
the message to its local reformatter and parse the reformatted message.
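
As a sketch of the first step such a reformatter would take: a traditional
BSD syslog message starts with "<PRI>", where PRI is facility * 8 + priority.
The function name split_legacy is hypothetical, and turning the free-form
remainder into structured fields is the hard part that this sketch does not
attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reformatter step: split an old-style "<PRI>text" line into
 * facility, priority, and a pointer to the free-form text that follows.
 * Returns 0 on success, -1 if the line does not start with a valid <PRI>. */
int split_legacy(const char *line, int *facility, int *priority,
                 const char **text)
{
	const char *p = line;
	int pri = 0, digits = 0;

	assert(line != NULL);
	if (*p++ != '<')
		return -1;
	while (*p >= '0' && *p <= '9' && digits < 3) {
		pri = pri * 10 + (*p++ - '0');
		digits++;
	}
	if (digits == 0 || *p != '>')
		return -1;
	*facility = pri >> 3;   /* PRI = facility * 8 + priority */
	*priority = pri & 7;
	*text = p + 1;
	return 0;
}
```

For example, a line beginning with <14> splits into facility 1 (user) and
priority 6 (info), which is what the syslog(LOG_INFO, ...) call shown earlier
would produce with the default facility.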

The fourth interface is between syslogd and a local flat file.  The simplest
solution is for syslogd to write the same textual representation that it uses
for reading from /dev/log and from network sockets.  This has the advantage
of making it very easy to later reread all the local flat files, pipe them
through syslogd again, and enter them all into a database.

The fifth interface is between syslogd and a database of some sort.  The
database probably imposes more restrictions on the format of the stored data
than syslogd does, since we haven't settled on a standard log format.
Consequently, whichever database we want syslogd to log to forces at least
one of the following on us:

1) syslogd can only accept data formatted in a manner that corresponds to
   a format that the database stores easily,
2) syslogd must heavily reformat data that does not correspond to a format
   that the database stores easily, or
3) the data stored by syslogd might not be easily searched and indexed by
   the database.

Since a database is designed to aid in searching and indexing, we never want
compromise (3).  That leaves (1) and (2), and (1) is clearly far faster and
has much less risk of accidental data corruption or loss.

Thus if we decide that we want to use a database like, say, Oracle, we've
limited our choices for the first interface to (4), (5), or a reformatted and
restricted (6).  (4) has the advantage of being so structured that syslogd's
work is easy, but it is still inextensible.  (5) introduces the possibility
that the programmer may specify a field which does not correspond to any
column in the database.  (6) has the same disadvantage as (5) and still
requires an XML library in every logging program.
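
The risk with (5) can at least be detected cheaply.  A minimal sketch,
assuming syslogd holds a list of the column names of its log table (the
function name first_unmapped_field and the column names in the usage note are
made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* struct field as proposed in option (5). */
struct field {
	unsigned char *name;
	unsigned char *value;
	unsigned int name_length;
	unsigned int value_length;
};

/* Hypothetical check: return the index of the first field whose name is
 * not one of the known column names, or -1 if every field has a column. */
int first_unmapped_field(const struct field *fields, unsigned int count,
                         const char *const *columns, size_t ncolumns)
{
	unsigned int i;
	size_t j;

	assert(columns != NULL || ncolumns == 0);
	for (i = 0; i < count; i++) {
		int found = 0;

		for (j = 0; j < ncolumns && !found; j++) {
			size_t len = strlen(columns[j]);

			/* Compare by length first: names are not \0-terminated. */
			found = len == fields[i].name_length &&
			        memcmp(columns[j], fields[i].name, len) == 0;
		}
		if (!found)
			return (int)i;
	}
	return -1;
}
```

With columns "timestamp", "host", and "user", a message carrying the fields
user and bogus would be flagged at index 1.  What syslogd should then do with
such a message (drop the field, store it in a catch-all column, or reject the
message) is a policy question this sketch leaves open.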

If we decide, on the other hand, to depend on XML entirely and to store
everything as XML using XML-specific tools, we've limited our choices to a
reformatted (4), a reformatted (5), or (6).  In this case (6) is by far the
best option.

The sixth interface is between syslogd and a remote syslog host.  If syslogd
speaks the new format, and that format is unintelligible to the remote host,
our logs might get dropped on the floor.  Thus syslogd requires one of:

0) Incompatibility,
1) Careful configuration and the ability to speak a little bit of the old
   format,
2) A new format which is backwards compatible with the old, or
3) A new format which is always sent and received over the network in a
   format which is backwards compatible with the old, but which is different
   the rest of the time.

(2) feels like it might be needlessly limiting, and (3) is ugly.  I like (1).

The seventh and eighth interfaces are from flat files and databases to a post
processor, respectively.  For a flat file, this is the format of the messages
on disk, and for a database, this is the structure of the database.  Making
these fast and easy to parse is not, I think, a problem for field="value" or
for XML.

Let me pop off the stack and return to what I first said.  So, where exactly
do you want to use XML, and where exactly do I want to use field="value"?  I
want to use interface (5) for syslog(3), field="value" for all the other
interfaces mentioned above, and Oracle or something like it as a database.
I have the impression that you want to use one of interfaces (2), (3), or (6)
for syslog(3), XML for all the other interfaces mentioned above, and something
XML-specific as a database.  Is this correct?

> Another use for nesting is basically a namespace issue. I'd prefer to 
> have, say, a standard IDS message and all of the vendor-specific stuff 
> in a <$vendor> subtag rather than a bunch of vendor-prefixed tag names.

So would I, but that's what standards like IDMEF are for.  It is impossible
to force every vendor to report what you want the way you want it; the most
you can do is ask, and if they don't agree, you'll have to reformat their log
messages.  And furthermore this is true of any formatting scheme, including
field="value".

This is an area where XML seems to have an advantage because of the DTD, but
I'm not so sure that it's as much of an advantage as it might look like,
because to allow truly flexible logging, you must let the vendor define his
own DTD.  This is necessary, in fact, if you're implementing a new protocol
or service of some sort.  And that lets the vendor get away with the same
vendor-prefixed tags that you'd prefer he not have.

Moreover, from a formal perspective, the right way to let vendors have their
own DTDs is to nest them within messages, for example, <!DOCTYPE LOGMSG> ...
<!DOCTYPE ESMTP> ...  </DOCTYPE> ... </DOCTYPE>.  But [XML], sections 2.1 and
2.8, has:

[1]     document  ::=    prolog element Misc*
[22]    prolog    ::=    XMLDecl? Misc* (doctypedecl Misc*)?

which implies that you'd have to make the nested documents some sort of
literal data and start up new instances of your XML parser for each nesting.
The easiest way to do this seems to be to start a <![CDATA[ section each time.
If you have multiple nested CDATA sections, the terminating ]]> of each inner
CDATA section would have to be quoted, first as ]]&gt;, then as ]]&amp;gt;,
then as ]]&amp;amp;gt;, and so on, which is ugly.  I'm not an XML expert,
though, so hopefully there's a better way.

> The other area where nesting feels more natural is dumping more complex 
> data - things like RPC calls or SSL negotiation:

Yes, and that brings up a really good question: How many log messages are
complex enough to make XML useful, and how many are not?  Something like an
NTP time reset is too simple to need XML, but an SSL negotiation is complex.
Especially pertinent, I think, is the fact that the SSL negotiation may
involve an arbitrarily long chain of certificates, which XML would handle
easily, while field="value" would not.

I'd be willing to admit that field="value" is the wrong choice if there are
a lot of possibly useful log messages that are too complex for it to handle.
Unfortunately, I don't think that's likely.  I also like the idea of being
able to use standard databases, and I'm wary of XML's ability to handle huge
amounts of data efficiently, especially for post-processing.

> > And furthermore, we'd prefer to avoid "Message delivered
> > successfully" because that's a freeform string, so ideally all the tags
> > would be empty.
> 
> Presumably we'd define a DTD which would make any tags which could be 
> empty on successful transactions optional.

Not optional, because then successful transactions would never be recorded,
which we might not want.  Better would be to make them empty, e.g., <TAG/>
([XML], 3.1).

-- 
Kyle R. Hofmann <krhat_private>