On Fri, 23 Aug 2002 00:29:22 -0700, Chris Adams wrote:
> On Thursday, August 22, 2002, at 09:39 , Kyle R. Hofmann wrote:
> >> - support for hierarchal data
> >
> > But the second one isn't. We need a distinction between data that are
> > required for every message, like timestamps, and data that are specific
> > to the type of message. The ability to nest is unimportant, and that is
> > why XML is overkill.
>
> Nesting is definitely less of a priority but I think it would be useful
> for having standard tags which could be used in different events to
> refer to things like IP packets or login requests:
[ ... ]
> The idea is just that it would be nice for analysis packages to be able
> to recognize "standard" chunks and use them for basic reporting even if
> they don't have domain-specific knowledge about the event. A monitor
> could tell which services a given user is trying to access even if it
> doesn't know any of the service-specific details.

I agree, that would be nice. But it occurred to me last night that we've
all really been unclear on one thing, which is where we use our schemes.
Let me go on a long digression while I explain what I think we're trying
to get to.

A log message is generated when a program reaches a statement like

    syslog(LOG_INFO, "Hello, world!");

The log message is sent to the log socket /dev/log. syslogd (or its
replacement; I'll just say syslogd) reads a message from /dev/log or a
network socket. It determines the facility and priority of the message and
then forwards it to its destination, such as a remote syslog server, a file
on the local host, or a database. This gives us eight interfaces:

                                                   write(2) +--------+
+-------+ syslog(3) +--------+ read(2)             .------> |/var/log| \    +----+
|Program| --------> |/dev/log| ----.              /         +--------+  `-->|Post|
+-------+           +--------+      \> +-------+ / (varies) +--------+    .>+----+
                                       |syslogd| ---------> |database|   /
    +--------------+                /> +-------+ \          +--------+
    |network socket| --------------'              \         +--------------+
    +--------------+ recvfrom(2)                   `------> |network socket|
                                                  sendto(2) +--------------+

(where Post is a post processor; I ran out of horizontal room)

The first interface is between the program and /dev/log. This is a library
call. Since we're proposing to change the standard for log messages to
something, either this will have to be replaced with a call that provides
more structure, or we will have to rely on programmers to always use the
right structure all the time. As I see it, our options are:

0) Leave syslog(3) unchanged and give up.

1) Leave syslog(3) unchanged and add a reformatter between /dev/log and
   syslogd.

2) Leave syslog(3) unchanged and trust the programmer to use only
   "new-style" log messages.

3) Replace syslog(3), but do not significantly change the interface; trust
   the programmer to use only "new-style" log messages on the new library
   call.

4) Replace syslog(3) with

       int newsyslog(struct logmsg *logmsg, void *logdata);

   where struct logmsg has fields like facility, priority, and a type for
   logdata. logmsg->type would determine what sort of struct * to cast
   logdata to. This ensures that the data will contain exactly the
   structures that we want it to, but is inextensible: Adding a new format
   requires recompiling the syslog daemon and has portability problems.

5) Replace syslog(3) with

       int newsyslog(unsigned int logfieldcount, struct field *logfields,
                     unsigned int varfieldcount, struct field *varfields);

   where logfields and varfields are arrays of structures and logfieldcount
   and varfieldcount determine the number of fields in logfields and
   varfields.
   struct field would be

       struct field {
           unsigned char *name;
           unsigned char *value;
           unsigned int name_length;
           unsigned int value_length;
       };

   This does not ensure that we have all the data we need or want, but is
   very extensible. (Note that doing without struct field is a bad idea.
   We can't safely assume that there will never be any \0's in name and
   value.)

6) Replace syslog(3) with

       int newsyslog(xml_t xml_logmsg);

   where xml_logmsg is some representation of a log message as XML. This
   requires developing a set of XML library routines to manipulate log
   messages and using them in every program that wants to log.

Of these, my favorite is (5), though I think the names logfieldcount and
varfieldcount are too long. (1) is a good stopgap.

The second interface is between /dev/log and syslogd. syslogd needs to
read the message, parse the priority and facility, and forward it, so
consequently syslogd needs to be able to read whatever format is used to
write to /dev/log. Clearly parsing is easiest if every message is in a
regular format, and in the long run we'd prefer to use that format directly
instead of passing through an intermediate format. Thus it seems to me
that syslogd should read exactly one format, a textual representation of
(4), (5), or (6) above, and syslogd should depend upon a reformatter for
messages not in that format. It is possible for syslogd to depend on a
reformatter at its output end, but it is possible that there will be no
reformatter or that the protocol used for output makes use of a reformatter
very hard (such as when writing to a database).

The third interface is between a remote host sending a message and syslogd.
This should obviously be TCP instead of the traditional UDP. To minimize
the amount of code, this should use the same format as syslogd does for
/dev/log, and syslogd should parse it in the same way. There is the
possibility that the remote host is using an old syslog daemon which is
unaware of the new format.
In this case, the simplest solution is for the new syslogd to forward the
message to its local reformatter and parse the reformatted message.

The fourth interface is between syslogd and a local flat file. The
simplest solution is for syslogd to write the same textual representation
that it uses for reading from /dev/log and from network sockets. This has
the advantage of making it very easy to later reread all the local flat
files, pipe them through syslogd again, and enter them all into a database.

The fifth interface is between syslogd and a database of some sort. The
database probably imposes more restrictions on the format of the stored
data than syslogd does since we haven't settled on a standard log format.
Consequently the type of database syslogd should be able to log to imposes
at least one of:

1) syslogd can only accept data formatted in a manner that corresponds to a
   format that the database stores easily,

2) syslogd must heavily reformat data that does not correspond to a format
   that the database stores easily, or

3) the data stored by syslogd might not be easily searched and indexed by
   the database.

Since a database is designed to aid in searching and indexing, we never
want compromise (3). That leaves (1) and (2), and (1) is clearly far
faster and has much less risk of accidental data corruption or loss.

Thus if we decide that we want to use a database like, say, Oracle, we've
limited our choices for the first interface to (4), (5), or a reformatted
and restricted (6). (4) has the advantage of being so structured that
syslogd's work is easy, but it is still inextensible. (5) introduces the
possibility that the programmer may specify a field which does not
correspond to any column in the database. (6) has the same disadvantage as
(5) and still requires an XML library in every logging program.
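
For concreteness, here is a minimal sketch of how the struct field from
option (5) could be rendered into a field="value" line of the kind syslogd
might read, write to flat files, or map onto database columns. The names
format_fields and put_escaped, the backslash escapes, and the \xHH form for
non-printable bytes are assumptions made for this illustration only; nothing
in this thread has settled on an escaping rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* The structure from option (5): explicit lengths, so name and value
 * may hold any byte, including '\0'. */
struct field {
    unsigned char *name;
    unsigned char *value;
    unsigned int name_length;
    unsigned int value_length;
};

/* Write one value byte into out, escaping anything that would break the
 * field="value" syntax. The escape forms are an assumption of this sketch.
 * Returns the number of characters written, or 0 if out is too small. */
static size_t put_escaped(char *out, size_t room, unsigned char c)
{
    int n;

    if (c == '"' || c == '\\')
        n = snprintf(out, room, "\\%c", c);
    else if (c < 0x20 || c > 0x7e)
        n = snprintf(out, room, "\\x%02x", (unsigned int)c);
    else
        n = snprintf(out, room, "%c", c);
    return (n > 0 && (size_t)n < room) ? (size_t)n : 0;
}

/* Render count fields as space-separated name="value" pairs.
 * Names are assumed to be plain printable identifiers and are copied as-is.
 * Returns the length of the text, or -1 if out is too small. */
static int format_fields(const struct field *fields, unsigned int count,
                         char *out, size_t size)
{
    size_t pos = 0;

    assert(out != NULL && size > 0);
    for (unsigned int i = 0; i < count; i++) {
        const struct field *f = &fields[i];
        /* optional separator, the name, then =" */
        size_t need = (i > 0) + f->name_length + 2;

        if (pos + need >= size)
            return -1;
        if (i > 0)
            out[pos++] = ' ';
        memcpy(out + pos, f->name, f->name_length);
        pos += f->name_length;
        out[pos++] = '=';
        out[pos++] = '"';
        for (unsigned int j = 0; j < f->value_length; j++) {
            size_t n = put_escaped(out + pos, size - pos, f->value[j]);

            if (n == 0)
                return -1;
            pos += n;
        }
        if (pos + 1 >= size)
            return -1;
        out[pos++] = '"';
    }
    out[pos] = '\0';
    return (int)pos;
}
```

Called with two fields, facility/mail and a msg value containing a quote
and a \0, this would produce one line of text with the quote and the \0
escaped. Because the lengths travel with the pointers, the embedded \0 does
not truncate the value, which is the reason given above for keeping struct
field.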

If we decide, on the other hand, to depend on XML entirely and to store
everything as XML using XML-specific tools, we've limited our choices to a
reformatted (4), a reformatted (5), or (6). In this case (6) is by far the
best option.

The sixth interface is between syslogd and a remote syslog host. If
syslogd speaks the new format, and that format is unintelligible to the
remote host, our logs might get dropped on the floor. Thus syslogd
requires one of:

0) Incompatibility,

1) Careful configuration and the ability to speak a little bit of the old
   format,

2) A new format which is backwards compatible with the old, or

3) A new format which is always sent and received over the network in a
   format which is backwards compatible with the old, but which is
   different the rest of the time.

(2) feels like it might be needlessly limiting, and (3) is ugly. I like
(1).

The seventh and eighth interfaces are from flat files and databases to a
post processor, respectively. For a flat file, this is the format of the
messages on disk, and for a database, this is the structure of the
database. Making these fast and easy to parse is not, I think, a problem
for field="value" or for XML.

Let me pop off the stack and return to what I first said. So, where
exactly do you want to use XML, and where exactly do I want to use
field="value"? I want to use interface (5) for syslog(3), field="value"
for all the other interfaces mentioned above, and Oracle or something like
it as a database. I have the impression that you want to use one of
interfaces (2), (3), or (6) for syslog(3), XML for all the other interfaces
mentioned above, and something XML-specific as a database. Is this
correct?

> Another use for nesting is basically a namespace issue. I'd prefer to
> have, say, a standard IDS message and all of the vendor-specific stuff
> in a <$vendor> subtag rather than a bunch of vendor-prefixed tag names.

So would I, but that's what standards like IDMEF are for.
It is impossible to force every vendor to report what you want the way you
want it; the most you can do is ask, and if they don't agree, you'll have
to reformat their log messages. And furthermore this is true of any
formatting scheme, including field="value".

This is an area where XML seems to have an advantage because of the DTD,
but I'm not so sure that it's as much of an advantage as it might look
like, because to allow truly flexible logging, you must let the vendor
define his own DTD. This is necessary, in fact, if you're implementing a
new protocol or service of some sort. And that lets the vendor get away
with the same vendor-prefixed tags that you'd prefer he not have.

Moreover, from a formal perspective, the right way to let vendors have
their own DTDs is to nest them within messages, for example,

    <!DOCTYPE LOGMSG> ... <!DOCTYPE ESMTP> ... </DOCTYPE> ... </DOCTYPE>

But [XML], sections 2.1 and 2.8, has:

    [1]  document ::= prolog element Misc*
    [22] prolog   ::= XMLDecl? Misc* (doctypedecl Misc*)?

which implies that you'd have to make the nested documents some sort of
literal data and start up new instances of your XML parser for each
nesting. The easiest way to do this seems to be to start a <![CDATA[
section each time. If you have multiple nested CDATA sections, the
terminating ]]> of each inner CDATA section would have to be quoted, first
as ]]&gt;, then as ]]&amp;gt;, then as ]]&amp;amp;gt;, and so on, which is
ugly. I'm not an XML expert, though, so hopefully there's a better way.

> The other area where nesting feels more natural is dumping more complex
> data - things like RPC calls or SSL negotiation:

Yes, and that brings up a really good question: How many log messages are
complex enough to make XML useful, and how many are not? Something like an
NTP time reset is too simple to need XML, but an SSL negotiation is
complex.
Especially pertinent, I think, is the fact that the SSL negotiation may
involve an arbitrarily long chain of certificates, which XML would handle
easily, while field="value" would not.

I'd be willing to admit that field="value" is the wrong choice if there are
a lot of possibly useful log messages that are too complex for it to
handle. Unfortunately, I don't think that's likely, I like the idea of
being able to use standard databases, and I'm wary of the ability of XML to
handle huge amounts of data efficiently, especially for post-processing.

> > And furthermore, we'd prefer to avoid "Message delivered
> > successfully" because that's a freeform string, so ideally all the tags
> > would be empty.
>
> Presumably we'd define a DTD which would make any tags which could be
> empty on successful transactions optional.

Not optional, because then successful transactions would never be recorded,
which we might not want. Better would be to make them empty, e.g., <TAG/>
([XML], 3.1).

-- 
Kyle R. Hofmann <krhat_private>
_______________________________________________
LogAnalysis mailing list
LogAnalysisat_private
http://lists.shmoo.com/mailman/listinfo/loganalysis