Re: Re[2]: [logs] Logging: World Domination

cadamsat_private

On Thursday, August 22, 2002, at 02:02 , Bennett Todd wrote:
> When I refer to "gigantic corpuses of structured text", I'm not
> referring to log data; I'm referring to e.g. all the web pages on
> the world-wide web; all the reference manuals in the Linux
> Documentation Project; all the documentation that a company the size
> of Sun cranks out; etc. The traditional homes of SGML. XML, which is

While you may think of XML solely as the successor to SGML, it's used 
for considerably more than that. XML works great anywhere you need more 
flexibility than something like CSV gives you and want to produce 
something which is easily maintainable and easily handled by a variety 
of tools. A good example is comparing how you parse nmap's XML output
with how you parse the older "grepable" format - there's a reason why
XML is now the preferred output mechanism and it's not
buzzword-compliance.

>> We need automated creation, distribution and analysis to work
>> across numerous different platforms with products from hundreds of
>> different vendors.
>
> We can express the semantics that we need with a record that's a
> linear list of whitespace-separated tokens on a single text line,
> with some fixed fields always required, followed by hierarchically
> assigned tokens, first one defining a category ("OS"; "Firewall";
> "DB"; "Webserver"; ...); then a separate list of tokens for each of
> those categories; the remainder of the record with format defined
> appropriately for each of those; and so forth.
...
> It's only a huge amount of work for people who have trouble
> splitting text on whitespace. I hope we won't have them working on

A simple whitespace delimited format is something like tab-delimited 
text - raw fields without metadata. What you're describing now is 
different:

- per-record metadata
- support for hierarchical data
- support for adding additional user/vendor-specific fields

That last one is important - if I use XML, I can add an attribute or tag 
anywhere I'd like without causing anything to break - code which isn't 
looking for it will simply ignore it. This is much cleaner than having 
to tack everything in a large extension jumble at the end where there's 
no connection between standard field a and custom field n.
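
To illustrate, here is a minimal sketch using Python's standard
xml.etree.ElementTree module. The element and attribute names (event,
host, severity, vendor-rack, vendor-detail) are invented for this
example and are not part of any proposed log format. A consumer written
against the original record keeps working when a vendor adds an
attribute and a child element:

```python
import xml.etree.ElementTree as ET

# Hypothetical log records; all names here are invented for illustration.
ORIGINAL = '<event host="fw1" severity="warn"><msg>link down</msg></event>'
EXTENDED = (
    '<event host="fw1" severity="warn" vendor-rack="B12">'
    '<msg>link down</msg>'
    '<vendor-detail port="3"/>'
    '</event>'
)

def summarize(xml_text):
    """Read only the fields this consumer knows about."""
    event = ET.fromstring(xml_text)
    return (event.get("host"), event.get("severity"), event.findtext("msg"))

# The extra attribute and element are never looked up, so they are ignored.
print(summarize(ORIGINAL))
print(summarize(EXTENDED))
```

Both calls return the same tuple, because the code only asks for the
fields it knows by name.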

I think you are consistently underestimating the amount of work
involved in producing a fast, bug-free parser for a delimited format
which is continually being redefined, while being careful to handle fun
things like missing/overly long fields, embedded whitespace, Unicode
text (including Unicode whitespace / line breaks, etc.) and assorted
special characters (e.g. if someone does something cute like including a
null - hope you used strtok() securely).
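
As a small sketch of how quietly this goes wrong, assume a hypothetical
four-field record (timestamp, host, category, message) and a parser
that just splits on whitespace. The field layout and sample lines are
made up for illustration:

```python
# Hypothetical record layout, invented for this example.
FIELDS = ("timestamp", "host", "category", "message")

def naive_parse(line):
    """Split on whitespace and pair tokens with field names, in order."""
    return dict(zip(FIELDS, line.split()))

good = "2002-08-22T14:02 fw1 Firewall denied"
# A message containing a space: the trailing token is silently dropped.
bad = "2002-08-22T14:02 fw1 Firewall connection denied"
# A Unicode EM SPACE (U+2003) inside the host name: str.split() treats it
# as whitespace, so every later field shifts by one position.
tricky = "2002-08-22T14:02 fw\u20031 Firewall denied"

for line in (good, bad, tricky):
    print(naive_parse(line))
```

Neither bad record raises an error; the parser just returns wrong
data, which is exactly the kind of bug that takes real work to rule out.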

The hierarchical bit really means that this cannot be described as
simple. Nesting is a really good idea but it increases the complexity
of your parser and opens up plenty of opportunities for bugs.

Still, that isn't a huge amount of work to do - once. Now we have a 
non-standard format supported in one or two environments. Nobody else 
knows anything about it, absolutely nothing supports it and people who 
aren't using the same language or environment we prefer get to do it all 
over again. That's where the huge amount of work comes from - to make it 
as easy to use or widespread as XML, you need to provide support in 
numerous different languages on numerous different platforms and 
environments, taking care to maintain ease of use and provide 
documentation for anyone who's going to have to touch this new format.

Or you could just use XML, where all of this has already been done.
Plus, far more programmers have worked on optimizing and debugging the
existing XML parsers, and they are heavily tested, too.

> Or we could express the same thing with XML, to buy ourselves some
> buzzword compliance at the expense of a preposterously more complex
> (==inefficient, nonportable, bugridden, security-problem-inducing)

You take this as an article of faith but have yet to support it in any 
way. XML isn't just documents - people are using it to interface between 
different systems, replace RPC and otherwise provide a common format 
where they need one. Given that XML is in widespread use for high-volume 
tasks you'd think one of the programmers using it might have noticed 
such critical flaws.

> if we _use_ the extra flexibility XML gives us, we lose the ability to
> automate the processing.

I assume you had some specific point in mind here but I can't think of a 
case in which this statement isn't wrong. Could you share your rationale?

Remember that XML allows you to specify requirements (that's what a 
validating parser checks) but it's always extensible. There's no reason
why an XML element cannot contain additional elements beyond the ones
it is required to have, and any code which uses it will simply ignore
the additional fields unless it is specifically written to look for
them.

>> Of course, there's a different answer to "I sure know which I'd rather
>> parse": it'll take me less time to do "$events = XMLIn('syslog.xml')"
>> than it will to parse anything.
>
> Now find me an XML parser that can be used that way, portably across
> platforms, which performs reasonably well, and

That's Perl's XML::Simple, which is very widely available. There are
similar tools, such as DOMXML, which are also widely available. If
you're into Microsoft stuff, there are quite a few options there.
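
For comparison, here is a rough Python equivalent of that one-liner,
using only the standard xml.etree.ElementTree module. It is a sketch,
not a port of XML::Simple: the log layout and the load_events() helper
are hypothetical, and the XML is inlined as a string rather than read
from syslog.xml so the example is self-contained:

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of a syslog.xml file; layout invented for illustration.
SYSLOG_XML = """
<log>
  <event host="fw1" severity="warn">link down</event>
  <event host="db2" severity="info">backup complete</event>
</log>
"""

def load_events(xml_text):
    """Return each <event> as a plain dict of its attributes plus its text."""
    root = ET.fromstring(xml_text)
    return [dict(e.attrib, text=e.text) for e in root.iter("event")]

events = load_events(SYSLOG_XML)
print(events[0]["host"], events[1]["text"])
```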

>> There seemed to be a general consensus that we need a replacement for
>> syslog [...]
>
> I missed that consensus. We need a more formally structured log
> format, allowing more structured representation of hierarchical
> classification, and perhaps some other things.

In other words, what I said: we need something structured which is 
easily extensible. Any tagged format meets that requirement but a 
*simple* whitespace-delimited format does not.

> XML is more work for everybody to implement, and makes the job
> harder, and less likely to succeed. XML sucks.

The religious tone here is really making me lose interest in this thread.

Chris
