Re: Re[2]: [logs] Logging: World Domination

betat_private

2002-08-22-15:25:18 Chris Adams:
> >He put it compactly. A more elaborate statement might be "XML is a
> >very heavy-weight framework for constructing languages; while it may
> >be valuable in certain contexts involving highly automated,
> >distributed, and heterogeneous maintenance of gigantic corpuses of
> 
> That sure sounds like logging to me.

If, however, you hadn't truncated my comment, it's possible it would
sound less like logging.

When I refer to "gigantic corpuses of structured text", I'm not
referring to log data; I'm referring to e.g. all the web pages on
the world-wide web; all the reference manuals in the Linux
Documentation Project; all the documentation that a company the size
of Sun cranks out; etc. The traditional homes of SGML. XML, which is
just SGML with some of the more obscure cruft trimmed off, is useful
in the same places that SGML is useful. It remains a lousy choice
for lightweight applications. Its flexibility is actually in
opposition to the goal of the project currently under discussion.

> We need automated creation, distribution and analysis to work
> across numerous different platforms with products from hundreds of
> different vendors.

We can express the semantics that we need with a record that's a
linear list of whitespace-separated tokens on a single text line,
with some fixed fields always required, followed by hierarchically
assigned tokens, first one defining a category ("OS"; "Firewall";
"DB"; "Webserver"; ...); then a separate list of tokens for each of
those categories; the remainder of the record with format defined
appropriately for each of those; and so forth.
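
To make that concrete, here is a minimal sketch in Python of the kind
of record I mean. The fixed fields (timestamp, host) and the sample
line are my assumptions for illustration only; nobody has settled
which fields would actually be required.

```python
# Hypothetical layout: <timestamp> <host> <category> <category-specific tokens...>
FIXED_FIELDS = ("timestamp", "host")  # assumed; the real required set is undecided


def parse_record(line):
    """Split one log line into fixed fields, a category, and the remainder."""
    tokens = line.split()  # split on runs of whitespace
    if len(tokens) < len(FIXED_FIELDS) + 1:
        raise ValueError("record too short: %r" % line)
    record = dict(zip(FIXED_FIELDS, tokens))
    record["category"] = tokens[len(FIXED_FIELDS)]
    # Everything after the category is interpreted per category.
    record["rest"] = tokens[len(FIXED_FIELDS) + 1:]
    return record


# Made-up sample line, not taken from any real log.
rec = parse_record("1234567890 gw1.example.com Firewall deny tcp 10.0.0.5 22")
```

The whole parser is one call to split() plus a length check; the
category token then decides how the remaining tokens are read.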

Or we could express the same thing with XML, to buy ourselves some
buzzword compliance at the expense of a preposterously more complex
(== inefficient, nonportable, bug-ridden, security-problem-inducing)
parser for a class of complex structured languages. If we want to
sabotage this project, XML would be a fine step.

Or, as the previous poster more reasonably put it, "XML sucks".

> That's the whole point to interchange languages like XML - you
> can translate it into anything you like when it gets to the final
> storage point but use a standard format to cross the vendor
> boundaries.

That's the whole point to a canonical reference format. XML is more
complex than we need for this application; the only things it
contributes --- increased flexibility in record format, increased
likelihood of security and performance and portability problems due
to the vastly more complex parser required --- are things that we do
not want.

> I get the impression that you haven't seriously looked at XML in several 
> years.

I keep an eye on it, but the only problems for which it's a
reasonable answer that I've seen are problems in the traditional
SGML spaces.

> There are now a number of high-performance validating XML parsers 
> available, both commercial and open source.

Find me one anywhere approaching the simplicity of
split-on-whitespace and I'll get interested. If it buys us nothing
but increased complexity, then it's a lose.

> We could provide the same thing for any new format but that's a huge 
> amount of work we can get for free if we decide that while XML isn't 
> perfect it is good enough to do the job.

It's only a huge amount of work for people who have trouble
splitting text on whitespace. I hope we won't have them working on
these tools. I sure wouldn't want anybody who couldn't easily code
split on whitespace doing security-critical coding with an XML
toolkit. That sounds like a really really bad idea.

> >But even:
> >
> >	<event host="..." timestamp="1234567890">
> >
> >would seem to me to be less desirable than
> >
> >	1234567890 ...
> >
> >I sure know which I'd rather parse.
> 
> The second one is easier. Unfortunately, "1234567890 ..." contains no 
> information about what each of the fields actually means and we need to
> have a number of optional fields which are only applicable to certain 
> classes of message or are vendor specific.

If we want to be able to process the XML in a fashion more automated
than existing syslog noise, we have to force the XML to be
structured and conform to strict rules. We can enforce the same
constraints on a string of tokens. XML doesn't make this job easier.

> If we use XML, we don't have to change anything. If we're using
> white-space separated lists, we have to throw out everything
> and replace it with some sort of tagged format when we realize
> that we'd like to do more complex analysis and an smtp server
> has a fundamentally different set of things it can report than a
> firewall, database, web server or storage manager.

Nope, we just need to have a different family of tags for SMTP
server records from the family of tags for firewalls, which in turn
would be distinct from database, webserver, storage manager, etc. We
must enforce the same data structuring requirements either way; if
we _use_ the extra flexibility XML gives us, we lose the ability to
automate the processing. They're equivalent for this job; let's use
the simpler one.
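
As a rough sketch of what a "family of tags" per category could look
like with plain tokens, again in Python: the category names and field
names below are hypothetical, chosen only to show that the structuring
rules can be enforced without XML.

```python
# Hypothetical tag families; category and field names are illustrative only.
TAG_FAMILIES = {
    "SMTP": ("action", "sender", "recipient"),
    "Firewall": ("action", "proto", "src", "dport"),
}


def parse_category(category, tokens):
    """Map category-specific tokens onto that category's field names."""
    try:
        names = TAG_FAMILIES[category]
    except KeyError:
        raise ValueError("unknown category: %r" % category)
    # Strict rule: each category has a fixed number of tokens.
    if len(tokens) != len(names):
        raise ValueError("%s record needs %d tokens, got %d"
                         % (category, len(names), len(tokens)))
    return dict(zip(names, tokens))


# Made-up sample tokens for a firewall record.
fields = parse_category("Firewall", "deny tcp 10.0.0.5 22".split())
```

Adding a new class of device means adding one entry to the table; a
record that breaks the rules is rejected instead of silently accepted.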

> Of course, there's a different answer to "I sure know which I'd rather 
> parse": it'll take me less time to do "$events = XMLIn('syslog.xml')" 
> than it will to parse anything.

Now find me an XML parser that can be used that way, portably across
platforms, which performs reasonably well, and whose code isn't
enough more complex than a simple whitespace split to guarantee new
and special security problems, and you'll have succeeded in making a
reasonable argument that XML isn't much worse than a simple
whitespace list of tokens. I have yet to see any way in which it
would be better.

> There seemed to be a general consensus that we need a replacement for 
> syslog [...]

I missed that consensus. We need a more formally structured log
format, allowing more structured representation of hierarchical
classification, and perhaps some other things.

> [...] which is tag based to allow different bits of information to
> be recorded in a structured fashion - that's why we need more than
> the simpler format you proposed can deliver.

I've yet to see that. We need a prefix of fixed fields, followed by
hierarchically assigned tokens, possibly followed by something else.
But the only use for the flexibility of XML would be to help ensure
that the data could not be automatically processed any better than
the current syslog raw strings.

> The question is whether we should invent our own format or use a
> standard format like XML.

XML is a lovely standard, well-suited to a wildly different
application domain. For jobs like this, it sucks.

> I think that the overhead of using XML will not be significant
> compared to using any other tagged format and that there's a big
> advantage to picking a widely used, well supported standard.

Whew. Where's my XML parser that's not substantially heavier-weight
than perl's split// or C's strtok(3)? Find me one and I'll back off.
Paying orders of magnitude in performance and code complexity to buy
the flexibility to prevent our project from succeeding sounds like a
lose to me.

> In a perfect world we'd have time to develop the One True Log
> format and legions of programmers to spend months providing
> support for it everywhere. Since that's not the case, I'm inclined
> to say that anything which allows us to spend less time
> reinventing the wheel and more time on analysis is a good thing.

XML is more work for everybody to implement, and makes the job
harder, and less likely to succeed. XML sucks.

-Bennett