2002-08-22-15:25:18 Chris Adams:
> >He put it compactly. A more elaborate statement might be "XML is a
> >very heavy-weight framework for constructing languages; while it may
> >be valuable in certain contexts involving highly automated,
> >distributed, and heterogenous maintenance of gigantic corpuses of
>
> That sure sounds like logging to me.

If, however, you hadn't truncated my comment, it's possible it would sound less like logging. When I refer to "gigantic corpuses of structured text", I'm not referring to log data; I'm referring to e.g. all the web pages on the world-wide web; all the reference manuals in the Linux Documentation Project; all the documentation that a company the size of Sun cranks out; etc. The traditional homes of SGML.

XML, which is just SGML with some of the more obscure cruft trimmed off, is useful in the same places that SGML is useful. It remains a lousy choice for lightweight applications. Its flexibility is actually in opposition to the goal of the project currently under discussion.

> We need automated creation, distribution and analysis to work
> across numerous different platforms with products from hundreds of
> different vendors.

We can express the semantics that we need with a record that's a linear list of whitespace-separated tokens on a single text line: some fixed fields always required, followed by hierarchically assigned tokens. The first of those tokens defines a category ("OS"; "Firewall"; "DB"; "Webserver"; ...); then comes a separate list of tokens for each of those categories; the remainder of the record has its format defined appropriately for each of those; and so forth.

Or we could express the same thing with XML, to buy ourselves some buzzword compliance at the expense of a preposterously more complex (==inefficient, nonportable, bugridden, security-problem-inducing) parser for a class of complex structured languages. If we want to sabotage this project, XML would be a fine step.
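To make the token-list record above concrete, here is a minimal sketch in Python. The choice of a timestamp and a host as the fixed fields, and the names Record and parse_record, are assumptions made for illustration only; nothing in this thread has settled an actual field layout.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """One log record: fixed fields first, then category-specific tokens."""
    timestamp: int      # assumed fixed field
    host: str           # assumed fixed field
    category: str       # e.g. "OS", "Firewall", "DB", "Webserver"
    fields: List[str]   # remaining tokens, meaning depends on category


def parse_record(line: str) -> Record:
    """Split one text line on whitespace and peel off the fixed fields."""
    tokens = line.split()
    if len(tokens) < 3:
        raise ValueError("record needs timestamp, host and category")
    timestamp, host, category, *fields = tokens
    return Record(int(timestamp), host, category, fields)
```

Something like parse_record("1234567890 gw1 Firewall deny tcp") would yield the category "Firewall" with ["deny", "tcp"] left over for a category-specific handler to interpret.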
Or, as the previous poster more reasonably put it, "XML sucks".

> That's the whole point to interchange languages like XML - you
> can translate it into anything you like when it gets to the final
> storage point but use a standard format to cross the vendor
> boundaries.

That's the whole point to a canonical reference format. XML is more complex than we need for this application; the only things it contributes --- increased flexibility in record format, increased likelihood of security and performance and portability problems due to the vastly more complex parser required --- are things that we do not want.

> I get the impression that you haven't seriously looked at XML in several
> years.

I keep an eye on it, but the only problems I've seen for which it's a reasonable answer are problems in the traditional SGML spaces.

> There are now a number of high-performance validating XML parsers
> available, both commercial and open source.

Find me one anywhere approaching the simplicity of split-on-whitespace and I'll get interested. If it buys us nothing but increased complexity, then it's a lose.

> We could provide the same thing in for any new format but that's a huge
> amount of work we can get for free if we decide that while XML isn't
> perfect it is good enough to do the job.

It's only a huge amount of work for people who have trouble splitting text on whitespace. I hope we won't have them working on these tools. I sure wouldn't want anybody who couldn't easily code a split on whitespace doing security-critical coding with an XML toolkit. That sounds like a really really bad idea.

> >But even:
> >
> > <event host="..." timestamp="1234567890">
> >
> >would seem to me to be less desireable than
> >
> > 1234567890 ...
> >
> >I sure know which I'd rather parse.
>
> The second one is easier. Unfortunately, "1234567890 ..."
> contains no
> information about what each of the fields actually means and we need to
> have a number of optional fields which are only applicable to certain
> classes of message or are vendor specific.

If we want to be able to process the XML in a fashion more automated than existing syslog noise, we have to force the XML to be structured and conform to strict rules. We can enforce the same constraints on a string of tokens. XML doesn't make this job easier.

> If we use XML, we don't have to change anything. If we're using
> white-space separated lists, we have to throw out everything
> and replace it with some sort of tagged format when we realize
> that we'd like to do more complex analysis and an smtp server
> has a fundamentally different set of things it can report than a
> firewall, database, web server or storage manager.

Nope, we just need to have a different family of tags for SMTP server records from the family of tags for firewalls, which in turn would be distinct from database, webserver, storage manager, etc. We must enforce the same data structuring requirements either way; if we _use_ the extra flexibility XML gives us, we lose the ability to automate the processing. They're equivalent for this job; let's use the simpler one.

> Of course, there's a different answer to "I sure know which I'd rather
> parse": it'll take me less time to do "$events = XMLIn('syslog.xml')"
> than it will to parse anything.

Now find me an XML parser that can be used that way, portably across platforms, which performs reasonably well, and whose code isn't enough more complex than a simple whitespace split to guarantee new and special security problems, and you'll have succeeded in making a reasonable argument that XML isn't much worse than a simple whitespace list of tokens. I have yet to see any way in which it would be better.

> There seemed to be a general consensus that we need a replacement for
> syslog [...]

I missed that consensus.
We need a more formally structured log format, allowing more structured representation of hierarchical classification, and perhaps some other things.

> [...] which is tag based to allow different bits of information to
> be recorded in a structured fashion - that's why we need more than
> the simpler format you proposed can deliver.

I've yet to see that. We need a prefix of fixed fields, followed by hierarchically assigned tokens, possibly followed by something else. But the only use for the flexibility of XML would be to help ensure that the data could not be automatically processed any better than the current syslog raw strings.

> The question is whether we should invent our own format or use a
> standard format like XML.

XML is a lovely standard, well-suited to a wildly different application domain. For jobs like this, it sucks.

> I think that the overhead of using XML will not be significant
> compared to using any other tagged format and that there's a big
> advantage to picking a widely used, well supported standard.

Whew. Where's my XML parser that's not substantially heavier-weight than perl's split// or C's strtok(3)? Find me one and I'll back off. Paying orders of magnitude in performance and code complexity to buy the flexibility to prevent our project from succeeding sounds like a lose to me.

> In a perfect world we'd have time to develop the One True Log
> format and legions of programmers to spend months providing
> support for it everywhere. Since that's not the case, I inclined
> to say that anything which allows us to spend less time
> reinventing the wheel and more time on analysis is a good thing.

XML is more work for everybody to implement, and makes the job harder, and less likely to succeed. XML sucks.

-Bennett
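On the point that each category gets its own family of tags and that the same structuring rules can be enforced on a plain token list, a rough sketch in Python follows. The category names come from the discussion, but the per-category field names in SCHEMAS, the fixed timestamp and host fields, and the function name parse_event are all hypothetical, chosen only to show the shape of the idea.

```python
# Hypothetical per-category field lists; a real format would define these.
SCHEMAS = {
    "Firewall": ("action", "proto", "src", "dst"),
    "SMTP": ("queue_id", "sender", "status"),
}


def parse_event(line):
    """Split a record on whitespace and enforce its category's field list."""
    tokens = line.split()
    if len(tokens) < 3:
        raise ValueError("record needs timestamp, host and category")
    timestamp, host, category, *rest = tokens

    names = SCHEMAS.get(category)
    if names is None:
        raise ValueError("unknown category: " + category)
    if len(rest) != len(names):
        raise ValueError("wrong number of fields for " + category)

    # Fixed fields first, then the category-specific ones by name.
    event = {"timestamp": int(timestamp), "host": host, "category": category}
    event.update(zip(names, rest))
    return event
```

A record with an unknown category or the wrong number of tokens is rejected outright, which is the strictness any automated processing would need whichever syntax carries the data.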