On Thursday, August 22, 2002, at 02:02, Bennett Todd wrote:

> When I refer to "gigantic corpuses of structured text", I'm not
> referring to log data; I'm referring to e.g. all the web pages on
> the world-wide web; all the reference manuals in the Linux
> Documentation Project; all the documentation that a company the size
> of Sun cranks out; etc. The traditional homes of SGML. XML, which is

While you may think of XML solely as the successor to SGML, it's used for considerably more than that. XML works great anywhere you need more flexibility than something like CSV gives you and want to produce something which is easily maintainable and easily handled by a variety of tools. A good example is comparing the work of parsing nmap's XML output with that of parsing the older "grepable" format - there's a reason why XML is now the preferred output mechanism, and it's not buzzword-compliance.

>> We need automated creation, distribution and analysis to work
>> across numerous different platforms with products from hundreds of
>> different vendors.
>
> We can express the semantics that we need with a record that's a
> linear list of whitespace-separated tokens on a single text line,
> with some fixed fields aways required, followed by heirarchically
> assigned tokens, first one defining a category ("OS"; "Firewall";
> "DB"; "Webserver"; ...); then a separate list of tokens for each of
> those categories; the remainder of the record with format defined
> appropriately for each of those; and so forth.
...
> It's only a huge amount of work for people who have trouble
> splitting text on whitespace. I hope we won't have them working on

A simple whitespace-delimited format is something like tab-delimited text - raw fields without metadata.
What you're describing now is different:

- per-record metadata
- support for hierarchical data
- support for adding additional user/vendor-specific fields

That last one is important - if I use XML, I can add an attribute or tag anywhere I'd like without causing anything to break - code which isn't looking for it will simply ignore it. This is much cleaner than having to tack everything onto a large extension jumble at the end, where there's no connection between standard field a and custom field n.

I think you are consistently underestimating the amount of work involved in producing a fast, bug-free parser for a delimited format which is continually being redefined, while being careful to handle fun things like missing/overly long fields, embedded whitespace, Unicode text (including Unicode whitespace / line breaks, etc.) and assorted special characters (e.g. if someone does something cute like including a null - hope you used strtok() securely).

The hierarchical bit really means that this cannot be described as simple. Nesting is a really good idea but it increases the complexity of your parser and opens up plenty of opportunities for bugs.

Still, that isn't a huge amount of work to do - once. Now we have a non-standard format supported in one or two environments. Nobody else knows anything about it, absolutely nothing supports it, and people who aren't using the same language or environment we prefer get to do it all over again. That's where the huge amount of work comes from - to make it as easy to use or as widespread as XML, you need to provide support in numerous different languages on numerous different platforms and environments, taking care to maintain ease of use and provide documentation for anyone who's going to have to touch this new format.

Or you could just use XML, where this has already been done. Plus, more programmers have worked on optimizing and bug-fixing the existing XML parsers, and they're heavily tested, too.
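To make the extensibility point concrete, here is a minimal sketch in Python (my choice of language for illustration). The record layout and the names `event`, `msg`, `vendor-id` and `acme-detail` are invented for this example and don't come from any real log format. It assumes a consumer written against the original record and shows that it returns the same result after a vendor adds an attribute and a nested element, while a naive whitespace split of a comparable flat record can no longer tell where the message field starts.

```python
import xml.etree.ElementTree as ET

# Hypothetical log record; all element and attribute names are made up.
ORIGINAL = '<event host="fw1" severity="warn"><msg>link down on eth0</msg></event>'

# The same record after a vendor adds its own attribute and nested element.
EXTENDED = (
    '<event host="fw1" severity="warn" vendor-id="acme-42">'
    '<msg>link down on eth0</msg>'
    '<acme-detail port="3">flap count 7</acme-detail>'
    '</event>'
)

# A comparable flat record: three logical fields, but the message
# itself contains spaces.
FLAT = "fw1 warn link down on eth0"


def summarize(xml_text):
    """Return (host, severity, message) using only the fields this code knows."""
    event = ET.fromstring(xml_text)
    # Attributes and child elements we never ask for are simply not looked at.
    return (event.get("host"), event.get("severity"), event.findtext("msg"))


if __name__ == "__main__":
    print(summarize(ORIGINAL))
    print(summarize(EXTENDED))
    # Naive splitting yields one token per word, not one per field.
    print(FLAT.split())
```

The flat format can of course be rescued with quoting or escaping rules, but every such rule is one more thing each independent parser has to implement identically.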
> Or we could express the same thing with XML, to buy ourselves some
> buzzword compliance at the expense of a preposterously more complex
> (==inefficient, nonportable, bugridden, security-problem-inducing)

You take this as an article of faith but have yet to support it in any way. XML isn't just documents - people are using it to interface between different systems, replace RPC and otherwise provide a common format where they need one. Given that XML is in widespread use for high-volume tasks, you'd think one of the programmers using it might have noticed such critical flaws.

> if we _use_ the extra flexibility XML gives us, we lose the ability to
> automate the processing.

I assume you had some specific point in mind here but I can't think of a case in which this statement isn't wrong. Could you share your rationale? Remember that XML allows you to specify requirements (that's what a validating parser checks) but it's always extensible. There's no reason why an XML tag cannot contain additional elements beyond the ones it is required to have, and any code which uses it will simply ignore the additional fields unless it is specifically written to look for them.

>> Of course, there's a different answer to "I sure know which I'd rather
>> parse": it'll take me less time to do "$events = XMLIn('syslog.xml')"
>> than it will to parse anything.
>
> Now find me an XML parser that can be used that way, portably across
> platforms, which performs reasonably well, and

That's Perl's XML::Simple, which is very widely available. There are similar tools like DOMXML, which is also widely available. If you're into Microsoft stuff, there are quite a few options there.

>> There seemed to be a general consensus that we need a replacement for
>> syslog [...]
>
> I missed that consensus. We need a more formally structured log
> format, allowing more structured representation of heirarchical
> classification, and perhaps some other things.
In other words, what I said: we need something structured which is easily extensible. Any tagged format meets that requirement but a *simple* whitespace-delimited format does not.

> XML is more work for everybody to implement, and makes the job
> harder, and less likely to succeed. XML sucks.

The religious tone here is really making me lose interest in this thread.

Chris

_______________________________________________
LogAnalysis mailing list
LogAnalysisat_private
https://lists.shmoo.com/mailman/listinfo/loganalysis
This archive was generated by hypermail 2b30 : Thu Aug 22 2002 - 18:23:14 PDT