Re: Re[2]: [logs] Logging: World Domination

From: Chris Adams (cadamsat_private)
Date: Thu Aug 22 2002 - 16:02:18 PDT

    On Thursday, August 22, 2002, at 02:02 , Bennett Todd wrote:
    > When I refer to "gigantic corpuses of structured text", I'm not
    > referring to log data; I'm referring to e.g. all the web pages on
    > the world-wide web; all the reference manuals in the Linux
    > Documentation Project; all the documentation that a company the size
    > of Sun cranks out; etc. The traditional homes of SGML. XML, which is
    
    While you may think of XML solely as the successor to SGML, it's used 
    for considerably more than that. XML works great anywhere you need more 
    flexibility than something like CSV gives you and want to produce 
    something which is easily maintainable and easily handled by a variety 
    of tools. A good example is comparing how you parse nmap's XML output with
    how you parse the older "grepable" format - there's a reason why XML is now
    the preferred output mechanism, and it's not buzzword-compliance.
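    
    To make the nmap comparison concrete, here is a rough Python sketch of
    pulling open ports out of XML shaped like nmap's -oX output. The sample
    is hand-trimmed and not from a real scan, and open_ports() is just a
    name I made up for illustration; real output carries many more elements
    and attributes, which this code would simply skip over.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed sample in the shape of nmap's -oX output (not a real scan).
SAMPLE = """<nmaprun>
  <host>
    <address addr="192.0.2.10" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="22">
        <state state="open"/>
        <service name="ssh"/>
      </port>
      <port protocol="tcp" portid="80">
        <state state="closed"/>
        <service name="http"/>
      </port>
    </ports>
  </host>
</nmaprun>"""


def open_ports(xml_text):
    """Return (address, port, service) tuples for every open port."""
    root = ET.fromstring(xml_text)
    results = []
    for host in root.findall("host"):
        addr = host.find("address").get("addr")
        for port in host.findall("ports/port"):
            if port.find("state").get("state") != "open":
                continue
            service = port.find("service")
            name = service.get("name") if service is not None else None
            results.append((addr, int(port.get("portid")), name))
    return results


print(open_ports(SAMPLE))  # [('192.0.2.10', 22, 'ssh')]
```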
    
    >> We need automated creation, distribution and analysis to work
    >> across numerous different platforms with products from hundreds of
    >> different vendors.
    >
    > We can express the semantics that we need with a record that's a
    > linear list of whitespace-separated tokens on a single text line,
    > with some fixed fields always required, followed by hierarchically
    > assigned tokens, first one defining a category ("OS"; "Firewall";
    > "DB"; "Webserver"; ...); then a separate list of tokens for each of
    > those categories; the remainder of the record with format defined
    > appropriately for each of those; and so forth.
    ...
    > It's only a huge amount of work for people who have trouble
    > splitting text on whitespace. I hope we won't have them working on
    
    A simple whitespace delimited format is something like tab-delimited 
    text - raw fields without metadata. What you're describing now is 
    different:
    
    - per-record metadata
    - support for hierarchical data
    - support for adding additional user/vendor-specific fields
    
    That last one is important - if I use XML, I can add an attribute or tag 
    anywhere I'd like without causing anything to break - code which isn't 
    looking for it will simply ignore it. This is much cleaner than having 
    to tack everything in a large extension jumble at the end where there's 
    no connection between standard field a and custom field n.
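    
    A minimal Python sketch of what I mean, assuming a made-up <event>
    record (the element and attribute names, including the acme-* vendor
    additions, are hypothetical and not from any proposed log schema). The
    consumer asks for the three fields it knows about and never notices the
    rest:

```python
import xml.etree.ElementTree as ET

# Hypothetical log records: every element and attribute name is invented.
PLAIN = '<event host="fw1" severity="warn"><msg>link down</msg></event>'
EXTENDED = (
    '<event host="fw1" severity="warn" acme-rack="r12">'
    "<acme-ticket>4711</acme-ticket>"
    "<msg>link down</msg>"
    "</event>"
)


def summarize(xml_text):
    """Read only the fields this consumer knows about."""
    event = ET.fromstring(xml_text)
    return (event.get("host"), event.get("severity"), event.findtext("msg"))


# The vendor additions do not change what the older consumer sees.
print(summarize(PLAIN) == summarize(EXTENDED))  # True
```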
    
    I think you are consistently underestimating the amount of work involved
    in producing a fast, bug-free parser for a delimited format which is
    continually being redefined, while being careful to handle fun things
    like missing or overly long fields, embedded whitespace, Unicode text
    (including Unicode whitespace, line breaks, etc.) and assorted special
    characters (e.g. if someone does something cute like including a null -
    hope you used strtok() securely).
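    
    As a small illustration of two of those cases, here is a deliberately
    naive Python sketch. The field layout and the parse_record() helper are
    invented for the example. It shows how an embedded space or a missing
    field shifts every later value into the wrong slot without raising any
    error:

```python
def parse_record(line, field_names):
    """Naive parser: split on whitespace and pair tokens with field names."""
    return dict(zip(field_names, line.split()))


# Hypothetical fixed field layout for one log record.
FIELDS = ["time", "host", "user", "action"]

good = parse_record("16:02:18 fw1 alice login", FIELDS)

# Embedded whitespace in the user name: "smith" lands in the action slot
# and the real action is silently dropped.
bad = parse_record("16:02:18 fw1 alice smith login", FIELDS)

# Missing user field: the action is filed under "user" and "action" vanishes.
short = parse_record("16:02:18 fw1 login", FIELDS)

print(good["action"])     # login
print(bad["action"])      # smith
print("action" in short)  # False
```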
    
    The hierarchical bit really means that this cannot be described as simple. 
    Nesting is a really good idea but it increases the complexity of your 
    parser and opens up plenty of opportunities for bugs.
    
    Still, that isn't a huge amount of work to do - once. Now we have a 
    non-standard format supported in one or two environments. Nobody else 
    knows anything about it, absolutely nothing supports it and people who 
    aren't using the same language or environment we prefer get to do it all 
    over again. That's where the huge amount of work comes from - to make it 
    as easy to use or widespread as XML, you need to provide support in 
    numerous different languages on numerous different platforms and 
    environments, taking care to maintain ease of use and provide 
    documentation for anyone who's going to have to touch this new format.
    
    Or you could just use XML, for which all of this has already been done.
    Plus, more programmers have worked on optimizing and bug-fixing the
    existing parsers, and they're heavily tested, too.
    
    > Or we could express the same thing with XML, to buy ourselves some
    > buzzword compliance at the expense of a preposterously more complex
    > (==inefficient, nonportable, bugridden, security-problem-inducing)
    
    You take this as an article of faith but have yet to support it in any 
    way. XML isn't just documents - people are using it to interface between 
    different systems, replace RPC and otherwise provide a common format 
    where they need one. Given that XML is in widespread use for high-volume 
    tasks you'd think one of the programmers using it might have noticed 
    such critical flaws.
    
    > if we _use_ the extra flexibility XML gives us, we lose the ability to
    > automate the processing.
    
    I assume you had some specific point in mind here but I can't think of a 
    case in which this statement isn't wrong. Could you share your rationale?
    
    Remember that XML allows you to specify requirements (that's what a 
    validating parser checks) but it's always extensible. There's no reason 
    why an XML tag cannot contain additional elements beyond the ones it is 
    required to have and any code which uses it will simply ignore the 
    additional fields unless it is specifically written to look for them.
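    
    Here is a rough Python sketch of that split between "required" and
    "extra". It is a hand-rolled check standing in for what a DTD or schema
    plus a validating parser would do (Python's ElementTree does not
    validate), and the required element names are assumptions of mine:

```python
import xml.etree.ElementTree as ET

# Hypothetical list of child elements every <event> must carry.
REQUIRED = ("timestamp", "source", "msg")


def missing_required(xml_text):
    """Return the required child elements that are absent; extras are ignored."""
    event = ET.fromstring(xml_text)
    present = {child.tag for child in event}
    return [name for name in REQUIRED if name not in present]


complete = (
    "<event><timestamp>2002-08-22T16:02:18</timestamp>"
    "<source>fw1</source><msg>link down</msg>"
    "<vendor-note>extra field, not required</vendor-note></event>"
)
incomplete = "<event><source>fw1</source></event>"

print(missing_required(complete))    # []
print(missing_required(incomplete))  # ['timestamp', 'msg']
```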
    
    >> Of course, there's a different answer to "I sure know which I'd rather
    >> parse": it'll take me less time to do "$events = XMLIn('syslog.xml')"
    >> than it will to parse anything.
    >
    > Now find me an XML parser that can be used that way, portably across
    > platforms, which performs reasonably well, and
    
    That's Perl's XML::Simple, which is very widely available. Similar tools
    such as DOMXML are also widely available. If you're into 
    Microsoft stuff, there are quite a few options there.
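    
    For anyone not using Perl, the same "slurp it into a data structure"
    idea takes only a few lines elsewhere. This is a loose Python analogue
    built on the standard library, not a port of XML::Simple: xml_in() and
    to_dict() are names I made up, it takes a string rather than a file
    name, and it assumes attribute names and child tags never collide.

```python
import xml.etree.ElementTree as ET


def to_dict(element):
    """Fold an element into a dict of attributes plus lists of children.

    Leaf elements with no attributes collapse to their text.
    Assumes attribute names and child tag names do not collide.
    """
    node = dict(element.attrib)
    for child in element:
        node.setdefault(child.tag, []).append(to_dict(child))
    text = (element.text or "").strip()
    if not node:
        return text
    if text:
        node["content"] = text
    return node


def xml_in(xml_text):
    """Parse an XML string into nested dicts, lists and strings."""
    return to_dict(ET.fromstring(xml_text))


# Hypothetical log document; the element names are invented.
SYSLOG = (
    "<log>"
    '<event host="fw1"><msg>link down</msg></event>'
    '<event host="db2"><msg>disk full</msg></event>'
    "</log>"
)

events = xml_in(SYSLOG)
for event in events["event"]:
    print(event["host"], event["msg"][0])
```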
    
    >> There seemed to be a general consensus that we need a replacement for
    >> syslog [...]
    >
    > I missed that consensus. We need a more formally structured log
    > format, allowing more structured representation of hierarchical
    > classification, and perhaps some other things.
    
    In other words, what I said: we need something structured which is 
    easily extensible. Any tagged format meets that requirement but a 
    *simple* whitespace-delimited format does not.
    
    > XML is more work for everybody to implement, and makes the job
    > harder, and less likely to succeed. XML sucks.
    
    The religious tone here is really making me lose interest in this thread.
    
    Chris
    
    _______________________________________________
    LogAnalysis mailing list
    LogAnalysisat_private
    https://lists.shmoo.com/mailman/listinfo/loganalysis
    


