Re: [logs] perl question relating to log analysis

From: Andy_Bachat_private
Date: Thu Aug 29 2002 - 12:29:46 PDT

    I asked a perl group expert ...
    
    "To understand recursion, we must first understand recursion."
    ----- Forwarded by Andy Bach/WIWB/07/USCOURTS on 08/29/02 02:28 PM -----
    
    
    Quoting Adam Rice wysiwygat_private
    
    I've done a lot of log-munging in Perl, and I must report that for any
    significant amount of logs, regexps just aren't fast enough. In some cases
    I've found a solution using index() and rindex() that was adequate. But once
    you get to that level of optimisation, Perl becomes as ugly as C, and the C
    solution is generally more flexible (because it doesn't have to be
    hand-optimised to death to achieve acceptable speed).
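
    [A minimal sketch of that index()/rindex() style, assuming Common Log
    Format lines where the request is the first quoted field and the byte
    count is the last space-separated field:]

    # Sketch only: pull the request and byte count out of a CLF line
    # without firing up the regexp engine.
    while (my $line = <STDIN>) {
        my $q1 = index($line, '"');             # opening quote of the request
        next if $q1 < 0;
        my $q2 = index($line, '"', $q1 + 1);    # closing quote of the request
        next if $q2 < 0;
        my $request = substr($line, $q1 + 1, $q2 - $q1 - 1);
        my $sp      = rindex($line, ' ');       # last field is the byte count
        my $bytes   = substr($line, $sp + 1);
        chomp $bytes;
        # ... use $request and $bytes ...
    }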
    
    If you have to use regexps, it's worth tinkering with them. Often with
    careful use of character classes, you can save Perl from having to do
    backtracking. Try to avoid anchoring from the end of the string... it looks
    like it should be fast, but in my experience it isn't. Anchor to the start
    of the string where it makes sense, but not if it makes the regexp more
    complicated. Complex regular expressions are really slow, so try breaking
    them down into several smaller ones.
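
    [An illustrative sketch of the character-class point: both patterns below
    pull fields out of a Common Log Format line, but the second anchors at the
    start and uses negated character classes, so the engine walks the line
    roughly once instead of backtracking through the nested .* groups.]

    my $backtracky = qr/^(.*) \[(.*)\] "(.*)" (\d+) (\d+)/;
    my $tighter    = qr/^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d+) (\d+|-)/;

    if ($line =~ $tighter) {
        my ($host, $when, $request, $status, $bytes) = ($1, $2, $3, $4, $5);
        # ... $bytes is "-" when no body was sent ...
    }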
    
    On the other hand, for doing ad-hoc queries against server logs, Perl is
    usually the language of choice. Cute tip: since the grep variants are way
    faster than Perl, use them to narrow the field before Perl does the grunt
    work. Say you want a list of JPEG files larger than 200k, together with how
    often they were served:
    
    zgrep -F ".jpg" logfile.gz | egrep ' [0-9][0-9][0-9][0-9][0-9][0-9] ' |
      perl -ne 'print "$1\t$2\n" if / "GET (\/[^\s\"]+)[^"]*" \d+ (\d+) / && $2>200*1024' |
      sort | uniq -c
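
    [Two notes on the pipe above. First, the egrep pre-filter as written only
    passes byte counts that are exactly six digits long, so anything of a
    megabyte or more never reaches Perl; ' [0-9]{6,} ' would keep those too.
    Second, a commented expansion of the perl -ne stage, assuming the usual
    Common Log Format where the quoted request is followed by the status code
    and the byte count:]

    # Same test as the one-liner, spelled out.
    while (my $line = <STDIN>) {
        if ($line =~ / "GET (\/[^\s\"]+)[^"]*" \d+ (\d+) /) {
            my ($path, $bytes) = ($1, $2);
            print "$path\t$bytes\n" if $bytes > 200 * 1024;   # larger than 200k
        }
    }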
    
    Always test on a subset of your logs first! Where I work, a command like
    this will take an hour on a full month's logs, and you'll be very annoyed if
    you wait that long to discover you made a typo.
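
    [One way to do that, for example: decompress a few thousand lines and run
    them through the same filters before committing to the full file:]

    zcat logfile.gz | head -n 10000 | grep -F ".jpg" |
      egrep ' [0-9][0-9][0-9][0-9][0-9][0-9] ' |
      perl -ne 'print "$1\t$2\n" if / "GET (\/[^\s\"]+)[^"]*" \d+ (\d+) / && $2>200*1024' |
      sort | uniq -c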
    
    Tip 3: "top" is good for getting an immediate idea of how efficient your
    command is. Ideally you want the "gzip" process using 80% or more of the
    CPU. If it's only pulling 20%, it'll take four times as long. With a
    multi-stage pipe like this, you can easily see which stage is the bottleneck.
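
    [A quick cross-check along the same lines: time the decompression stage on
    its own; if the full pipe takes much longer than that, one of the later
    stages is the bottleneck.]

    time zcat logfile.gz > /dev/null                 # decompression alone
    time zgrep -F ".jpg" logfile.gz > /dev/null      # decompression + first filter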
    
    Adam
    
    -- 
    Adam Rice -- wysiwygat_private -- Blackburn, Lancashire, England
    
    
    _______________________________________________
    LogAnalysis mailing list
    LogAnalysisat_private
    http://lists.shmoo.com/mailman/listinfo/loganalysis
    