Re: [logs] perl question relating to log analysis (fwd)

From: Glenn Forbes Fleming Larratt (glrattat_private)
Date: Wed Aug 28 2002 - 10:38:15 PDT


    More Perl - hit DELETE now if uninterested.
    
    Followup question: what efficiency considerations would apply if one
    were to replace:
    
    return 2 if /msg 2/;
    return 1 if /msg 1/;
    return 3 if /msg 3/;
    
    with:
    
    %returnvalue =
    (
      "msg 1" => 1,
      "msg 2" => 2,
      "msg 3" => 3,
    );
    # Assumption: all incidences of "msg 1", "msg 2", "msg 3" can be
    # isolated using a parenthesized regexp
    ###
    # various stuff, then...
    if(/$subsidiary_regexp_before($regexp_to_glean_msg)$subsidiary_regexp_after/)
    { return $returnvalue{$1}; }
    
    ? If I remember my CompSci studies correctly, the former is O(n) in both
    the average case and the worst case (n being the number of patterns),
    whereas the latter is O(1) across the board; however, I don't know
    enough Perl internals to know whether the interpreter can optimize the
    former better than the latter.
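
The two approaches in the question can be put side by side as a minimal sketch. The sub names classify_sequential and classify_table are hypothetical, and the alternation built from the hash keys stands in for the unspecified "parenthesized regexp"; it is one possible reading of the question, not a measured recommendation.

```perl
use strict;
use warnings;

# Lookup table from the question: captured text => return value.
my %returnvalue = (
    "msg 1" => 1,
    "msg 2" => 2,
    "msg 3" => 3,
);

# Build one alternation from the hash keys. quotemeta guards any regex
# metacharacters in the keys; qr// compiles the pattern once.
my $alternation = join '|', map { quotemeta($_) } sort keys %returnvalue;
my $msg_re      = qr/($alternation)/;

# Sequential version: up to three separate pattern matches per line.
sub classify_sequential {
    my ($line) = @_;
    return 2 if $line =~ /msg 2/;
    return 1 if $line =~ /msg 1/;
    return 3 if $line =~ /msg 3/;
    return;
}

# Table version: one pattern match, then one hash lookup on the capture.
sub classify_table {
    my ($line) = @_;
    return $returnvalue{$1} if $line =~ $msg_re;
    return;
}

print classify_sequential("foo msg 2 bar"), "\n";
print classify_table("foo msg 3 bar"), "\n";
```

The two subs are not strictly equivalent: for a line containing more than one of the messages, the sequential version returns the first test that succeeds, while the table version returns whichever message appears leftmost in the line. The hash lookup itself is constant time, but the regex engine still has to scan the line, so the overall gain is something to measure rather than assume.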
    
    	-g
    
    
    				Glenn Forbes Fleming Larratt
    				Rice University Network Management
    				glrattat_private
    
    ---------- Forwarded message ----------
    Date: Tue, 27 Aug 2002 16:56:50 +0000 (UTC)
    From: Jeff Schaller <schallerat_private>
    To: Russell Fulton <r.fultonat_private>
    Cc: "loganalysisat_private" <loganalysisat_private>
    Subject: Re: [logs] perl question relating to log analysis
    
    On 27 Aug 2002, Russell Fulton wrote:
    
    > Those who are not interested in perl please hit DELETE now.
    
    Yet More Perl Ahead
    
    
    > > Try analysing your data and putting your most common cases first, so
    > > they will match sooner and return before the rest are executed.
    >
    > Given that the optimizer is working over multiple statements or
    > expressions I don't think the order is actually material.
    
    I think it would. Imagine you have 3 types of log entries.
    Message 1 occurs 10% of the time
    Message 2 occurs 80% of the time
    Message 3 occurs 10% of the time
    
    and that you order your function as follows:
    
    return 1 if /msg 1/;
    return 3 if /msg 3/;
    return 2 if /msg 2/;
    
    then perl has to execute two extraneous (theoretically) pattern
    matches 80% of the time. I think the upshot is to order the tests
    in a best-guess order of frequency:
    
    return 2 if /msg 2/;
    return 1 if /msg 1/;
    return 3 if /msg 3/;
    
    This all assumes that you have a good idea of what your data looks
    like frequency-wise /before you look at it/. I could see this
    community getting that done by collating a bunch of sanitized
    logs, coming up with tight REs to match various messages, and then
    grinding out the various statistics.
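
Grinding out those statistics could look roughly like the sketch below. The patterns, the sample lines, and the names pattern_for, tally, and frequency_order are all made up for illustration; real input would be the collated, sanitized logs.

```perl
use strict;
use warnings;

# Hypothetical patterns, keyed by the value the classifier would return.
my %pattern_for = (
    1 => qr/msg 1/,
    2 => qr/msg 2/,
    3 => qr/msg 3/,
);

# Count how many sample lines each pattern matches.
sub tally {
    my (@lines) = @_;
    my %count = map { $_ => 0 } keys %pattern_for;
    for my $line (@lines) {
        for my $id (keys %pattern_for) {
            $count{$id}++ if $line =~ $pattern_for{$id};
        }
    }
    return %count;
}

# Return the ids sorted most frequent first (ties broken by id).
sub frequency_order {
    my (%count) = @_;
    return sort { $count{$b} <=> $count{$a} || $a <=> $b } keys %count;
}

# Made-up sample standing in for a sanitized log.
my @sample = ("x msg 2", "msg 1 y", "msg 2", "msg 2 z", "msg 3");
my @order  = frequency_order(tally(@sample));
print "test order: @order\n";
```

With this made-up sample, message 2 matches three lines and the other two match one each, so the suggested test order is 2, 1, 3.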
    
    I would also recommend playing with another "speed variable" --
    ordering your regular expressions according to length. REs with
    more static text will be faster to match (or mismatch) than those
    with variability (., [a-z], alternation, etc.).
    E.g.:
    
    if (/seven/)
    
    can fail more quickly against "eight" than can:
    
    if (/^....$/)
    
    as it can fail on the initial "s" vs "e" as opposed to the
    character count difference at the end.
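
Whether that difference is measurable depends on the Perl version and its regex optimizer, so it is probably worth timing rather than guessing. A minimal sketch using the core Benchmark module follows; the labels literal and wildcard are arbitrary, and no particular result is assumed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $input = "eight";

# Both patterns fail against "eight"; only the time taken differs.
# A negative count asks Benchmark to run each sub for at least that
# many CPU seconds.
cmpthese(-1, {
    literal  => sub { my $hit = $input =~ /seven/ },
    wildcard => sub { my $hit = $input =~ /^....$/ },
});
```

cmpthese prints a small table of rates and relative percentages; running it against representative log lines instead of a five-character string would give a more useful picture.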
    
    -jeff
    -- 
    "Space is big.  You just won't believe how vastly, hugely,
     mind-bogglingly big it is.  I mean, you may think it's a
     long way down the road to the drug store, but that's just
     peanuts to space." -- The Hitchhiker's Guide to the Galaxy
    
    _______________________________________________
    LogAnalysis mailing list
    LogAnalysisat_private
    http://lists.shmoo.com/mailman/listinfo/loganalysis
    



    This archive was generated by hypermail 2b30 : Wed Aug 28 2002 - 11:12:57 PDT