Re: [logs] OT: 'Automated Log Analysis'

From: Sweth Chandramouli (loganalysisat_private)
Date: Wed Jun 19 2002 - 00:49:01 PDT


    On Wed, Jun 19, 2002 at 01:11:16AM -0400, Sweth Chandramouli wrote:
    > On Tue, Jun 18, 2002 at 10:56:40AM -0700, Bill Rhodes wrote:
    > > If you are truly after testable data merely for empirical purposes, you
    > > might do well to write a small Perl script or some such which will excerpt
    > > and then anonymize some statistically large yet still manageable chunk of
    > > log data.
    > 	Ask, and ye shall receive; I just sat down and hacked
    > together a pretty nice (if I do say so myself) tool to do this, and
    > have posted it as <http://www.enterpriselogging.net/tools/scrub_log>.  It
    > doesn't handle the task of extracting a random sample, but I've got
    > another script to do that that I'll be posting shortly in that same
    > directory
    	Grr... it appears that when I left Counterpane, the CD that
    they sent me of my personal files that I had on machines there didn't
    include all of the directories I had asked for, including the one that
    had that script in it.  (Tina, I don't suppose you could ask around and
    see if the "tools" subdir of my home dir on muon is still archived
    somewhere, by any chance?)  I've just hacked together a new version, but
    I make no assurances of its algorithmic accuracy, since I had to do
    it from memory.  It's available in the dir referenced in the URL above,
    with the file name random_sample; I'd appreciate any feedback on it,
    especially WRT whether it returns the right numbers of records using all
    of the algorithms supported.
    	Here's the pod from it:
    
        random_sample ( ( ( -m | -d ) <relative_sample_size> ) |
        <absolute_sample_size> ) input_file ...
    
        Sample Invocation:
    
        $ random_sample 1000 /var/log/syslog/daemonlog
    
        $ random_sample -m 10% /var/log/syslog/*
    
        Acts as a filter on input file(s) or standard input, returning a
        (pseudo-)random sample of the input. If an absolute sample size is
        specified as an integer, then that number of line-oriented records are
        returned, and an efficient algorithm is used that only requires a single
        read of the input stream, and whose maximum memory usage is only the
        amount needed to store the sample_size number of records. If the number
        of available records is smaller than the requested sample size, the
        program fails with an informative error.
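
One well-known technique that matches this description (single pass, memory bounded by the sample size) is reservoir sampling. The sketch below is only an illustration under that assumption; the post does not say which algorithm random_sample actually uses, and the function name `reservoir_sample` is hypothetical.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(lines: Iterable[str], k: int,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Pick k records uniformly at random, without replacement, in one pass."""
    rng = rng or random.Random()
    reservoir: List[str] = []
    seen = 0  # number of records read so far
    for seen, line in enumerate(lines, start=1):
        if seen <= k:
            # Fill the reservoir with the first k records.
            reservoir.append(line)
        else:
            # Keep the new record with probability k / seen by
            # overwriting a uniformly chosen slot.
            j = rng.randrange(seen)  # 0 <= j < seen
            if j < k:
                reservoir[j] = line
    if seen < k:
        # Mirrors the documented behaviour: fail when input is too small.
        raise ValueError(f"requested {k} records but input has only {seen}")
    return reservoir
```

Because slots are overwritten in place, the records come back in an order unrelated to their position in the input, which is consistent with the "stable-sorted" note further down.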
    
        If the -m or -d flags are provided then the sample size is assumed to be
        an integer percentage (with or without percentage sign) between 0 and
        100 (inclusive), and that percentage of the line-oriented records given
        as input will be randomly selected and returned. The -d flag causes an
        algorithm to be used that optimizes for disk i/o, by reading the entire
        data stream into memory, determining the total number of records and the
        resulting desired absolute sample size (by rounding up from the
        specified percentage of the total number of records), and then selecting
        an appropriate subset of the input records. The -m flag causes an
        algorithm to be used that optimizes for memory usage, by processing but
        not storing the entire input stream to determine the total number of
        records, calculating the desired absolute sample size as discussed for
        -d, and then rereading the input stream and using the algorithm described
        above for an absolute requested sample size. If both -d and -m are
        specified, -m behaviour is assumed.
    
        NOTES
    
        Note that the -m algorithm requires two passes over the input stream,
        and thus requires that the input stream be provided via file interfaces
        rather than through standard input; behaviour if this requirement is
        violated is undefined.
    
        Note that none of the algorithms is guaranteed to produce a
        "stable-sorted" sample as compared to the original data set.
    
        Note that blank input lines are counted as blank records.
    
        Note that sampling is done in a pseudo-random fashion via the rand()
        function, and is done without replacement.
    
    	-- Sweth.
    
    -- 
    Sweth Chandramouli      Idiopathic Systems Consulting
    svcat_private      http://www.idiopathic.net/
    
    



    This archive was generated by hypermail 2b30 : Wed Jun 19 2002 - 12:57:35 PDT