Re: [logs] OT: 'Automated Log Analysis'

From: Sweth Chandramouli (loganalysisat_private)
Date: Wed Jun 19 2002 - 00:49:01 PDT

Next message: Raistlin: "Re: [logs] Logs & the great unification theory"

Previous message: Sweth Chandramouli: "Re: [logs] OT: 'Automated Log Analysis'"
In reply to: Sweth Chandramouli: "Re: [logs] OT: 'Automated Log Analysis'"
Next in thread: Rajkumar S.: "[logs] How are you analysing logs now?"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, Jun 19, 2002 at 01:11:16AM -0400, Sweth Chandramouli wrote:
> On Tue, Jun 18, 2002 at 10:56:40AM -0700, Bill Rhodes wrote:
> > If you are truly after testable data merely for empirical purposes, you
> > might do well to write a small Perl script or some such which will excerpt
> > and then anonymize some statistically large yet still manageable chunk of
> > log data.
> 	Ask, and ye shall receive; I just sat down and hacked
> together a pretty nice (if I do say so myself) tool to do this, and
> have posted it as <http://www.enterpriselogging.net/tools/scrub_log>.  It
> doesn't handle the task of extracting a random sample, but I've got
> another script to do that that I'll be posting shortly in that same
> directory
	Grr... it appears that when I left Counterpane, the CD that
they sent me of my personal files that I had on machines there didn't
include all of the directories I had asked for, including the one that
had that script in it.  (Tina, I don't suppose you could ask around and
see if the "tools" subdir of my home dir on muon is still archived
somewhere, by any chance?)  I've just hacked together a new version, but
I make no assurances of its algorithmic accuracy, since I had to do
it from memory.  It's available in the dir referenced in the URL above,
with the file name random_sample; I'd appreciate any feedback on it,
especially WRT whether it returns the right numbers of records using all
of the algorithms supported.
	Here's the pod from it:

    random_sample ( ( ( -m | -d ) <relative_sample_size> ) |
    <absolute_sample_size> ) input_file ...

    Sample Invocation:

    $ random_sample 1000 /var/log/syslog/daemonlog

    $ random_sample -m 10% /var/log/syslog/*

    Acts as a filter on input file(s) or standard input, returning a
    (pseudo-)random sample of the input. If an absolute sample size is
    specified as an integer, then that number of line-oriented records are
    returned, and an efficient algorithm is used that only requires a single
    read of the input stream, and whose maximum memory usage is only the
    amount needed to store the sample_size number of records. If the number
    of available records is smaller than the requested sample size, the
    program fails with an informative error.
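
One well-known technique with the properties described above (a single
read of the input, memory bounded by the sample size) is reservoir
sampling. The pod does not say which algorithm random_sample actually
uses, so the following is only a minimal sketch of that technique in
Python; the function name reservoir_sample and its parameters are
hypothetical and are not taken from the script.

```python
import random


def reservoir_sample(lines, k, rng=random):
    """Pick k records uniformly at random, without replacement, in one pass.

    Illustrative sketch only (classic reservoir sampling); not the code of
    the random_sample script discussed in this message.
    """
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(line)
        else:
            # Record i replaces a stored record with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = line
    if len(reservoir) < k:
        # Mirrors the documented behaviour: too few records is an error.
        raise ValueError(
            f"requested {k} records but input has only {len(reservoir)}"
        )
    return reservoir
```

Called as reservoir_sample(open(path), 1000), this never holds more than
1000 lines at once. The records come back in reservoir order rather than
input order.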

    If the -m or -d flags are provided then the sample size is assumed to be
    an integer percentage (with or without percentage sign) between 0 and
    100 (inclusive), and that percentage of the line-oriented records given
    as input will be randomly selected and returned. The -d flag causes an
    algorithm to be used that optimizes for disk i/o, by reading the entire
    data stream into memory, determining the total number of records and the
    resulting desired absolute sample size (by rounding up from the
    specified percentage of the total number of records), and then selecting
    an appropriate subset of the input records. The -m flag causes an
    algorithm to be used that optimizes for memory usage, by processing but
    not storing the entire input stream to determine the total number of
    records, calculating the desired absolute sample size as discussed for
    -d, and then rereading the input stream and using the algorithm described
    above for an absolute requested sample size. If both -d and -m are
    specified, -m behaviour is assumed.
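
The two percentage modes can be sketched roughly as follows, again in
Python and again as an illustration under stated assumptions rather than
the script's own code. All function names here are hypothetical, and the
sketch takes a single file path where the real tool accepts several
input files.

```python
import random


def sample_size_from_percent(total_records, percent):
    """Convert an integer percentage to a record count, rounding up."""
    if not 0 <= percent <= 100:
        raise ValueError("percentage must be between 0 and 100 inclusive")
    # Integer ceiling of total_records * percent / 100.
    return (total_records * percent + 99) // 100


def sample_in_memory(path, percent):
    """-d style: one read of the file, all records held in memory."""
    with open(path) as f:
        records = f.readlines()
    k = sample_size_from_percent(len(records), percent)
    return random.sample(records, k)  # selection without replacement


def sample_two_pass(path, percent):
    """-m style: count records first, then reread with a bounded buffer."""
    with open(path) as f:
        total = sum(1 for _ in f)  # pass 1: count only, store nothing
    k = sample_size_from_percent(total, percent)
    reservoir = []
    with open(path) as f:  # pass 2: single-pass sample of k records
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)
            else:
                j = random.randrange(i + 1)
                if j < k:
                    reservoir[j] = line
    return reservoir
```

With 1000 input records and 10%, both functions return 100 records; with
7 records and 10%, rounding up gives 1. The second function opens the
file twice, which is why this mode cannot work on standard input.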

    NOTES

    Note that the -m algorithm requires two passes over the input stream,
    and thus requires that the input stream be provided via file interfaces
    rather than through standard input; behaviour if this requirement is
    violated is undefined.

    Note that none of the algorithms is guaranteed to produce a
    "stable-sorted" sample as compared to the original data set.

    Note that blank input lines are counted as blank records.

    Note that sampling is done in a pseudo-random fashion via the rand()
    function, and is done without replacement.

	-- Sweth.

-- 
Sweth Chandramouli      Idiopathic Systems Consulting
svcat_private      http://www.idiopathic.net/

---------------------------------------------------------------------
To unsubscribe, e-mail: loganalysis-unsubscribeat_private
For additional commands, e-mail: loganalysis-helpat_private


This archive was generated by hypermail 2b30 : Wed Jun 19 2002 - 12:57:35 PDT