On Wed, Jun 19, 2002 at 01:11:16AM -0400, Sweth Chandramouli wrote:
> On Tue, Jun 18, 2002 at 10:56:40AM -0700, Bill Rhodes wrote:
> > If you are truly after testable data merely for empirical purposes, you
> > might do well to write a small Perl script or some such which will excerpt
> > and then anonymize some statistically large yet still manageable chunk of
> > log data.
> Ask, and ye shall receive; I just sat down and hacked
> together a pretty nice (if I do say so myself) tool to do this, and
> have posted it as <http://www.enterpriselogging.net/tools/scrub_log>. It
> doesn't handle the task of extracting a random sample, but I've got
> another script to do that that I'll be posting shortly in that same
> directory

Grr... it appears that when I left Counterpane, the CD that they sent me of my personal files from the machines there didn't include all of the directories I had asked for, including the one that had that script in it. (Tina, I don't suppose you could ask around and see if the "tools" subdir of my home dir on muon is still archived somewhere, by any chance?)

I've just hacked together a new version, but I make no assurances of its algorithmic accuracy, since I had to do it from memory. It's available in the dir referenced in the URL above, with the file name random_sample; I'd appreciate any feedback on it, especially WRT whether it returns the right number of records using all of the algorithms supported. Here's the pod from it:

random_sample ( ( ( -m | -d ) <relative_sample_size> ) | <absolute_sample_size> ) input_file ...

Sample Invocation:

    $ random_sample 1000 /var/log/syslog/daemonlog
    $ random_sample -m 10% /var/log/syslog/*

Acts as a filter on input file(s) or standard input, returning a (pseudo-)random sample of the input.
If an absolute sample size is specified as an integer, then that number of line-oriented records is returned, and an efficient algorithm is used that only requires a single read of the input stream, and whose maximum memory usage is only the amount needed to store the sample_size number of records. If the number of available records is smaller than the requested sample size, the program fails with an informative error.

If the -m or -d flags are provided, then the sample size is assumed to be an integer percentage (with or without percentage sign) between 0 and 100 (inclusive), and that percentage of the line-oriented records given as input will be randomly selected and returned.

The -d flag causes an algorithm to be used that optimizes for disk i/o, by reading the entire data stream into memory, determining the total number of records and the resulting desired absolute sample size (by rounding up from the specified percentage of the total number of records), and then selecting an appropriate subset of the input records.

The -m flag causes an algorithm to be used that optimizes for memory usage, by processing but not storing the entire input stream to determine the total number of records, calculating the desired absolute sample size as discussed for -d, and then rereading the input stream and using the algorithm described above for an absolute requested sample size.

If both -d and -m are specified, -m behaviour is assumed.

NOTES

Note that the -m algorithm requires two passes over the input stream, and thus requires that the input stream be provided via file interfaces rather than through standard input; behaviour if this requirement is violated is undefined.

Note that none of the algorithms is guaranteed to produce a "stable-sorted" sample as compared to the original data set.

Note that blank input lines are counted as blank records.

Note that sampling is done in a pseudo-random fashion via the rand() function, and is done without replacement.

-- Sweth.
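
For readers who want to check the record counts, here is a minimal sketch of one technique that fits the pod's description of the absolute-size case: a single read of the input, with memory bounded by the sample size. It uses reservoir sampling, which is an assumption here; the pod does not name the algorithm random_sample actually uses, and the sketch is in Python rather than the tool's own language. The function names reservoir_sample and absolute_size are hypothetical and are not part of random_sample.

```python
import random


def reservoir_sample(records, k, rng=random):
    """Pick k records uniformly, without replacement, in one pass.

    Hypothetical helper: reservoir sampling is assumed, not confirmed,
    as the technique behind the absolute-size mode.
    """
    reservoir = []
    for seen, record in enumerate(records, start=1):
        if seen <= k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / seen by
            # overwriting a uniformly chosen slot.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = record
    if len(reservoir) < k:
        raise ValueError(
            f"only {len(reservoir)} records available, {k} requested"
        )
    return reservoir


def absolute_size(total, percent):
    """Round up percent of total, as the pod describes for -d and -m."""
    if not 0 <= percent <= 100:
        raise ValueError("percentage must be between 0 and 100 inclusive")
    # Integer ceiling avoids floating-point rounding surprises.
    return (total * percent + 99) // 100
```

Under these assumptions, the -m behaviour would correspond to counting the records in a first pass, calling absolute_size(total, percent), and then running reservoir_sample over a second pass of the same files. The overwrite step means the output order need not match the input order, which is consistent with the "stable-sorted" note above.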
--
Sweth Chandramouli      Idiopathic Systems Consulting
svcat_private           http://www.idiopathic.net/
This archive was generated by hypermail 2b30 : Wed Jun 19 2002 - 12:57:35 PDT