[logs] Re: An Algorithm for Traffic Baselines

wrehmanat_private

Dear Sweth,

Thanks a lot for your input. I will do some more research on the t-test
and the Wilcoxon test to get the initial baseline as correct as possible.
I will update the paper, modify the algorithm shortly, and upload the
changes again. Once again, thanks for your input :)

Best Regards
Wajih-ur-Rehman

> -----Original Message-----
> From: Sweth Chandramouli [mailto:loganalysisat_private]
> Sent: Thursday, August 21, 2003 6:15 AM
> To: LogAnalysisat_private
> Subject: Re: [logs] An Algorithm for Traffic Baselines
>
>
> On Wednesday, 20 August 2003 at 01:22:31 EDT,
>    Wajih-ur-Rehman (Wajih-ur-Rehman <wrehmanat_private>) wrote:
> > I have been developing an algorithm for Traffic Baselines. I have
> > written a paper on it.
>
> Assuming I'm reading your doc right, your basic algorithm is to look
> at a set of data and continually discard any values that aren't within
> 30% of the mean, until the mean stops changing.  Is there any reason
> to believe that the resulting value has ANY statistical significance
> as a baseline, let alone more significance than something like the
> mean or median?
>
> Among other things, when you start discarding values (which is almost
> never statistically justified unless you've got a meaningful model to
> explain the outliers) using an absolute offset, then you open yourself
> up to the possibility of having a null set as your result if the data
> is at all skewed.  Try your algorithm with the input set of, say,
> [ 0 1 2 3 4 5 6 7 8 9 10 12 14 60001 60002 65534 108 109 110 111 112 113 ]
> ; the first iteration returns a "baseline" of infinity.
>
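The failure described above can be reproduced with a minimal sketch. It
assumes the algorithm works the way Sweth reads it (keep only values
within 30% of the current mean, repeat until nothing more is dropped);
the paper itself may differ. The function name trimmed_baseline and the
choice to return None for an empty result are illustrative assumptions.

```python
def trimmed_baseline(values, tolerance=0.30):
    """Repeatedly drop values farther than `tolerance` * mean from the mean.

    Returns the mean once no value is dropped, or None if every value
    has been discarded (the "null set" case).
    """
    current = list(values)
    while current:
        mean = sum(current) / len(current)
        band = tolerance * abs(mean)
        kept = [v for v in current if abs(v - mean) <= band]
        if len(kept) == len(current):
            return mean  # nothing was dropped, so the mean is stable
        current = kept
    return None  # nothing left to average


# The input set from the message above.
sample = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14,
          60001, 60002, 65534, 108, 109, 110, 111, 112, 113]

print(trimmed_baseline(sample))
```

The mean of the sample is about 8467, so the 30% band runs from roughly
5927 to 11007. No value falls inside it, and the first pass discards
the whole set.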
> The normal way to discard outliers is to use something like a t-test
> (or Wilcoxon test if you can't assume normality of your distribution),
> and the odds of discarding even a few values should be pretty minimal;
> the entire point of creating a baseline is that you want to look at
> what is "normal" for the ENTIRE distribution--not just for some
> arbitrary subset of the values.  If you absolutely have to reinvent
> the wheel and come up with your own algorithm, at least use a
> criterion for outliers that is based on ranked order rather than an
> absolute offset, such as an interquartile distance test or something
> like that.
>
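One possible reading of the "interquartile distance test" mentioned
above is the common rule of flagging anything more than 1.5 times the
interquartile range outside the quartiles. The sketch below assumes that
rule; the 1.5 multiplier, the function name iqr_outliers, and the
"inclusive" quantile method are choices made here, not something the
message specifies. It needs Python 3.8 or later for
statistics.quantiles.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Split values into (kept, dropped) using quartile-based fences."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    kept = [v for v in values if low <= v <= high]
    dropped = [v for v in values if v < low or v > high]
    return kept, dropped


# The same input set used earlier in the thread.
sample = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14,
          60001, 60002, 65534, 108, 109, 110, 111, 112, 113]

kept, dropped = iqr_outliers(sample)
print(len(kept), dropped)
```

Because the fences come from ranked positions rather than from the mean,
the three very large values cannot drag the cut-off along with them, and
the result can never be an empty set.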
> In general, though, your basic summary statistics (mean, std
> deviation) are sufficient for determining baselines for this type of
> data.  If you are dealing with data that has a trending component, you
> can use rolling averages and exponential smoothing; if you have
> seasonal variability to deal with, then look into things like
> Holt-Winters Forecasting.
>
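As a rough illustration of the rolling average and exponential
smoothing mentioned above, here is a minimal sketch. The function names,
the window size, and the smoothing factor alpha are all illustrative
assumptions; Holt-Winters, which adds trend and seasonal terms on top of
this, is not shown.

```python
def rolling_mean(values, window):
    """Mean of each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def exp_smooth(values, alpha):
    """Simple exponential smoothing: s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))  # seed with the first observation
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical traffic counts with an upward trend.
traffic = [100, 104, 103, 110, 115, 121]
print(rolling_mean(traffic, 3))
print(exp_smooth(traffic, 0.5))
```

A larger alpha tracks recent samples more closely; a smaller one gives a
steadier baseline that reacts more slowly to a genuine shift.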
> (Jake Brutlag gave a really good presentation on using Holt-Winters to
> detect outliers in traffic data at the LISA conference in NoLA a few
> years ago that you can probably find on the web somewhere, including
> some code to implement it in RRDTool.  <plug>I'll also be leading a
> half-day tutorial on things like this at LISA in San Diego this
> October.</plug>)
>
> -- Sweth, who really needs to finish up his slides for that tutorial
> sometime soon.
>
> -- 
> Sweth Chandramouli      Idiopathic Systems Consulting
> svcat_private      http://www.idiopathic.net/
> _______________________________________________
> LogAnalysis mailing list
> LogAnalysisat_private
> http://lists.shmoo.com/mailman/listinfo/loganalysis
>