Dear Sweth,

Thanks a lot for your input. I will do some more research on the t-test
and the Wilcoxon test to get the initial baseline as correct as possible.
I will update the paper, modify the algorithm, and upload the changes
again shortly.

Once again, thanks for your input :)

Best Regards
Wajih-ur-Rehman

> -----Original Message-----
> From: Sweth Chandramouli [mailto:loganalysisat_private]
> Sent: Thursday, August 21, 2003 6:15 AM
> To: LogAnalysisat_private
> Subject: Re: [logs] An Algorithm for Traffic Baselines
>
>
> On Wednesday, 20 August 2003 at 01:22:31 EDT,
> Wajih-ur-Rehman (Wajih-ur-Rehman <wrehmanat_private>) wrote:
> > I have been developing an algorithm for Traffic Baselines. I have
> > written a paper on it.
>
> Assuming I'm reading your doc right, your basic algorithm is to look at
> a set of data and continually discard any values that aren't within 30%
> of the mean, until the mean stops changing. Is there any reason to
> believe that the resulting value has ANY statistical significance as a
> baseline, let alone more significance than something like the mean
> or median?
>
> Among other things, when you start discarding values (which is almost
> never statistically justified unless you've got a meaningful model to
> explain the outliers) using an absolute offset, then you open yourself
> up to the possibility of having a null set as your result if the data is
> at all skewed. Try your algorithm with the input set of, say,
> [ 0 1 2 3 4 5 6 7 8 9 10 12 14 60001 60002 65534 108 109 110 111 112 113 ]
> ; the first iteration returns a "baseline" of infinity.
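For reference, here is a minimal Python sketch of the trimming procedure
as Sweth reads it above. The paper's exact rule is not reproduced in this
thread, so the 30% band, the stopping condition, and the name
trimmed_baseline are all assumptions made for illustration. On the sample
input no value lies within 30% of the first mean, so nothing survives the
first pass; the sketch reports that case as None instead of dividing by
zero.

```python
from statistics import mean

def trimmed_baseline(values, band=0.30, max_iter=100):
    """Hypothetical reading of the algorithm under discussion.

    Repeatedly drop values outside +/- band of the current mean until the
    mean stops changing. Returns the final mean, or None if every value
    ends up discarded (the "null set" case).
    """
    current = list(values)
    if not current:
        return None
    for _ in range(max_iter):
        m = mean(current)
        # sorted() keeps the bounds in order even if the mean is negative
        lo, hi = sorted((m * (1 - band), m * (1 + band)))
        kept = [v for v in current if lo <= v <= hi]
        if not kept:
            return None  # empty set: no baseline can be computed
        if mean(kept) == m:
            return m  # mean stopped changing
        current = kept
    return mean(current)

sample = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14,
          60001, 60002, 65534, 108, 109, 110, 111, 112, 113]

print(mean(sample))              # about 8467.3
print(trimmed_baseline(sample))  # None
```

The first mean is about 8467.3, so the keep-band is roughly 5927 to 11007,
and the sample has no values in that range.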
>
> The normal way to discard outliers is to use something like a t-test
> (or Wilcoxon test if you can't assume normality of your distribution),
> and the odds of discarding even a few values should be pretty minimal;
> the entire point of creating a baseline is that you want to look at
> what is "normal" for the ENTIRE distribution--not just for some
> arbitrary subset of the values. If you absolutely have to reinvent the
> wheel and come up with your own algorithm, at least use a criterion for
> outliers that is based on ranked order rather than an absolute offset,
> such as an interquartile distance test or something like that.
>
> In general, though, your basic summary statistics (mean, std deviation)
> are sufficient for determining baselines for this type of data. If you
> are dealing with data that has a trending component, you can use rolling
> averages and exponential smoothing; if you have seasonal variability to
> deal with, then look into things like Holt-Winters Forecasting.
>
> (Jake Brutlag gave a really good presentation on using Holt-Winters to
> detect outliers in traffic data at the LISA conference in NoLA a few
> years ago that you can probably find on the web somewhere, including some
> code to implement it in RRDTool. <plug>I'll also be leading a half-day
> tutorial on things like this at LISA in San Diego this October.</plug>)
>
> -- Sweth, who really needs to finish up his slides for that tutorial
> sometime soon.
>
> --
> Sweth Chandramouli      Idiopathic Systems Consulting
> svcat_private      http://www.idiopathic.net/
> _______________________________________________
> LogAnalysis mailing list
> LogAnalysisat_private
> http://lists.shmoo.com/mailman/listinfo/loganalysis
>
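The ranked-order criterion and the smoothing suggested above could look
roughly like the sketch below. Sweth does not say which interquartile
test he has in mind, so Tukey's 1.5 * IQR fences are an assumption here,
as are the names iqr_filter and exp_smooth and the default alpha of 0.3.
Seasonal Holt-Winters forecasting is not covered.

```python
from statistics import quantiles

def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences, assumed)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    lo, hi = q1 - k * spread, q3 + k * spread
    return [v for v in values if lo <= v <= hi]

def exp_smooth(series, alpha=0.3):
    """Simple exponential smoothing: s[t] = alpha*x[t] + (1 - alpha)*s[t-1].

    The first smoothed value is seeded with the first observation.
    """
    smoothed = []
    for x in series:
        prev = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1 - alpha) * prev)
    return smoothed

sample = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14,
          60001, 60002, 65534, 108, 109, 110, 111, 112, 113]

kept = iqr_filter(sample)
print(len(kept), max(kept))  # 19 113
print(exp_smooth([0, 10, 10], alpha=0.5))  # [0.0, 5.0, 7.5]
```

Because the fences come from ranked positions rather than an offset from
the mean, the three values above 60000 are dropped while the other 19
survive, so the result cannot collapse to an empty set on this input.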
This archive was generated by hypermail 2b30 : Wed Aug 27 2003 - 09:02:24 PDT