[logs] Re: An Algorithm for Traffic Baselines

From: Wajih-ur-Rehman (wrehmanat_private)
Date: Wed Aug 27 2003 - 02:45:25 PDT


    Dear Sweth,
    
    Thanks a lot for your input. I will do some more research on the t-test
    and the Wilcoxon test to get the initial baseline as correct as possible.
    I will update the paper, modify the algorithm shortly, and upload the
    changes again. Once again, thanks for your input :)
    
    Best Regards
    Wajih-ur-Rehman
    
    
    > -----Original Message-----
    > From: Sweth Chandramouli [mailto:loganalysisat_private]
    > Sent: Thursday, August 21, 2003 6:15 AM
    > To: LogAnalysisat_private
    > Subject: Re: [logs] An Algorithm for Traffic Baselines
    >
    >
    > On Wednesday, 20 August 2003 at 01:22:31 EDT,
    >    Wajih-ur-Rehman (Wajih-ur-Rehman <wrehmanat_private>) wrote:
    > > I have been developing an algorithm for Traffic Baselines. I have
    > > written a paper on it.
    >
    > Assuming I'm reading your doc right, your basic algorithm is to look
    > at a set of data and continually discard any values that aren't within
    > 30% of the mean, until the mean stops changing.  Is there any reason to
    > believe that the resulting value has ANY statistical significance as a
    > baseline, let alone more significance than something like the mean or
    > median?
    >
    > Among other things, when you start discarding values (which is almost
    > never statistically justified unless you've got a meaningful model to
    > explain the outliers) using an absolute offset, then you open yourself
    > up to the possibility of having a null set as your result if the data
    > is at all skewed.  Try your algorithm with the input set of, say,
    > [ 0 1 2 3 4 5 6 7 8 9 10 12 14 60001 60002 65534 108 109 110 111 112 113 ];
    > the first iteration returns a "baseline" of infinity.
    >
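The procedure as paraphrased above can be sketched in a few lines. This is only a reading of that paraphrase, not the code from the paper: the function name `trimmed_baseline`, the `max_iter` guard, and the choice to return `None` for an empty result are assumptions made for illustration.

```python
from statistics import mean

def trimmed_baseline(values, tolerance=0.30, max_iter=100):
    """Repeatedly keep only the values within `tolerance` of the current mean.

    Returns the final mean, or None if every value ends up discarded.
    """
    kept = list(values)
    for _ in range(max_iter):
        if not kept:
            return None  # null set: there is nothing left to average
        m = mean(kept)
        # sorted() keeps the bounds in order even if the mean is negative
        lo, hi = sorted((m * (1 - tolerance), m * (1 + tolerance)))
        survivors = [v for v in kept if lo <= v <= hi]
        if survivors == kept:
            return m  # nothing was discarded, so the mean has stopped changing
        kept = survivors
    return mean(kept) if kept else None

# The input set quoted in the message above.
sample = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14,
          60001, 60002, 65534, 108, 109, 110, 111, 112, 113]

print(trimmed_baseline(sample))  # None
```

The mean of the sample is roughly 8467, so the acceptance band is roughly 5927 to 11008. No value in the set falls inside it, and the first pass leaves an empty set.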
    > The normal way to discard outliers is to use something like a t-test
    > (or Wilcoxon test if you can't assume normality of your distribution),
    > and the odds of discarding even a few values should be pretty minimal;
    > the entire point of creating a baseline is that you want to look at
    > what is "normal" for the ENTIRE distribution--not just for some
    > arbitrary subset of the values.  If you absolutely have to reinvent the
    > wheel and come up with your own algorithm, at least use a criterion for
    > outliers that is based on ranked order rather than an absolute offset,
    > such as an interquartile distance test or something like that.
    >
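One common form of a rank-based criterion is an interquartile-range fence. The sketch below is a minimal illustration, not something taken from the thread: the multiplier `k=1.5` is the conventional default rather than a value given in the message, and the name `iqr_outliers` is made up for the example. It relies on `statistics.quantiles` from the Python standard library (Python 3.8 or later).

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Split values into (kept, outliers) using fences at k * IQR."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers

# The same input set quoted earlier in the message.
sample = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14,
          60001, 60002, 65534, 108, 109, 110, 111, 112, 113]

kept, outliers = iqr_outliers(sample)
print(sorted(outliers))  # [60001, 60002, 65534]
```

Because the fences come from the quartiles rather than from the mean, the three very large values cannot drag the acceptance band away from the rest of the data, and the remaining 19 values are kept.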
    > In general, though, your basic summary statistics (mean, std deviation)
    > are sufficient for determining baselines for this type of data.  If you
    > are dealing with data that has a trending component, you can use rolling
    > averages and exponential smoothing; if you have seasonal variability to
    > deal with, then look into things like Holt-Winters Forecasting.
    >
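For trending data, the two simpler techniques mentioned above might look like the following. This is a bare-bones sketch: the function names, the `window` and `alpha` parameters, and the choice to seed the smoothed series with the first observation are all assumptions for illustration. Holt-Winters, which adds trend and seasonal terms, is not shown.

```python
def rolling_mean(values, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def exp_smooth(values, alpha):
    """Simple exponential smoothing: s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = []
    for v in values:
        if not smoothed:
            smoothed.append(float(v))  # seed with the first observation
        else:
            smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed

print(rolling_mean([1, 2, 3, 4], 2))   # [1.5, 2.5, 3.5]
print(exp_smooth([10, 20, 30], 0.5))   # [10.0, 15.0, 22.5]
```

A larger `alpha` tracks recent values more closely, while a smaller one gives a steadier baseline that reacts more slowly to change.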
    > (Jake Brutlag gave a really good presentation on using Holt-Winters to
    > detect outliers in traffic data at the LISA conference in NoLA a few
    > years ago that you can probably find on the web somewhere, including some
    > code to implement it in RRDTool.  <plug>I'll also be leading a half-day
    > tutorial on things like this at LISA in San Diego this October.</plug>)
    >
    > -- Sweth, who really needs to finish up his slides for that tutorial
    > sometime soon.
    >
    > -- 
    > Sweth Chandramouli      Idiopathic Systems Consulting
    > svcat_private      http://www.idiopathic.net/
    > _______________________________________________
    > LogAnalysis mailing list
    > LogAnalysisat_private
    > http://lists.shmoo.com/mailman/listinfo/loganalysis
    >
    
    


