Re: [logs] An Algorithm for Traffic Baselines

From: Sweth Chandramouli (loganalysisat_private)
Date: Wed Aug 20 2003 - 21:14:55 PDT

    On Wednesday, 20 August 2003 at 01:22:31 EDT,
       Wajih-ur-Rehman (Wajih-ur-Rehman <wrehmanat_private>) wrote:
    > I have been developing an algorithm for Traffic Baselines. I have
    > written a paper on it.
    
    Assuming I'm reading your doc right, your basic algorithm is to look at
    a set of data and continually discard any values that aren't within 30%
    of the mean, until the mean stops changing.  Is there any reason to
    believe that the resulting value has ANY statistical significance as a
    baseline, let alone more significance than something like the mean
    or median?
    
    Among other things, when you start discarding values (which is almost
    never statistically justified unless you've got a meaningful model to
    explain the outliers) using an absolute offset, then you open yourself
    up to the possibility of having a null set as your result if the data is
    at all skewed.  Try your algorithm with the input set of, say,
    [ 0 1 2 3 4 5 6 7 8 9 10 12 14 60001 60002 65534 108 109 110 111 112 113 ]
    ; the mean is about 8467, no value falls within 30% of it, and so the
    first iteration returns a "baseline" that is undefined (0/0).
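
    A minimal sketch of that failure, assuming my reading of the doc is
    right (keep only values within 30% of the current mean, repeat until
    nothing changes).  The function name, the tolerance parameter, and the
    choice to return None for an empty set are mine, not the paper's:

```python
from statistics import mean

def trimmed_baseline(values, tolerance=0.30, max_iter=100):
    """Repeatedly drop values farther than tolerance * mean from the mean.

    Hypothetical reconstruction of the algorithm as described above,
    not the paper's own code.  Returns None if every value is discarded.
    """
    current = list(values)
    for _ in range(max_iter):
        if not current:
            return None  # null set: there is nothing left to average
        m = mean(current)
        kept = [v for v in current if abs(v - m) <= tolerance * abs(m)]
        if kept == current:
            return m  # nothing was discarded, so the mean has stopped changing
        current = kept
    return mean(current) if current else None

data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14,
        60001, 60002, 65534, 108, 109, 110, 111, 112, 113]

first_mean = mean(data)  # roughly 8467
survivors = [v for v in data if abs(v - first_mean) <= 0.30 * first_mean]
print(first_mean, survivors, trimmed_baseline(data))
```

    With this input the survivors list is empty after one pass, so there
    is no second mean to compute.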
    
    The normal way to discard outliers is to use something like a t-test
    (or Wilcoxon test if you can't assume normality of your distribution),
    and the odds of discarding even a few values should be pretty minimal;
    the entire point of creating a baseline is that you want to look at
    what is "normal" for the ENTIRE distribution--not just for some
    arbitrary subset of the values.  If you absolutely have to reinvent the
    wheel and come up with your own algorithm, at least use a criterion for
    outliers that is based on ranked order rather than an absolute offset,
    such as an interquartile distance test or something like that.
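
    As a rough illustration of a rank-based criterion, here is one common
    form of the interquartile test: flag anything more than k * IQR outside
    the quartiles.  The function name and the conventional k = 1.5 are my
    assumptions, and statistics.quantiles needs Python 3.8 or later:

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using quartile-based fences.

    A sketch only: k = 1.5 is a convention, not a tuned threshold.
    """
    q1, _median, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers

data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14,
        60001, 60002, 65534, 108, 109, 110, 111, 112, 113]

inliers, outliers = iqr_outliers(data)
print(outliers)
```

    Because the fences come from ranked positions rather than from the
    mean, the three huge values cannot drag the cutoff far enough to empty
    the set; only those three get flagged here.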
    
    In general, though, your basic summary statistics (mean, std deviation)
    are sufficient for determining baselines for this type of data.  If you
    are dealing with data that has a trending component, you can use rolling
    averages and exponential smoothing; if you have seasonal variability to
    deal with, then look into things like Holt-Winters Forecasting.
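
    To make the simpler options concrete, here is a small sketch of a
    mean/std-deviation band and of simple exponential smoothing.  It is NOT
    Holt-Winters (there is no trend or seasonal term), and the names plus
    the defaults k = 3 and alpha = 0.3 are arbitrary choices of mine:

```python
from statistics import mean, stdev

def summary_baseline(values, k=3.0):
    """Return (low, high) as mean -/+ k sample standard deviations."""
    m = mean(values)
    s = stdev(values)
    return m - k * s, m + k * s

def exp_smooth(values, alpha=0.3):
    """Simple exponential smoothing: s[t] = alpha*x[t] + (1-alpha)*s[t-1].

    The first smoothed value is seeded with the first observation.
    """
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed

traffic = [100, 102, 98, 101, 150, 99, 100]  # made-up sample counts
low, high = summary_baseline(traffic)
trend = exp_smooth(traffic)
print(low, high, trend[-1])
```

    A larger alpha tracks recent values more closely; a smaller one gives
    a smoother, slower-moving baseline.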
    
    (Jake Brutlag gave a really good presentation on using Holt-Winters to
    detect outliers in traffic data at the LISA conference in NoLA a few
    years ago that you can probably find on the web somewhere, including some
    code to implement it in RRDTool.  <plug>I'll also be leading a half-day
    tutorial on things like this at LISA in San Diego this October.</plug>)
    
    -- Sweth, who really needs to finish up his slides for that tutorial
    sometime soon.
    
    -- 
    Sweth Chandramouli      Idiopathic Systems Consulting
    svcat_private      http://www.idiopathic.net/