[logs] Re: An Algorithm for Traffic Baselines

From: Wajih-ur-Rehman (wrehmanat_private)
Date: Fri Sep 05 2003 - 01:30:12 PDT

    Hello List,
    
    I have updated this paper. After the updates, I had a discussion with
    Rainer, which I have summarized below. I would really appreciate any
    input on this algorithm, as well as any new ideas about it. The link to
    the paper is:
    
    http://www.monitorware.com/en/workinprogress/Baseline-Algorithm-For-Traffic.asp
    
    Summary Of Discussion with Rainer
    ***************************
    
    After the discussion we feel that there is no need for two different
    modes in the algorithm, as the paper currently describes. We feel that it
    could be accomplished with just the "Learning mode". The algo should
    calculate the new baseline daily using the Learning mode and test whether
    the current day is an outlier. But when someone changes something in
    the system, there could be several consecutive outliers, because what was
    initially considered an outlier is no longer one after the change; it has
    become normal data. The algorithm should adapt itself to this change,
    and this is how it will do so: if the algo sees 7 consecutive outliers on
    the same side (too high or too low), it concludes that it no longer has
    any actual samples, because all the old values are now non-representative.
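
    A minimal sketch of the "7 consecutive outliers on the same side" check,
    in Python. The function name, the fixed baseline, and the 30% tolerance
    band are assumptions made for illustration only; the algorithm described
    above recalculates the baseline daily, and it assumes a positive baseline.

```python
def first_same_side_run(baseline, daily_values, tolerance=0.30, run_length=7):
    """Return the index of the day that completes `run_length` consecutive
    outliers on the same side of the baseline, or None if no such run occurs.

    Assumption: a day is an outlier when it lies more than `tolerance`
    (as a fraction of the baseline) above or below the baseline.
    """
    upper = baseline * (1 + tolerance)
    lower = baseline * (1 - tolerance)
    run = 0        # length of the current same-side outlier run
    last_side = 0  # +1 = too high, -1 = too low, 0 = not an outlier
    for day, value in enumerate(daily_values):
        side = 1 if value > upper else -1 if value < lower else 0
        if side != 0 and side == last_side:
            run += 1
        else:
            # Either a normal day (run resets) or the side flipped (new run).
            run = 1 if side != 0 else 0
        last_side = side
        if run >= run_length:
            return day
    return None
```

    A run is broken both by a normal day and by an outlier on the opposite
    side, so only a sustained shift in one direction triggers the adaptation.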
    
    At this point Rainer said
    
    "....I suggest the following: once we see these 7 outliers, we compute 7
    deltas (between the previous baseline and the outlier, one for each day). We
    then create an average of the deltas. We then use this average delta in the
    same way as a user-provided delta. In essence, I would assign a higher
    weight to an actual sample than to an admin-delta-modified one than to an
    auto-delta-modified one. For simplicity, let's say the auto-delta go in once
    in the computation, the admin-deltas two times and the actual ones three
    times. Actually, I would introduce the need for samples and elaborate/think
    about how many we need. Then, I would qualify samples by "actual data in
    unchanged environment" and "actual data, but in old (now non-representative)
    environment". If the user provides an expected delta (he often can), you
    could upgrade the latter ones to "estimated data in unchanged environment".
    Based on the category of the sample you could assign a weight to them, so
    that they go into a "weighted average computation"....."
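
    The weighting Rainer suggests could be sketched as follows. The weights
    1, 2 and 3 come from the quote above. The category names and function
    names, and the choice to apply the delta as an additive shift to the old
    samples, are assumptions made for illustration.

```python
# Weights from the quote: auto-delta once, admin-delta twice, actual three times.
WEIGHTS = {"auto_delta": 1, "admin_delta": 2, "actual": 3}


def average_delta(previous_baseline, outliers):
    """Average of the per-day deltas between each outlier and the previous baseline."""
    return sum(value - previous_baseline for value in outliers) / len(outliers)


def weighted_baseline(samples):
    """Weighted average over (value, category) pairs, using WEIGHTS per category."""
    total_weight = sum(WEIGHTS[category] for _, category in samples)
    return sum(value * WEIGHTS[category] for value, category in samples) / total_weight


# Hypothetical data: the previous baseline was 100, followed by 7 high outliers.
delta = average_delta(100, [150, 152, 148, 151, 149, 150, 150])

# Assumption: old samples are shifted by the average delta and tagged
# "auto_delta"; one new measurement is tagged "actual".
old_samples = [98, 102, 100]
samples = [(value + delta, "auto_delta") for value in old_samples]
samples.append((150, "actual"))
new_baseline = weighted_baseline(samples)
```

    With these numbers the single actual sample carries as much weight as
    the three auto-delta-modified samples combined.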
    
    Regards
    Wajih-ur-Rehman
    
    
    
    
    > -----Original Message-----
    > From: Sweth Chandramouli [mailto:loganalysisat_private]
    > Sent: Thursday, August 21, 2003 6:15 AM
    > To: LogAnalysisat_private
    > Subject: Re: [logs] An Algorithm for Traffic Baselines
    >
    >
    > On Wednesday, 20 August 2003 at 01:22:31 EDT,
    >    Wajih-ur-Rehman (Wajih-ur-Rehman <wrehmanat_private>) wrote:
    > > I have been developing an algorithm for Traffic Baselines. I have
    > > written a paper on it.
    >
    > Assuming I'm reading your doc right, your basic algorithm is to look at
    > a set of data and continually discard any values that aren't within 30%
    > of the mean, until the mean stops changing.  Is there any reason to
    > believe that the resulting value has ANY statistical significance as a
    > baseline, let alone more significance than something like the mean
    > or median?
    >
    > Among other things, when you start discarding values (which is almost
    > never statistically justified unless you've got a meaningful model to
    > explain the outliers) using an absolute offset, then you open yourself
    > up to the possibility of having a null set as your result if the data is
    > at all skewed.  Try your algorithm with the input set of, say,
    > [ 0 1 2 3 4 5 6 7 8 9 10 12 14 60001 60002 65534 108 109 110 111 112 113 ]
    > ; the first iteration returns a "baseline" of infinity.
    >
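
    The null-set problem in the quoted paragraph above can be reproduced with
    a short Python sketch. It follows Sweth's reading of the algorithm
    (discard values not within 30% of the mean until the mean stops
    changing), not necessarily the paper's exact procedure. The function name
    and the choice to return None for an empty set are assumptions.

```python
def trimmed_baseline(values, tolerance=0.30):
    """Repeatedly drop values outside mean +/- tolerance*mean until nothing
    more is dropped. Returns the final mean, or None if every value is dropped."""
    current = list(values)
    while current:
        mean = sum(current) / len(current)
        # sorted() keeps the band well-formed even if the mean is negative.
        low, high = sorted((mean * (1 - tolerance), mean * (1 + tolerance)))
        kept = [v for v in current if low <= v <= high]
        if len(kept) == len(current):
            return mean  # nothing discarded, so the mean has stopped changing
        current = kept
    return None  # empty set: there is no mean to report


sample = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14,
          60001, 60002, 65534, 108, 109, 110, 111, 112, 113]
result = trimmed_baseline(sample)
```

    For this input the mean is roughly 8467, no value lies within 30% of it,
    and the first pass leaves an empty set.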
    > The normal way to discard outliers is to use something like a t-test
    > (or Wilcoxon test if you can't assume normality of your distribution),
    > and the odds of discarding even a few values should be pretty minimal;
    > the entire point of creating a baseline is that you want to look at
    > what is a "normal" for the ENTIRE distribution--not just for some
    > arbitrary subset of the values.  If you absolutely have to reinvent the
    > wheel and come up with your own algorithm, at least use a criterion for
    > outliers that is based on ranked order rather than an absolute offset,
    > such as an interquartile distance test or something like that.
    >
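
    One possible form of the interquartile test mentioned in the quoted
    paragraph above, sketched in Python. The 1.5 x IQR fences are the common
    convention and an assumption here, as is the function name; the message
    itself does not specify a multiplier.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3,
    in their original order."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # Python 3.8+
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]


sample = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14,
          60001, 60002, 65534, 108, 109, 110, 111, 112, 113]
flagged = iqr_outliers(sample)
```

    Because the fences depend on ranked positions rather than on the mean,
    the three very large values do not drag the cut-off toward themselves,
    and the remaining 19 values are kept.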
    > In general, though, your basic summary statistics (mean, std deviation)
    > are sufficient for determining baselines for this type of data.  If you
    > are dealing with data that has a trending component, you can use rolling
    > averages and exponential smoothing; if you have seasonal variability to
    > deal with, then look into things like Holt-Winters Forecasting.
    >
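
    For reference, simple exponential smoothing (the non-seasonal building
    block that Holt-Winters extends) can be sketched in a few lines of
    Python. The function name and the smoothing factor alpha = 0.5 are
    illustrative assumptions, not values from the thread.

```python
def exponential_smoothing(values, alpha=0.5):
    """Simple exponential smoothing: s[0] = x[0],
    s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    if not values:
        return []
    smoothed = [float(values[0])]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


levels = exponential_smoothing([10, 20, 30], alpha=0.5)
```

    A larger alpha follows recent traffic more closely; a smaller alpha
    gives a steadier baseline that reacts more slowly to change.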
    > (Jake Brutlag gave a really good presentation on using Holt-Winters to
    > detect outliers in traffic data at the LISA conference in NoLA a few
    > years ago that you can probably find on the web somewhere, including some
    > code to implement it in RRDTool.  <plug>I'll also be leading a half-day
    > tutorial on things like this at LISA in San Diego this October.</plug>)
    >
    > -- Sweth, who really needs to finish up his slides for that tutorial
    > sometime soon.
    >
    > -- 
    > Sweth Chandramouli      Idiopathic Systems Consulting
    > svcat_private      http://www.idiopathic.net/
    > _______________________________________________
    > LogAnalysis mailing list
    > LogAnalysisat_private
    > http://lists.shmoo.com/mailman/listinfo/loganalysis
    >
    
    


