>Much of the discussion to date has been about analysis (visual or otherwise)
>of the 'raw' data, i.e. without adding anything to the data. There has been
>some talk of statistical processing but there is scope for further
>processing, for example 'clustering' of events to aid in anomaly detection
>or modeling of data to determine which attributes in the data are most
>significant in determining another attribute.

Two comments:

- The drawback to clustering is that it is computationally intensive. In my
(perhaps biased) experience[0], the only clustering algorithms that
consistently yield interesting results with intrusion/anomaly data are
density-based clustering algorithms, most of which are of O(n^2) complexity
in a vanilla formulation, or O(n*log(n)) with some optimisations (i.e.,
indexing or presorting of the data, which may or may not be feasible
depending on the problem space[1]).

- Factor analysis doesn't appear to work. Well, except for a few very
trivial cases (i.e., noisy, sequential portscans and suchlike). If there's
some nontrivial class of events which you've found amenable to factor
analysis, I'd be delighted to hear about it (honestly).

Both of these (and particularly the former) should be read as coming from
someone who does log analysis primarily for purposes of intrusion detection.
I'd be entirely willing to believe there are other subsets of `pure' log
analysis for which my comments do not hold true. It is also possible that
these techniques have more general applicability with smaller datasets. But
when you start looking at tens of millions of datapoints, clustering becomes
useful only in the most academic sense[2].

What I'm actually surprised nobody else has mentioned in this context is
visualisation of lexical analysis of log data.
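To make the lexical-distance idea concrete, here is a minimal sketch:
plain Levenshtein edit distance between log lines, with a greedy
threshold-based grouping standing in for a real clustering step. The
sample log lines, the threshold of 10, and the greedy grouping are all
illustrative assumptions, not anything from the thread.

```python
# Sketch only: edit distance between log entries, then naive grouping.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (two-row variant)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cluster_by_distance(lines, threshold):
    """Greedy single pass: attach each line to the first cluster whose
    representative is within `threshold` edits, else start a cluster.
    A real density-based method would replace this."""
    clusters = []
    for line in lines:
        for rep, members in clusters:
            if levenshtein(line, rep) <= threshold:
                members.append(line)
                break
        else:
            clusters.append((line, [line]))
    return [members for _, members in clusters]

# Hypothetical sample entries for illustration.
logs = [
    "sshd[101]: Failed password for root from 10.0.0.5",
    "sshd[102]: Failed password for root from 10.0.0.9",
    "sshd[103]: Accepted publickey for alice from 10.0.1.2",
]
groups = cluster_by_distance(logs, threshold=10)
```

With these inputs the two failed-password lines differ by only two edits
and group together, while the accepted-publickey line stands alone.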
I haven't actually done much with this so far, but what I've been fiddling
with lately has been computing lexical distances between log entries of
known values (known good or known bad) and using clustering techniques on
the result. Everything I've done along these lines so far is definitely in
the `technically ornate toy' category, but it seems like something that
-somebody- must've done more work on[3].

I'm also fiddling around with visualisation of state transitions within the
formal grammar I've alluded to elsewhere in this thread---but since I seem
to be the only mad scientist on that particular hobby horse, I'm not
expecting to find lots of other analysts with experiences to share.

-spb

-----
0 The hustler(1) widget I mentioned earlier in this thread in fact started
out as a hack to allow me to visually verify the results of some clustering
code I was working on at the time. It (and the ability to generate phase
space plots) is not terribly well documented at this point, but the code's
there.

1 Most `cookbook' clustering models involve two or three variables, in
which case presorting or indexing one or two variables often works out to
be a Big Win. If you're looking at twenty-odd variables, attempting to
order the data such that it will speed clustering may be a waste of time
(indeed, unless some of the variables are covariant or fixed with respect
to each other, there may be no provably optimal ordering). There are almost
certainly simplifying assumptions which can be made to reduce the work
required, but these will be purely empirical---and therefore subject to
interpretation as to their validity.

2 Caveat: I've actually had some luck with using clustering methods to
develop models for baseline behaviour: select variables which are
time-independent; cluster pre-evaluated data in batch mode; map the
resulting clusters onto the unit sphere; record the transformations.
Then when new packets arrive, perform the same transformations on the new
data and see if the resulting point lies within the unit sphere. The
problems/assumptions of this model are beyond the scope of this footnote,
but they're definitely there. The method does appear to work (for some
sufficiently broad definition of `work') for rudimentary anomaly detection,
however.

3 There seems to be a lot of work out there on heuristic rule-generation
algorithms in this context, but I've yet to find an application of this (to
log analysis or intrusion detection) that hasn't, upon careful reflexion,
turned out to be an elaborate way of rewriting your existing signatures.
Again, I'd certainly -zealously- welcome evidence to the contrary.

_______________________________________________
LogAnalysis mailing list
LogAnalysis@private
http://lists.shmoo.com/mailman/listinfo/loganalysis
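One way to read the baseline recipe in footnote 2 as code, purely as a
sketch: here each cluster's recorded "transformation" is reduced to a
centroid and a radius (distance to the farthest member), so that each
cluster maps onto the unit sphere; a new observation counts as baseline if
any cluster's transformation lands it inside that sphere. The
centroid-and-radius normalisation, the names `fit_transforms` and
`in_baseline`, and the toy data are all assumptions for illustration; the
original method may well have used richer transformations, and the
clustering step itself is taken as already done.

```python
import math

def fit_transforms(clusters):
    """clusters: list of lists of feature vectors (tuples of floats).
    For each cluster, record (centroid, radius): dividing distances by
    the radius maps the cluster's farthest member onto the unit sphere."""
    transforms = []
    for points in clusters:
        dim = len(points[0])
        centroid = tuple(sum(p[d] for p in points) / len(points)
                         for d in range(dim))
        # Guard against zero radius for single-point clusters.
        radius = max(math.dist(p, centroid) for p in points) or 1.0
        transforms.append((centroid, radius))
    return transforms

def in_baseline(x, transforms):
    """Apply each recorded transformation to x; accept x as baseline if
    its image lies within the unit sphere for any cluster."""
    return any(math.dist(x, centroid) / radius <= 1.0
               for centroid, radius in transforms)

# Toy pre-clustered baseline data (two clusters, two features each).
baseline = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],   # cluster A
    [(10.0, 10.0), (11.0, 10.0)],            # cluster B
]
transforms = fit_transforms(baseline)
```

A point near either cluster then tests as baseline; a point far from both
(for instance midway between them) does not.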
This archive was generated by hypermail 2.1.3 : Mon Aug 23 2004 - 19:11:27 PDT