Re: [logs] Visual Event Analysis WAS: most popular reports...?

From: Stephen P. Berry (spb@private)
Date: Mon Aug 23 2004 - 18:42:33 PDT



>Much of the discussion to date has been about analysis (visual or otherwise)
>of the 'raw' data, i.e., without adding anything to the data.  There has
>been some talk of statistical processing, but there is scope for further
>processing, for example 'clustering' of events to aid in anomaly detection,
>or modeling of data to determine which attributes in the data are most
>significant in determining another attribute.

Two comments:

	-The drawback to clustering is that it is computationally
	 intensive.  In my (perhaps biased) experience[0], the only
	 clustering algorithms that consistently yield interesting
	 results with intrusion/anomaly data are density-based
	 clustering algorithms, most of which are of O(n^2) complexity
	 in a vanilla formulation or O(n*log(n)) with some optimisations
	 (i.e., indexing or presorting of the data, which may or may
	 not be feasible depending on the problem space[1]).  A sketch
	 of the vanilla formulation appears below.
	-Factor analysis doesn't appear to work.  Well, except for a
	 few very trivial cases (e.g., noisy, sequential portscans and
	 suchlike).  If there's some nontrivial class of events which
	 you've found amenable to factor analysis, I'd be delighted to
	 hear about it (honestly).

Both of these (and particularly the former) should be read as coming from
someone who does log analysis primarily for purposes of intrusion detection.
I'd be entirely willing to believe there are other subsets of `pure'
log analysis for which my comments do not hold true.  It is also possible
that these techniques have more general applicability with smaller datasets.
But when you start looking at tens of millions of datapoints, clustering
becomes useful only in the most academic sense[2].
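
For concreteness, here's the sort of vanilla density-based formulation
I'm talking about, reduced to a minimal sketch (DBSCAN-ish, in Python;
feature extraction from the raw logs is assumed, and this is
illustrative rather than something I'd point at tens of millions of
datapoints):

# Vanilla density-based clustering.  `points' is a list of feature
# vectors, `dist' any metric on them; a label of -1 means noise.
def dbscan(points, eps, min_pts, dist):
    NOISE = -1
    labels = [None] * len(points)          # None == not yet visited
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        # Linear scan per query point:  this is where the O(n^2)
        # comes from; indexing/presorting (footnote 1) shrinks it.
        neighbours = [j for j in range(len(points))
                      if dist(points[i], points[j]) <= eps]
        if len(neighbours) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        seeds = neighbours[:]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:         # noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            expansion = [k for k in range(len(points))
                         if dist(points[j], points[k]) <= eps]
            if len(expansion) >= min_pts:  # j is itself a core point
                seeds.extend(expansion)
        cluster += 1
    return labels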


What I'm actually surprised nobody else has mentioned in this context is
visualisation of lexical analysis of log data.  I haven't actually done
much with this so far, but what I've been fiddling with lately is
computing lexical distances between log entries of known value (known
good or known bad) and using clustering techniques on the result.
Everything I've done along these lines so far is definitely in the
`technically ornate toy' category, but it seems like something that
-somebody- must've done more work on[3].  A toy sketch of the distance
computation appears below.
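
The obvious place to start is plain edit distance.  Something like the
following (a minimal sketch in Python, working on raw strings; it is
not the code I've actually been fiddling with, and tokenising the
entries first would almost certainly behave better on real logs):

# Classic dynamic-programming Levenshtein distance between two log
# lines, O(len(a) * len(b)) time, O(len(b)) space.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Crudest possible use of the distances:  label an unknown entry with
# the label of its lexically nearest known-good or known-bad line.
# `known' is a list of (log_line, label) pairs.
def nearest_label(entry, known):
    return min(known, key=lambda kv: edit_distance(entry, kv[0]))[1]

The pairwise distance matrix this yields is what gets handed to the
clustering (or the visualisation).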

I'm also fiddling around with visualisation of state transitions within
the formal grammar I've alluded to elsewhere in this thread---but since I
seem to be the only mad scientist on that particular hobby horse, I'm
not expecting to find lots of other analysts with experiences to share.





-spb

-----
0	The hustler(1) widget I mentioned earlier in this thread in fact
	started out as a hack to allow me to visually verify the results of
	some clustering code I was working on at the time.  It and the
	ability to generate phase space plots are not terribly well
	documented at this point, but the code's there.
1	Most `cookbook' clustering models involve two or three variables,
	in which case presorting or indexing one or two variables often
	works out to be a Big Win.  If you're looking at twenty-odd variables,
	attempting to order the data such that it will speed clustering
	may be a waste of time (indeed, unless some of the variables are
	covariant or fixed with respect to each other there may be no
	provably optimal ordering).
	There are almost certainly simplifying assumptions which can be
	made to reduce the work required, but these will be
	purely empirical---and therefore subject to interpretation as
	to their validity.
2	Caveat:  I've actually had some luck with using clustering
	methods to develop models for baseline behaviour:  select variables
	which are time-independent; cluster pre-evaluated data in batch
	mode; map the resulting clusters onto the unit sphere; record
	the transformations.  Then when new packets arrive, perform the
	same transformations on the new data and see if the resulting
	point lies within the unit sphere.  (A bare-bones sketch of this
	appears after these footnotes.)
	The problems/assumptions of this model are beyond the scope of
	this footnote, but they're definitely there.  The method does
	appear to work (for some sufficiently broad definition of `work')
	for rudimentary anomaly detection, however.
3	There seems to be a lot of work out there on heuristic rule-generation
	algorithms in this context, but I've yet to find an application
	of this (to log analysis or intrusion detection) that hasn't, upon
	careful reflexion, turned out to be an elaborate way of rewriting
	your existing signatures.  Again, I'd certainly -zealously- welcome
	evidence to the contrary.
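
Since footnote 2 is a bit telegraphic, here's the bare-bones version of
what I mean (a sketch of one way to read it, in Python; the clustering
that produces the baseline groups, and the selection of time-independent
variables, are assumed to have happened already):

import math

# For each (non-empty) baseline cluster the recorded `transformation'
# is just a translation (to the centroid) and a scaling (by the
# cluster radius), which maps the cluster into the unit sphere.
def fit_spheres(clusters):
    spheres = []
    for pts in clusters:
        dim = len(pts[0])
        centroid = [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
        radius = max(math.dist(p, centroid) for p in pts) or 1.0
        spheres.append((centroid, radius))
    return spheres

# A new point counts as baseline behaviour if, after applying some
# cluster's recorded transformation, it lands inside the unit sphere.
def is_baseline(point, spheres):
    return any(math.dist(point, c) / r <= 1.0 for c, r in spheres)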
