[logs] Re: Signatures

From: Stephen P. Berry (spb@private)
Date: Fri Aug 20 2004 - 14:17:49 PDT


Marcus J. Ranum writes:

>Let me point out that:
>a) you're right
>b) signatures have their place

Your first point is very good.  But I certainly don't disagree with the
second.  Hell, I've even committed the heinous act of unleashing yet another
open source signature-based IDS on the world.

Everybody knows the aphorism, `If your only tool is a hammer, all your
problems look like nails,' right?  Signature-based analysis is the
hammer of the security industry.  Vendors have been pimping quote
solutions unquote that bundle fifteen hundred nearly identical hammers,
often to customers who don't even know what a nail looks like.  In the
end, we discover we've developed a lot of really sexy hammer technology,
and we have plush custom hammers available for hundreds of distinct kinds
of nails.

I'm not saying that I don't want a hammer in my toolbox; I'm just saying
that not all of our problems are nails (I'm pretty sure there are a nonzero
number of screws loose as well).


>True signatureless systems generate results like: "the ratio of SYN to FIN
>packets is 2 standard deviations from the norm for this time of the day."
>They leave it entirely up to you to figure out the significance.

Well, yes and no.  Imagine a system that uses some hideously byzantine
algorithm to profile network traffic.  It coughs out a summary that is
isomorphic to a BPF filter matching all normal traffic (for some
sufficiently specified definition of `normal').  A perl script translates
this into its inverse and runs, oh, tcpdump(8) on passing traffic with the
resulting filter.
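
A back-of-the-envelope sketch of that last step (in Python rather than
perl, and with a made-up `normal' filter standing in for whatever the
hypothetical profiler would emit):

	# Sketch of the invert-and-capture step.  The `normal' filter below
	# is a stand-in for the hypothetical profiler's output; inversion is
	# just wrapping the expression in a BPF not().
	import subprocess

	normal = "tcp and (port 80 or port 25)"    # hypothetical profiler output
	anomalous = "not (%s)" % normal

	# Capture only the traffic that fails to match the `normal' profile.
	subprocess.call(["tcpdump", "-n", anomalous])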

The internal mechanisms are certainly the same as a vanilla signature-based
system, but I think it does some violence to the term if we call the
system as a whole `signature based'.  If we're not actually enumerating
characteristics and associating them with some tag, then we're not writing
signatures---any more than we're writing strings of ANDs and NOTs if
we're coding in C (even though the end result is provably isomorphic to
a collection of ANDs and NOTs).

So, bringing this back to the context of my comments, my complaint is
not that we use signatures (we should---we'd be nuts not to).  My complaint
is that the narrow focus on signature-based methods reinforces a lot
of bad habits in data analysis and collection (in the same way that
firewalls are useful gadgets, but reliance on them results in a lot of
bad network design decisions---or, indeed, networks being built without
being designed at all).

To clarify my point and put this more firmly into the context of log
analysis, here's an example:

Take a log file.  Come up with a list of regexen matching things you
consider interesting that may appear in the log file, and actions you want
taken when they turn up (e.g., send mail or put a blinking red light on the
web page).
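
In its most naive form, something like this (the patterns and the action
are invented for illustration):

	# Naive regex-and-action log watcher.  Each rule pairs a pattern
	# with an action to run on matching lines; both are placeholders.
	import re, sys

	def alert(line):
		print("ALERT:", line.strip())    # stand-in for mail/blinky light

	RULES = [
		(re.compile(r"Failed password for"), alert),
		(re.compile(r"Possible break-in attempt"), alert),
	]

	for line in sys.stdin:
		for pattern, action in RULES:
			if pattern.search(line):
				action(line)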

Conventionally, you would improve this system by:

	-Enumerating more and more interesting things
	-Making more and more elaborate regexen
	-Writing more elaborate response actions

...and so on.  What is the limiting factor going to be?  Without dealing
with all the cases and specifics, my contention is that it's this: the
match-some-characteristic mechanism is inherently a merely lexical
categorisation.  In other words, a match will never convey more information
than the pattern itself contains.  This is (obviously) tautological:  if
you're throwing a flag when you see a log line that matches some regex, all
the flag means is that some log line matched the regex.  We might -assume-
that this corresponds to some underlying condition (the web server just got
hit by Sasser, someone just logged in via ssh(1), or whatever), but that's
not what your signature is actually telling you.

This is because the simple lexical analysis that your signatures perform
cannot convey semantic content.  In other words, they only test for the
presence or absence of certain characteristics in the data; they do not
evaluate the `meaning' of those characteristics or of the data.  This is
why signature-based systems are lousy at enunciating things like risk
analyses or even reporting anomalies---the -underlying structure of the
system- simply lacks the expressive power.  The example I like to use is
that using a signature system to evaluate the meaning of some event is like
trying to figure out what some C code will do by grepping for keywords in
the source.

So instead, imagine that we understand the tags we associate with our
logfile-searching regexen to be lexical tokens---like the contents of a
lex(1) input file.  We can then construct a grammar which expresses the
relationships between these tokens---analogous to a yacc(1) input.  That
gives us -enormously- greater expressive power with which to search for
things, evaluate the presence or absence of interesting conditions, or
(importantly) make statements about the condition of systems or networks.
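
A toy illustration, hand-rolled in Python (the token names, patterns, and
threshold are all invented; the real thing would be an actual lex/yacc
grammar): the regexen do nothing but tokenise, and a single `production'
is evaluated over the token stream:

	# The regexen act purely as a tokeniser (the lex stage); the loop
	# below stands in for one grammar production (the yacc stage):
	#	break_in : SSH_FAIL SSH_FAIL SSH_FAIL+ SSH_OK
	# Token names and patterns are invented for illustration.
	import re, sys

	TOKENS = [
		("SSH_FAIL", re.compile(r"Failed password for (\S+)")),
		("SSH_OK",   re.compile(r"Accepted password for (\S+)")),
	]

	def tokenise(lines):
		for line in lines:
			for name, pattern in TOKENS:
				m = pattern.search(line)
				if m:
					yield name, m.group(1)

	failures = {}
	for token, user in tokenise(sys.stdin):
		if token == "SSH_FAIL":
			failures[user] = failures.get(user, 0) + 1
		elif token == "SSH_OK":
			if failures.get(user, 0) >= 3:
				print("possible brute-force success:", user)
			failures[user] = 0

Neither regex is, by itself, a signature for a break-in; the interesting
statement only exists at the level of the grammar.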

Note that this is -not- merely a system for aggregating conditions (i.e.,
reporting that three regexen rather than one have been matched)---although
it certainly encompasses that sort of thing.  If we were actually to use
lex(1) (or flex(1)) and yacc(1) to construct our grammar (which is what
I've been doing), then the resulting system has exactly the expressive
power of an LALR grammar.

Now, at the heart of the system, we're still using signatures---we're
still playing match-the-regex.  But this can, I think, be meaningfully
called a non-signature-based system.  Or at least it is only signature
based in the sense that, say, C is.

There's nothing magic about this formulation, mind you.  But I do think it
is substantially different from a vanilla signature-matching system, and
from the aggregation/correlation systems I've seen.  The reason I bring it
up is that this model highlights the limitations of the signature model (by
explicitly drawing the parallel to a compiler's lexical analyzer).



>That's 1/2 of the problem!! The OTHER 1/2 of the problem is how to encode
>ignorance (anti-knowledge) into our security systems!!!!!!!
>Nobody has tried this, yet. But what if someone tried to do "artificial
>ignorance" in an IDS: model what everything that's OK looks like and alert
>whenever traffic occurs that doesn't fire an "ignore this" signature. Note
>to readers: I hereby disclose this as prior art so if some idiot patents
>the idea, we can all point to this posting. ;)

I think there's art prior to your prior art.  I suggest this as
the default mode of operation in the shoki documentation, and I know
I wasn't the first one to come up with the idea.  Isn't it in Denning
and Neumann's model for IDES?
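
For the record, the whole trick fits in a few lines (the ignore-patterns
here are invented; a real list would be site-specific and much longer):

	# Artificial ignorance: discard everything matching an
	# ignore-pattern, report whatever is left.  Patterns are
	# placeholders only.
	import re, sys

	IGNORE = [
		re.compile(r"CRON.*session (opened|closed)"),
		re.compile(r"sshd.*Accepted publickey"),
	]

	for line in sys.stdin:
		if not any(p.search(line) for p in IGNORE):
			print(line, end="")    # didn't fire an `ignore this' signature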


>Put another way: it's easier to know who your friends are, than to keep
>track of all your enemies IF and ONLY IF you have fewer friends than
>enemies. ;)

Everything you need to know about information security you can learn
from the Mafia[0].






-spb

-----
0	Well, not -everything-.  But why screw up a perfectly good
	aphorism with a qualification?
