Re: [logs] What "should" be logged? (long)

littejoat_private

tbird> 1) What sort of state changes "should" applications and operating systems
tbird> log in the first place?  --> A standard for programmers
[and big list of possible categories]

I'll add one to that: resource utilization.  I'm thinking of questions
like "how much memory did it take to complete this task" (or how much
system/user time, or whatever).  This is going to be a lot more
application-specific, but I want it for capacity planning.
If I'm going to build a better server for foo, then these sorts of
questions are the ones I need answered.
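
As a rough sketch of this kind of logging (Python, using the Unix-only
resource module; the function name run_and_log and the logger name
foo.resources are made up for illustration), an application could wrap
a task and log what it consumed:

```python
import logging
import resource
import time

# Hypothetical logger name; any application-specific name would do.
log = logging.getLogger("foo.resources")


def run_and_log(task_name, func, *args, **kwargs):
    """Run func and log the wall-clock, user and system time it used."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.monotonic()
    result = func(*args, **kwargs)
    elapsed = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_SELF)
    log.info(
        "task=%s wall=%.3fs user=%.3fs sys=%.3fs maxrss=%d",
        task_name,
        elapsed,
        after.ru_utime - before.ru_utime,
        after.ru_stime - before.ru_stime,
        # Peak resident set size of the whole process so far, not of
        # this task alone; kilobytes on Linux, bytes on macOS.
        after.ru_maxrss,
    )
    return result
```

The logger has to be configured to emit INFO records (for example with
logging.basicConfig(level=logging.INFO)) before anything shows up.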

tbird> - Object access: failed and successful attempts to read files, start or
tbird> stop processes, etc (understanding that most organizations will not need
tbird> or want this level of detail)

I'll expand this one: object processing, meaning what was done to an
object and the outcome of that action.  The message with a given mail
queue ID was passed on to its recipient after passing through the
virus filter; a packet was dropped according to rule n...  I think
that most organizations -would- want this level of detail, at least
for some applications, if only for the pretty graphs they can generate.
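
One possible shape for such a record, sketched in Python: a single
key=value line naming the object, the action taken, and the outcome.
The field names and the object_event helper are assumptions chosen for
illustration, not an existing standard.

```python
def _fmt(value):
    """Render a value, quoting it if it contains a space."""
    text = str(value)
    return f'"{text}"' if " " in text else text


def object_event(obj_type, obj_id, action, outcome, **details):
    """Build one log line describing what was done to an object."""
    fields = {
        "object": obj_type,
        "id": obj_id,
        "action": action,
        "outcome": outcome,
    }
    fields.update(details)  # optional extras, e.g. rule=7
    return " ".join(f"{key}={_fmt(value)}" for key, value in fields.items())
```

A line in this form is easy to split on spaces and "=" later, which is
what makes the pretty graphs cheap to produce.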

tbird> 3) Given a particular operating system and/or system purpose, what are
tbird> (pick your favorite integer) 15 messages that pretty much always mean bad
tbird> news: that the system has been compromised, that a catastrophic failure
tbird> has happened, however we choose to define "bad news" for that "typical"
tbird> environment?  What >>is<< "bad news"?  Do we have sample data?

The problem here is that we can define a very small number of states
in which a machine can be thought of as working properly.  However,
there are many more states in which it is working improperly.  I'm
sure we could come up with 15 great signs for really bad news, but
I would argue that if you see one of those in your log file, you're
already hosed.  What I want is the news 15 minutes prior that tells
me the system has slipped out of optimal state.  Rarely does a
system go from fully functional to critical (except when I get my
rock hammer out) -- it slips, bit by bit, and we should be able to
detect this (we can't always -- go back to question 1 and start
logging the appropriate data).  The new red background on your
website was probably preceded by a number of those innocuous-looking
login failures, possibly from strange locations.  The disk failure
was likely preceded by SCSI bus errors.  And so on.
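
A minimal sketch of catching that kind of slippage, assuming the
individual events (login failures, bus errors) are already being
logged with timestamps: count them in a sliding window and flag when
the count reaches a threshold.  The class name, window, and threshold
below are illustrative choices, not recommended values.

```python
from collections import deque


class SlidingCounter:
    """Count events inside a moving time window."""

    def __init__(self, window_seconds, threshold):
        self.window = window_seconds
        self.threshold = threshold
        self.events = deque()

    def add(self, timestamp):
        """Record an event at timestamp (seconds).

        Returns True when the number of events still inside the
        window has reached the threshold.
        """
        self.events.append(timestamp)
        # Discard events that have aged out of the window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold
```

For example, SlidingCounter(900, 3) returns True on the third login
failure seen within 15 minutes, which is the "news 15 minutes prior"
rather than the news that you're already hosed.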

I'd be willing to crunch more sample log data, if you'd like.  Of
course, we could start the log parsing debate up again as well.  This
is a question of where we're going vs. where we are.  Perhaps my
suggestion would be to have a bunch of sample signatures that we could
pop into swatch or logcheck that would weed out (or in) some of the
most common messages.  Include comments.
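
To make the idea concrete, here is a sketch of a commented signature
list, written in Python rather than in swatch or logcheck syntax.  The
patterns are illustrative assumptions; real message formats vary by
daemon and kernel version and would need checking against actual logs.

```python
import re

# Each entry: (name, compiled pattern, comment explaining why we care).
SIGNATURES = [
    (
        "login-failure",
        re.compile(r"Failed password for .* from (\S+)"),
        "Harmless one at a time; a burst from odd addresses is not.",
    ),
    (
        "scsi-error",
        re.compile(r"scsi.*error", re.IGNORECASE),
        "Bus errors often show up before a disk fails outright.",
    ),
]


def classify(line):
    """Return the names of all signatures matching a log line."""
    return [name for name, pattern, _comment in SIGNATURES
            if pattern.search(line)]
```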

  --rowan

-- 
John "Rowan" Littell
Systems Administrator
Earlham College Computing Services
http://www.earlham.edu/~littejo/