Re: [logs] Best Practices for Application Logging

From: Hal Snyder (halat_private)
Date: Wed Oct 10 2001 - 00:10:11 PDT

Next message: Fred Mobach: "Re: [logs] Best Practices for Application Logging"

Previous message: Eric Fitzgerald: "RE: [logs] Auditing on Win2k Domain Controller"
In reply to: Tina Bird: "[logs] Best Practices for Application Logging"
Next in thread: Fred Mobach: "Re: [logs] Best Practices for Application Logging"
Reply: Fred Mobach: "Re: [logs] Best Practices for Application Logging"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Tina Bird <tbird@precision-guesswork.com> writes:

> Bruce Schneier and I are working on a whitepaper on best practices
> for building an enterprise logging infrastructure (a direct result
> of a couple of the conversations on this list). I've been pondering
> what should go in it. Does any one out there have ideas or
> references about de facto standards in logging for particular
> applications or OSes? We were having a, um, discussion about Oracle
> logging...

Not sure exactly what "building an enterprise logging infrastructure"
is, but here is an offering based on several years' experience with
the data network for a small computer telephony company.

As you will see, I have trouble separating logging from monitoring.

HTH.

1. There are two main kinds of log information.

   A. Management Logging.

   Logs generated by apps during normal operation. May be
   used for billing, customer reports, resource planning, etc.
   May be the primary data source for these, or a secondary source
   used to confirm reports coming from applications.

   B. Operations Logging.

   Logs generated by the monitoring system. May result from polling
   or from triggered notification (traps); the distinction blurs when
   there are multiple layers of information gathering.

2. Work with applications architects and programmers.

   You will get the most out of your log output if you can work with
   your software engineers on what will be produced as each application
   is written. For us, this means high-tier tech ops gets in on design
   meetings from time to time, and sees sneak previews of log output
   from apps under development.

   We do not have a single all-encompassing logging spec that must be
   followed for all resources. We tried that approach but found it led
   to lots of meetings and vague documents but nothing anyone wanted
   to use in real life with real delivery schedules. Instead, we have
   a small set of guidelines. The role of ops during design and
   programming is to offer gentle but persistent reminders of the
   guidelines.

3. Log everything that is relevant, and nothing else.

   Code that formats a log message must include all obvious clues
   as to what is going on. This is just Error Messages 101:

     Examples of bad messages:
       transaction complete
       file not found

     Examples of better messages:
       application xyz001 transaction id 1234567 complete status 001
       service ttsd file /etc/ttsd.conf not found - aborting

   OTOH, it is not the job of a process at OSI Layer 7 to diagnose a
   problem at Layer 3, e.g. it should not go about issuing pings and
   such to put into the log message when a SQL INSERT fails.

   For each software system, decide what the atomic events are,
   and log all of them.
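
   A minimal sketch in Python of the "better message" idea above. The
   helper name format_event and the key=value layout are assumptions
   made for illustration, not part of any particular logging library.

```python
def format_event(source, event, **context):
    """Build a one-line log message that names its source and carries context.

    `source` identifies the application or service; `context` holds the
    identifying clues (ids, paths, status codes) as key=value pairs.
    """
    clues = " ".join(f"{key}={value}" for key, value in context.items())
    return f"{source} {event} {clues}".rstrip()


# A bare message tells the reader nothing about who failed or on what.
bad = "file not found"

# The same event with the obvious clues attached.
better = format_event("ttsd", "file not found - aborting", path="/etc/ttsd.conf")
print(better)
```

   The point is only that the code formatting the message already holds
   the service name and the path, so it costs nothing to include them.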

4. Use existing open protocols and data formats.

   We don't want to reinvent the wheel, nor lock ourselves into a
   single vendor. Output in proprietary formats gets converted
   to ASCII before anything else happens.

   We use a pragmatic mix of syslog, SNMP, and ad hoc protocols to
   move logging information around. If SNMP is available for a
   resource, we use it, often polling and collecting traps on a
   syslog server; if it would take too long to support SNMP, we go
   with something quicker to implement.

   Data format is usually free-form ASCII. We gave up trying to guess
   today what fields are needed in tomorrow's logs.
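
   As an illustration of free-form ASCII carried over syslog, here is a
   hedged Python sketch that builds a bare-bones datagram: a "<PRI>"
   prefix followed by the text, where PRI is facility * 8 + severity as
   in BSD syslog. The function names and the default host and port are
   assumptions, and real senders usually add a timestamp and hostname.

```python
import socket

# Numeric codes follow the BSD syslog convention; only two are listed here.
LOCAL0 = 16      # facility commonly left free for site-local use
LOG_INFO = 6     # severity: informational


def syslog_packet(facility, severity, text):
    """Return a minimal syslog datagram: "<PRI>" followed by free-form ASCII.

    PRI is facility * 8 + severity. Non-ASCII characters are replaced with
    "?" so the receiver only ever sees plain ASCII.
    """
    priority = facility * 8 + severity
    return f"<{priority}>{text}".encode("ascii", errors="replace")


def send_syslog(packet, host="127.0.0.1", port=514):
    """Send one datagram over UDP; host and port are placeholder defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))


packet = syslog_packet(LOCAL0, LOG_INFO,
                       "xyz001 transaction id 1234567 complete status 001")
print(packet)
```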

5. Use logging priorities consistently.

   There are at least three priorities at which to log:

     normal events (LOG_INFO)
       e.g.: transaction complete
     errors with no detected loss of resource (LOG_NOTICE)
       e.g.: invalid account number
     resource unavailable event (LOG_ERR)
       e.g.: server unreachable

   (When using syslog, we don't use all eight priorities.)

   Loss of resource and return of resource are logged at the same
   priority.

   Don't bother putting in lots of knobs for various levels of
   verbosity and content - that's what grep is for!
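
   The three-priority scheme can be sketched in Python as a single
   lookup table. The severity numbers are the standard syslog codes;
   the event-kind names and the priority_for helper are hypothetical.

```python
# Syslog severity codes (lower number = more severe); names mirror syslog.h.
LOG_ERR, LOG_NOTICE, LOG_INFO = 3, 5, 6

# Hypothetical event kinds mapped onto the three priorities in use.
PRIORITY_BY_KIND = {
    "normal": LOG_INFO,             # e.g. transaction complete
    "error": LOG_NOTICE,            # e.g. invalid account number
    "resource_lost": LOG_ERR,       # e.g. server unreachable
    "resource_returned": LOG_ERR,   # recovery logs at the same priority as loss
}


def priority_for(kind):
    """Look up the priority for an event kind; unknown kinds are rejected."""
    try:
        return PRIORITY_BY_KIND[kind]
    except KeyError:
        raise ValueError(f"unknown event kind: {kind!r}") from None
```

   Keeping the table this small is deliberate: there is no verbosity
   setting to configure, and any finer selection happens downstream
   when someone searches the log.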

6. Keep log delivery simple.

   Keep to an absolute minimum the number of steps between the
   system creating log information and the person who needs it. The
   more complex a system is to configure and maintain, the less
   likely it is to be used. Avoid glitz and eye candy.

   The #1 most successful use of logging we have today simply scoops
   up new log content, looks for interesting items, and emails
   selected staff. This is after multiple generations of all sorts of
   more complicated stuff. We still run fancy GUIfied monitoring
   screens, but that is mainly for the visitors. :)
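
   A rough Python sketch of the scoop, filter, and email loop described
   above. It is an assumption-laden illustration rather than the
   system in use here: the INTERESTING pattern, the function names, and
   the byte-offset bookkeeping are all invented for the example.

```python
import re
from email.message import EmailMessage

# Hypothetical pattern for "interesting" lines; tune it per site.
INTERESTING = re.compile(r"not found|unreachable|aborting", re.IGNORECASE)


def read_new_lines(path, offset):
    """Return (lines, new_offset) for content appended after byte `offset`."""
    with open(path, "rb") as log:
        log.seek(offset)
        data = log.read()
    lines = data.decode("ascii", errors="replace").splitlines()
    return lines, offset + len(data)


def build_report(lines, sender, recipients):
    """Return an EmailMessage with the interesting lines, or None if none match."""
    hits = [line for line in lines if INTERESTING.search(line)]
    if not hits:
        return None
    message = EmailMessage()
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    message["Subject"] = f"{len(hits)} interesting log line(s)"
    message.set_content("\n".join(hits) + "\n")
    return message
```

   Saving the offset between runs and handing the message to a mail
   server (for instance with smtplib.SMTP.send_message) are left out.
   The sketch also does not handle a line that is only half written
   when the file is read.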

7. Never trust a system to police itself.

   Every major resource must be monitored constantly for availability,
   or it will go away. But systems need to be probed from the
   outside. What happens if someone trips over the power cord?
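
   One simple form of outside probe is a TCP connection attempt made
   from a different machine. The Python sketch below assumes the
   resource listens on a TCP port; the probe function and its timeout
   default are illustrative choices.

```python
import socket


def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

   Run on the monitored host itself, this check would go silent along
   with the host, which is why it belongs on a separate machine.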


This archive was generated by hypermail 2b30 : Wed Oct 10 2001 - 11:07:37 PDT