Re: [logs] Best Practices for Application Logging

From: Hal Snyder (halat_private)
Date: Wed Oct 10 2001 - 00:10:11 PDT


    Tina Bird <tbird@precision-guesswork.com> writes:
    
    > Bruce Schneier and I are working on a whitepaper on best practices
    > for building an enterprise logging infrastructure (a direct result
    > of a couple of the conversations on this list). I've been pondering
    > what should go in it. Does any one out there have ideas or
    > references about de facto standards in logging for particular
    > applications or OSes? We were having a, um, discussion about Oracle
    > logging...
    
    Not sure exactly what "building an enterprise logging infrastructure"
    is, but here is an offering based on several years' experience with
    the data network for a small computer telephony company.
    
    As you will see, I have trouble separating logging from monitoring.
    
    HTH.
    
    1. There are two main kinds of log information.
    
       A. Management Logging.
    
       Logs generated by apps during normal operation. May be used for
       billing, customer reports, resource planning, etc. May be the
       primary data source for these, or a secondary source used to
       confirm reports coming from applications.
    
       B. Operations Logging.
    
       Logs generated by the monitoring system. May result from polling
       or from triggered notification (traps); the distinction blurs when
       there are multiple layers of information gathering.
    
    2. Work with applications architects and programmers.
    
       You will get the most out of your log output if you can work with
       your software engineers on what will be produced as each application
       is written. For us, this means high-tier tech ops gets in on design
       meetings from time to time, and sees sneak previews of log output
       from apps under development.
    
       We do not have a single all-encompassing logging spec that must be
       followed for all resources. We tried that approach but found it led
       to lots of meetings and vague documents but nothing anyone wanted
       to use in real life with real delivery schedules. Instead, we have
       a small set of guidelines. The role of ops during design and
       programming is to offer gentle but persistent reminders of the
       guidelines.
    
    3. Log everything that is relevant, and nothing else.
    
       Code that formats a log message must include all obvious clues
       as to what is going on. This is just Error Messages 101:
    
         Examples of bad messages:
           transaction complete
           file not found
    
         Examples of better messages:
           application xyz001 transaction id 1234567 complete status 001
           service ttsd file /etc/ttsd.conf not found - aborting
    
       OTOH, it is not the job of a process at OSI Layer 7 to diagnose a
       problem at Layer 3, e.g. it should not go about issuing pings and
       such to put into the log message when a SQL INSERT fails.
    
       For each software system, decide what the atomic events are,
       and log all of them.
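
A minimal sketch of the "better messages" idea above, in Python. The helper name `describe` and its keyword-argument convention are assumptions made for illustration, not something from our systems. It refuses to build a message that carries no identifying clues.

```python
def describe(app, event, **clues):
    """Build a free-form ASCII log line that names the application,
    every identifying clue supplied, and then the event itself."""
    if not clues:
        # A bare "transaction complete" tells the reader nothing.
        raise ValueError("log message needs at least one identifying clue")
    parts = [f"application {app}"]
    # Keyword arguments keep their call order (Python 3.7+).
    parts += [f"{key.replace('_', ' ')} {value}" for key, value in clues.items()]
    parts.append(event)
    return " ".join(parts)


line = describe("xyz001", "complete status 001", transaction_id=1234567)
print(line)
```

The call above reproduces the first "better" message verbatim; calling `describe("xyz001", "complete")` with no clues raises `ValueError` instead of emitting a bad message.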
    
    4. Use existing open protocols and data formats.
    
       We don't want to reinvent the wheel, nor lock ourselves into a
       single vendor. Output in proprietary formats gets converted
       to ASCII before anything else happens.
    
       We use a pragmatic mix of syslog, SNMP, and ad hoc protocols to
       move logging information around. If SNMP is available for a
       resource, we use it, often polling and collecting traps on a
       syslog server; if it would take too long to support SNMP, we go
       with something quicker to implement.
    
       Data format is usually free-form ASCII. We gave up trying to guess
       today what fields are needed in tomorrow's logs.
    
    5. Use logging priorities consistently.
    
       There are at least three priorities at which to log:
    
         normal events (LOG_INFO)
           e.g.: transaction complete
         errors with no detected loss of resource (LOG_NOTICE)
           e.g.: invalid account number
         resource unavailable event (LOG_ERR)
           e.g.: server unreachable
    
       (When using syslog, we don't use all eight priorities.)
    
       Loss of resource and return of resource are logged at the same
       priority.
    
       Don't bother putting in lots of knobs for various levels of
       verbosity and content - that's what grep is for!
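
One way the three-priority scheme might look in Python. The numeric values are the standard syslog severities; the event-class names, the `<name>` prefix, the host name `db01`, and the `grep` helper are all hypothetical, chosen only to show loss and return of a resource landing at the same priority and being picked out by a pattern match rather than a verbosity knob.

```python
import re

# Standard syslog severity numbers (lower number = more severe).
LOG_ERR, LOG_NOTICE, LOG_INFO = 3, 5, 6

# Hypothetical event classes mapped onto the three priorities.
PRIORITY = {
    "normal": LOG_INFO,      # e.g. transaction complete
    "error": LOG_NOTICE,     # e.g. invalid account number
    "resource": LOG_ERR,     # resource lost OR resource returned
}
NAMES = {LOG_ERR: "err", LOG_NOTICE: "notice", LOG_INFO: "info"}


def log_line(kind, message):
    """Prefix a message with the priority name for its event class."""
    return f"<{NAMES[PRIORITY[kind]]}> {message}"


def grep(pattern, lines):
    """Return the lines matching a regular expression, like grep(1)."""
    return [line for line in lines if re.search(pattern, line)]


lines = [
    log_line("normal", "application xyz001 transaction id 1234567 complete status 001"),
    log_line("error", "application xyz001 invalid account number"),
    log_line("resource", "server db01 unreachable"),
    log_line("resource", "server db01 reachable again"),
]

for line in grep(r"^<err>", lines):
    print(line)
```

Filtering on `^<err>` returns both the outage and the recovery, so whoever reads the output sees the whole story for `db01` without a separate verbosity setting.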
    
    6. Keep log delivery simple.
    
       Keep to an absolute minimum the number of steps between the
       system creating log information and the person who needs it. The
       more complex a system is to configure and maintain, the less
       likely it is to be used. Avoid glitz and eye candy.
    
       The #1 most successful use of logging we have today simply scoops
       up new log content, looks for interesting items, and emails
       selected staff. This is after multiple generations of all sorts of
       more complicated stuff. We still run fancy GUIfied monitoring
       screens, but that is mainly for the visitors. :)
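
The scoop, filter, and email loop described above can be sketched roughly as follows. This is not our production script: the pattern list, function names, and addresses are assumptions, and log rotation is not handled. The sketch only builds the message; handing it to a mail server is left to the caller.

```python
import re
from email.message import EmailMessage

# Hypothetical notion of "interesting"; tune to the site's own logs.
INTERESTING = re.compile(r"not found|unreachable|aborting", re.IGNORECASE)


def read_new_lines(path, offset):
    """Return (complete lines appended since byte offset, new offset)."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    # Consume only whole lines; a partial last line waits for the next run.
    end = data.rfind(b"\n") + 1
    text = data[:end].decode("ascii", errors="replace")
    return text.splitlines(), offset + end


def build_report(lines, sender, recipients):
    """Return an email holding the interesting lines, or None if there are none."""
    hits = [line for line in lines if INTERESTING.search(line)]
    if not hits:
        return None
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"{len(hits)} interesting log line(s)"
    msg.set_content("\n".join(hits) + "\n")
    return msg
```

A cron job could call `read_new_lines`, save the returned offset for the next run, and pass any non-None result of `build_report` to `smtplib.SMTP("localhost").send_message(msg)`, assuming a local mail relay is listening.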
    
    7. Never trust a system to police itself.
    
       Every major resource must be monitored constantly for availability,
       or it will go away. But systems need to be probed from the
       outside, because a dead system cannot report its own failure.
       What happens if someone trips over the power cord?
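
A minimal outside probe, assuming the resource exposes a TCP port and that a successful connect is a good enough sign of life (often it is not, and a real check would speak the service's protocol). The function name `probe` and the default timeout are illustrative choices. It is meant to run on a different machine from the one being watched.

```python
import socket


def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

The monitoring host would call `probe` on a schedule and log both the loss and the return of the resource, so a tripped power cord shows up even though the victim logged nothing.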
    
    


