[ISN] Tracking the Blackout bug

From: InfoSec News (isn@private)
Date: Fri Apr 09 2004 - 01:09:05 PDT


    http://www.theregister.co.uk/2004/04/08/blackout_bug_report/
    
    By Kevin Poulsen
    SecurityFocus
    8th April 2004
    
    A number of factors and failings came together to make the August 14th 
    northeastern blackout the worst outage in North American history. One 
    of them was buried in a massive piece of software compiled from four 
    million lines of C code and running on an energy management computer 
    in Ohio.
    
    To nobody's surprise, the final report on the blackout released by a 
    US-Canadian task force Monday puts most of the blame for the outage on 
    Ohio-based FirstEnergy Corp., faulting poor communications, inadequate 
    training, and the company's failure to trim back trees encroaching on 
    high-voltage power lines. But over a dozen of the task force's 46 
    recommendations for preventing future outages across North America are 
    focused squarely on cyberspace.
    
    That may have something to do with the timing of the blackout, which 
    came three days after the relentless Blaster worm began wreaking havoc 
    around the Internet - a coincidence that prompted speculation at the 
    time that the worm, or the traffic it was generating in its efforts to 
    spread, might have triggered or exacerbated the event. When US and 
    Canadian authorities assembled their investigative teams, they 
    included a computer security contingent tasked with looking 
    specifically at any cybersecurity angle on the outage.
    
    In the end, it turned out that a computer snafu actually played a 
    significant role in the cascading blackout - though it had nothing to 
    do with viruses or cyber terrorists. A silent failure of the alarm 
    function in FirstEnergy's computerized Energy Management System (EMS) 
    is listed in the final report as one of the direct causes of a 
    blackout that eventually cut off electricity to 50 million people in 
    eight states and Canada.
    
    The alarm system failed at the worst possible time: in the early 
    afternoon of August 14th, at the critical moment of the blackout's 
    earliest events. The glitch kept FirstEnergy's control room operators 
    in the dark while three of the company's high voltage lines sagged 
    into unkempt trees and "tripped" off. Because the computerized alarm 
    failed silently, control room operators didn't know they were relying 
    on outdated information; trusting their systems, they even discounted 
    phone calls warning them about worsening conditions on their grid, 
    according to the blackout report.
    
    "Without a functioning alarm system, the [FirstEnergy] control area 
    operators failed to detect the tripping of electrical facilities 
    essential to maintain the security of their control area," reads the 
    report. "Unaware of the loss of alarms and a limited EMS, they made no 
    alternate arrangements to monitor the system."
    
    With the FirstEnergy control room blind to events, operators failed to 
    take actions that could have prevented the blackout from cascading out 
    of control.
    
    In the aftermath, investigators quickly zeroed in on the Ohio 
    line-tripping as a root cause. But the reason for the alarm failure 
    remained a mystery. Solving that mystery fell squarely on the 
    corporate shoulders of GE Energy, makers of the XA/21 EMS in use at 
    FirstEnergy's control center. According to interviews, half a dozen 
    workers at GE Energy began working feverishly with the utility and 
    with energy consultants from KEMA Inc. to figure out what went wrong.
    
    The XA/21 isn't based on Windows, so it couldn't have been infected by 
    Blaster, but the company didn't immediately rule out the possibility 
    that the worm somehow played a role in the alarm failure. "In the 
    initial stages, nobody really knew what the root cause was," says Mike 
    Unum, manager of commercial solutions at GE Energy. "We spent a 
    considerable amount of time analyzing that, trying to understand if it 
    was a software problem, or if - like some had speculated - something 
    different had happened."
    
    Sometimes working late into the night and the early hours of the 
    morning, the team pored over the approximately one million lines of 
    code that comprise the XA/21's Alarm and Event Processing Routine, 
    written in the C and C++ programming languages. Eventually they were 
    able to reproduce the Ohio alarm crash in GE Energy's Florida 
    laboratory, says Unum. "It took us a considerable amount of time to go 
    in and reconstruct the events." In the end, they had to slow down the 
    system, injecting deliberate delays in the code while feeding alarm 
    inputs to the program. About eight weeks after the blackout, the bug 
    was unmasked as a particularly subtle incarnation of a common 
    programming error called a "race condition," triggered on August 14th 
    by a perfect storm of events and alarm conditions on the equipment 
    being monitored. The bug had a window of opportunity measured in 
    milliseconds.
    
    "There was a couple of processes that were in contention for a common 
    data structure, and through a software coding error in one of the 
    application processes, they were both able to get write access to a 
    data structure at the same time," says Unum. "And that corruption led 
    to the alarm event application getting into an infinite loop and 
    spinning."
    
    
    Testing for Flaws
    
    "This fault was so deeply embedded, it took them weeks of poring 
    through millions of lines of code and data to find it," FirstEnergy 
    spokesman Ralph DiNicola said in February.
    
    After the alarm function crashed in FirstEnergy's control center, 
    unprocessed events began to queue up, and within half an hour the EMS 
    server hosting the alarm process folded under the burden, according to 
    the blackout report. A backup server kicked in, but it also failed. By 
    the time FirstEnergy operators figured out what was going on and 
    restarted the necessary systems, hours had passed, and it was too 
    late.
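
    The knock-on failure can be pictured with a second hypothetical 
    sketch (again, not the XA/21's actual design): while the consumer is 
    wedged in its loop, arriving events accumulate in an unbounded 
    in-memory queue until the host gives out, and a backup running the 
    same code against the same conditions fails the same way.
    
        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>
        
        struct event { struct event *next; char payload[256]; };
        
        static struct event *queue_head = NULL;   /* unbounded in-memory queue */
        static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
        
        static void *alarm_consumer(void *arg)
        {
            for (;;)
                ;                 /* stand-in for the alarm process stuck spinning */
            return NULL;          /* not reached */
        }
        
        int main(void)
        {
            pthread_t c;
            pthread_create(&c, NULL, alarm_consumer, NULL);
        
            for (;;) {                            /* field events keep arriving */
                struct event *e = malloc(sizeof *e);
                if (!e) {                         /* the host folds under the backlog */
                    fprintf(stderr, "out of memory: backlog never drained\n");
                    return 1;
                }
                pthread_mutex_lock(&qlock);
                e->next = queue_head;             /* enqueued, never dequeued */
                queue_head = e;
                pthread_mutex_unlock(&qlock);
            }
        }
    
    The usual defenses are a bounded queue with back-pressure and a 
    watchdog that restarts a consumer that stops making progress; failing 
    over to a backup that runs the same code against the same inputs, as 
    happened on August 14th, simply reproduces the failure.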
    
    This week's blackout report recommends that the U.S. and Canadian 
    governments require all utilities using the XA/21 to check in with GE 
    Energy to ensure "that appropriate actions have been taken to avert 
    any recurrence of the malfunction." GE Energy says that's a moot 
    point: though the flaw has not manifested itself elsewhere, last fall 
    the company gave its customers a patch against the bug, along with 
    installation instructions and a utility to repair any alarm log data 
    corrupted by the glitch. According to Unum, the company sent the 
    package to every XA/21 customer - more than 100 utilities around the 
    world - and offered to help install it, "irrespective of their current 
    support status," he says.
    
    The company did everything it could, says Unum. "We test exhaustively, 
    we test with third parties, and we had in excess of three million 
    online operational hours in which nothing had ever exercised that 
    bug," says Unum. "I'm not sure that more testing would have revealed 
    that. Unfortunately, that's kind of the nature of software... you may 
    never find the problem. I don't think that's unique to control systems 
    or any particular vendor software."
    
    Tom Kropp, manager of the enterprise information security program at 
    the Electric Power Research Institute, an industry think tank, agrees. 
    He says faulty software may always be a part of the electric grid's 
    DNA. "Code is so complex, that there are always going to be some 
    things that, no matter how hard you test, you're not going to catch," 
    he says. "If we see a system that's behaving abnormally well, we 
    should probably be suspicious, rather than assuming that it's behaving 
    abnormally well."
    
    But Peter Neumann, principal scientist at SRI International and 
    moderator of the Risks Digest, says that the root problem is that 
    makers of critical systems aren't availing themselves of a large body 
    of academic research into how to make software bulletproof.
    
    "We keep having these things happen again and again, and we're not 
    learning from our mistakes," says Neumann. "There are many possible 
    problems that can cause massive failures, but they require a certain 
    discipline in the development of software, and in its operation and 
    administration, that we don't seem to find. ... If you go way back to 
    the AT&T collapse of 1990, that was a little software flaw that 
    propagated across the AT&T network. If you go ten years before that 
    you have the ARPAnet collapse.
    
    "Whether it's a race condition, or a bug in a recovery process as in 
    the AT&T case, there's this idea that you can build things that need 
    to be totally robust without really thinking through the design and 
    implementation and all of the things that might go wrong," Neumann 
    says.
    
    Despite the absence of cyber terrorism in the blackout's genesis, the 
    final report includes 13 recommendations focused squarely on 
    protecting critical power-grid systems from intruders. The computer 
    security prescriptions came after task force investigators discovered 
    that the practices of some of the utility companies involved in the 
    blackout created "potential opportunities for cyber system compromise" 
    of EMS computers.
    
    "Indications of procedural and technical IT management vulnerabilities 
    were observed in some facilities, such as unnecessary software 
    services not denied by default, loosely controlled system access and 
    perimeter control, poor patch and configuration management, and poor 
    system security documentation," reads the report.
    
    Among the recommendations, the task force says cyber security 
    standards established by the North American Electric Reliability 
    Council, the industry group responsible for keeping electricity 
    flowing, should be vigorously enforced. Joe Weiss, a control system 
    cyber security consultant at KEMA, and one of the authors of the NERC 
    standards, says that's a good start. "The NERC cyber security 
    standards are very basic standards," says Weiss. "They provide a 
    minimum basis for due diligence."
    
    But so far, it seems software failure has had more of an effect on the 
    power grid than computer intrusion. Nevertheless, both Weiss and 
    EPRI's Kropp believe that the final report is right to place more 
    emphasis on cybersecurity than software reliability. "You don't try to 
    look for something that's going to occur very, very, very 
    infrequently," says Weiss. "Essentially, a blackout like this was 
    something like that. There are other issues that are higher 
    probability that need to be addressed."
    
    
    
    _________________________________________
    ISN mailing list
    Sponsored by: OSVDB.org
    


