[ISN] Twisters, hurricanes, floods (oh my)

From: InfoSec News (isnat_private)
Date: Wed Sep 03 2003 - 22:35:29 PDT

    Forwarded from: William Knowles <wkat_private>
    
    http://www.computerworld.com/securitytopics/security/recovery/story/0,10801,84579,00.html
    
    Story by Matt Villano
    SEPTEMBER 03, 2003 
    CIO.com
     
    The evening of Sunday, May 4, 2003, at Aeneas Internet and Telephone 
    began as any previous Sunday evening had. The Jackson, Tenn.-based 
    company that serves about 10,000 Internet and 2,500 telephone 
    customers was closed for the weekend, awaiting the return of its 17 
    employees the next morning. Just before midnight, however, all hell 
    broke loose. An F-4 category twister touched down just outside of 
    town, then tore through Jackson's downtown area, leveling houses, 
    historical sites and municipal buildings alike. The tornado ripped 
    straight through Aeneas's one-story building, leaving only a pile of 
    rubble. 
    
    Meanwhile, Aeneas CIO and Operations Manager Josh Hart, who'd heard 
    about multiple tornadoes in the area that day, was home, 52 miles away 
    in Martin, Tenn., huddling in his bathroom with his family. As soon as 
    he was able, he flipped on the TV for news footage of the devastation. 
    What he saw looked like "a war zone," bricks and concrete everywhere 
    and piles upon piles of rubble. 
    
    At 2 a.m., with those images in the background, Hart's cell phone 
    rang--it was Aeneas Network Administrator Jason Warren calling from 
    what he likened to Ground Zero to report that everything in Jackson 
    was lost. Another call came in from CEO Jonathan Harlan. 
    
    "I'm listening to [Warren] tell me what it's like, and he says, 'It 
    doesn't even look like there was an office here,'" remembers Hart, 25. 
    "The tornado destroyed our computers, our desks, everything. I 
    couldn't believe what he was telling me." 
    
    Aeneas lost nearly $1 million in hardware and software that night, and 
    an estimated 72 hours of downtime. But just as Aeneas in Virgil's 
    Aeneid endured the worst the gods had to offer, so too did this 
    Aeneas. This one, however, was wise enough to have created a 
    contingency plan--one that minimized the damage and kept the company 
    afloat during its darkest hour. 
    
    The company is not alone. After a nationwide scramble to prepare for 
    high-impact, low-probability events similar to the attacks of Sept. 
    11, CIOs have realized that their organizations are far more 
    likely to succumb to another type of event--one that has a high 
    probability of occurring and, curiously enough, is probably simpler to 
    predict: the weather. For example, in June, while the Atlantic 
    seaboard was bracing for the start of hurricane season, Arizona was 
    busy battling forest fires. And in Harris County, Texas, in 2001, a 
    tropical storm and resulting flood taught one IT executive the 
    importance of flexibility. 
    
    Both Aeneas's Hart and Steven W. Jennings, Harris County's executive 
    director of central technology, share their experiences here in an 
    effort to provide best practices and battle-tested secrets about which 
    preparations work best. According to Carol Kelly, vice president of 
    government strategies for Meta Group, these are lessons from which 
    everyone can learn. "When disaster strikes, you want to be ready with 
    a plan of action and an approach of how to deal," she says. "You might 
    be ready for the next terrorist attack, but if you're not ready for 
    the next nor'easter, your plans won't amount to much." 
    
    
    Big plans for a small company 
    
    Aeneas launched its contingency plan when it was founded in 1996; 
    since then, CIO Hart has enhanced the strategy gradually almost every 
    year. In early 2002, as the ISP neared 10,000 Internet customers, he 
    and his network administrator, Warren, thought up the company's most 
    comprehensive approach yet. While they determined that the likelihood 
    of a terrorist attack on the western Tennessee town of Jackson, 
    population 59,600, was slim to none, they concluded that because of 
    the municipality's location in the central U.S.'s infamous Tornado 
    Alley, the plan should respond to the next most likely cause of 
    disaster--twisters. What ensued was a three-pronged plan that hinged 
    upon colocation, distribution and backups. 
    
    * First, by configuring Border Gateway Protocol (BGP) routing on a 
      high-class circuit shared with an ISP 90 miles down Interstate 40 in 
      Memphis, Aeneas would colocate its IP addresses in real time and 
      reroute data traffic offsite during any local disruption. With this 
      system, servers would automatically reroute Internet service 
      operations the moment a disruption occurred. In theory, at least, 
      that would guarantee continuity of operations across the board. 
    
    * Next, the company distributed its voice traffic dynamically, paving 
      the way to switch its T1 connections from one fiber node in the 
      BellSouth network to another, in the event of a sudden 
      telecommunications infrastructure failure. This system was designed 
      to preserve continuity much like the BGP system. 
     
    * Finally, the company's network administration team engineered 
      applications that stored customer records and other data on tape as 
      well as on backup hard drives. Though the tape and hard drives were 
      stored onsite at the Jackson location, Hart and Warren figured 
      onsite backup was better than none. 
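
    The weakness in the third prong can be checked mechanically. What
    follows is a minimal sketch, not anything Aeneas actually ran: the
    BackupCopy type, the function names and the two inventories are all
    hypothetical, and Memphis is used only as an example of a second
    site. It reports any site whose loss would leave no backup copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    medium: str  # e.g. "tape" or "hard drive"
    site: str    # where the copy physically lives


def copies_surviving(copies, destroyed_site):
    """Return the copies that remain if one site is completely lost."""
    return [c for c in copies if c.site != destroyed_site]


def single_site_risk(copies):
    """Return the sites whose loss would leave no backup copy at all."""
    sites = {c.site for c in copies}
    return {s for s in sites if not copies_surviving(copies, s)}


# Hypothetical inventory modeled on the article: both copies kept in Jackson.
onsite_only = [BackupCopy("tape", "Jackson"),
               BackupCopy("hard drive", "Jackson")]
# The same inventory with one copy moved to a second, assumed site.
with_offsite = [BackupCopy("tape", "Jackson"),
                BackupCopy("hard drive", "Memphis")]

print(single_site_risk(onsite_only))   # {'Jackson'}
print(single_site_risk(with_offsite))  # set()
```

    An empty result means no single location holds every copy, which is
    the property an onsite-only arrangement cannot offer.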
    
    This strategy wasn't put to the test until tornado season this year, 
    when hardware, software and pieces of the local infrastructure were 
    destroyed May 4. Business customers on T1 lines lost their connections 
    as soon as the tornado struck. ISP traffic also went down immediately 
    and took 36 hours to restore. The fiber node switch to recover voice 
    traffic took a bit more time, as Aeneas programmers worked around the 
    clock with technicians from BellSouth to migrate the T1 connections 
    from the old node to the new, finalizing the switch nearly three days 
    after the twister hit. 
    
    "When you have hundreds of T1 lines that need to be moved from one 
    node to the next, there's a lot of reengineering that needs to take 
    place," says Hart. "We thought we were prepared, but I'm not sure we 
    ever considered just how difficult this would be." 
    
    
    Bumps in the disaster recovery road 
    
    Beyond the challenges inherent in rerouting traffic, the remediation 
    effort hit two other snags. The first revolved around colocation; 
    because the colocation arrangement with the Memphis ISP was still 
    being set up at the time of the tornado, the Memphis site didn't yet 
    have sufficient servers. To remedy the situation, Aeneas staff 
    members--and family and friends--drove to Memphis with additional 
    equipment to handle the load. The company had some of this equipment 
    on hand--what it didn't have, Hart and Warren purchased online and had 
    overnighted to their homes. All told, colocation was down for about a 
    day and a half. 
    
    The larger and more formidable of the two setbacks involved the 
    company's tape and hard-drive backups. It was clear from the beginning 
    that most of the company's paper-based customer records had fallen 
    victim to Mother Nature, but four days after the tornado, Hart and 
    Warren discovered that the electronic tape and hard-drive backups had 
    failed as well. Hart finally uncovered the tape and hard drives May 8. 
    When he pulled the tape from the rubble, it was so badly damaged that 
    he hardly recognized it. Hart passed the hard drives on to a number of 
    local data recovery specialists to see if they could retrieve 
    anything. One by one, each came up empty. 
    
    Finally, as a last resort, Hart plucked the hard drives from four 
    different nonfunctioning computers and turned them over to Kroll 
    Ontrack, a data recovery company in Minneapolis. Miraculously, the 
    vendor discovered a recent copy of the customer records database on 
    all four computers and was able to recover all of the customer data 
    and return it to Aeneas, delaying printing of its May bills only 
    minimally. 
    
    
    Large organization, even larger plans 
    
    For an IT organization as small as Aeneas, the tornado presented 
    sizable challenges. But for the IT organization of Harris County, 
    Texas, which services more than 15,000 county employees and nearly 3.5 
    million constituents, the problems presented by Tropical Storm Allison 
    were downright monumental. 
    
    Disaster struck June 6, 2001--the second day of a five-day storm--when 
    atmospheric conditions caused a cloud to linger over the Houston area 
    for nearly six hours, dropping more than 39 inches of rain. By the 
    time the clouds parted, Harris County government had lost five 
    buildings and most of the communications and other hardware and 
    software in them to water damage. The price tag: a whopping $24 
    million. 
    
    Fortunately, though, Executive Director of Central Technology Steve 
    Jennings had prepared for such an event. When Jennings joined county 
    government in 1975, he established continuity planning to address 
    natural disasters, such as flooding and hurricanes. The plan, which he 
    dubbed the Four R strategy, hinges on four incremental steps--review, 
    rewire, relocate and rebuild. 
    
    With this in mind, Jennings attacked the recovery immediately, 
    following his plan like a bible. The morning after the deluge, he and 
    his top advisers met to review assets and assess damages. Next, 
    because Harris County is public and qualifies for federal aid, 
    Jennings called in the Federal Emergency Management Agency (FEMA) to 
    inspect the damage and lend him some disaster recovery expertise. He 
    also brought in NetVersant Solutions to lay new fiber-optic cables. 
    This process took approximately six weeks. In the meantime, Jennings 
    reconvened his advisers, and put together an emergency relocation plan 
    to disperse county employees into available office space on high, dry 
    ground. Three months later, he tapped into the first of several 
    batches of funding from FEMA to start rebuilding, spending millions on 
    treating buildings for water damage. 
    
    Jennings also worked double time to ensure that county communications 
    didn't miss a beat. "We utilized existing remote access facilities 
    that allowed county employees to dial in from home until their new 
    offices were finished," he says. This was done for employees whose 
    jobs were deemed critical to county operations and for those for whom 
    the county couldn't find alternative space. Jennings then mobilized a 
    force of technicians to install high-speed connections at the homes of 
    those employees who needed it most. 
    
    Finally, with the help of the county clerk's office, Jennings 
    activated a cache of 300 Cingular cell phones, which had been reserved 
    to help the blind vote on Election Day, and distributed them on an 
    as-needed basis to county departments. "Those phones are deactivated 
    for 11 months of the year, but they were available and we needed 
    them," he says, noting that network administrators deactivated the 
    phones and retrieved them once they managed to bring each department 
    back online. "Part of recovering from a disaster is making use of 
    everything you can find, and we did just that." When all was said and 
    done, it took the county about a year to return to normal, which, 
    according to Jennings, was pretty good given the scope of the damage. 
    
    
    Lessons learned 
    
    Jennings says the storm confirmed his belief that continuity plans 
    should be flexible and horizontally applicable. Before the flood, 
    Harris County's disaster recovery plan was conceived to respond to 
    potentially any disaster, but it typically addressed single events 
    such as the loss of a building, a network or a system. It was flexible 
    enough, however, that it worked even when the county was faced with 
    recovering multiple facilities. He adds that Harris County government 
    "uses different portions of the plan for total recovery." Today, the 
    Harris County continuity plan incorporates suggestions from employees 
    who were part of the recovery process and lists scenarios for various 
    "disaster combinations" that could occur during the next big 
    storm--such as what to do if both the jail and family court get hit. 
    When that storm does happen, Jennings says he'll respond even faster 
    than he did in 2001. 
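
    Listing "disaster combinations" is a small enumeration problem. The
    sketch below is illustrative only and is not the county's actual
    planning tool: the facility list is hypothetical (only the jail and
    family court are named in the article), as are the function name and
    the cap of two simultaneous losses.

```python
from itertools import combinations

# Hypothetical facility list; only the first two come from the article.
facilities = ["jail", "family court", "data center"]


def disaster_combinations(sites, max_simultaneous):
    """List every set of 1..max_simultaneous sites that could be hit together."""
    scenarios = []
    for k in range(1, max_simultaneous + 1):
        scenarios.extend(combinations(sites, k))
    return scenarios


scenarios = disaster_combinations(facilities, 2)
for scenario in scenarios:
    print(" + ".join(scenario))
```

    With three facilities and at most two hit at once, this yields six
    scenarios. The count grows quickly as facilities are added, which is
    one reason to cap how many simultaneous losses a plan spells out.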
    
    The next time a weather event occurs, Jennings says he'll also have 
    the added benefit of wireless. After the flooding, as Jennings tried 
    to rewire the Harris County jail, he spent $200,000 on Lynx 
    high-definition wireless technology as an interim solution. The 
    technology worked so well that he kept it and now has it on hand to 
    pinch-hit during the next crisis. If, for example, a storm knocks out 
    phone lines in the southeast corner of the county, Harris can set up 
    wireless in hours. In addition, if another rainstorm waterlogs some of 
    the underground fiber optics downtown, Harris can use the technology 
    to provide emergency telephone service to anyone who needs it. 
    
    "Mother Nature never follows a script, especially not the one you 
    wrote," Jennings quips. "As we have more experience recovering from 
    the disasters she wields, we'll have a better sense of which remedies 
    work best." 
    
    At Aeneas, Hart notes that from "now until the end of time," he'll 
    keep an electronic records backup offsite to eliminate the problems he 
    endured in recovering those mission-critical customer files. Planning 
    for offsite backup had begun before the May tornado, and the site is 
    now up and running in Memphis. Hart admits that his error in planning 
    nearly cost Aeneas everything, adding that he'll never make that 
    mistake again. Another misstep Hart says he'd correct is the way he 
    handled the media in the days following the tornado. If he could do it 
    all over again, Hart says, he would get on the phone immediately 
    with newspapers, TV stations and radio outlets to jump-start the 
    company's PR campaign and assuage customer concerns. 
    
    "[Our customers] must have been watching the TV news thinking, 'Man, 
    that's my ISP,' and we're too busy working on restoring systems to 
    think about putting their minds at ease," he says. "Restoring 
    technology after a disaster is important. But rebuilding customer 
    confidence...it doesn't get more important than that." 
    
    Matt Villano is a freelance writer based in Moss Beach, Calif. 
    
    
    
    *==============================================================*
    "Communications without intelligence is noise;  Intelligence
    without communications is irrelevant." Gen Alfred. M. Gray, USMC
    ----------------------------------------------------------------
    C4I.org - Computer Security, & Intelligence - http://www.c4i.org
    ================================================================
    Help C4I.org with a donation: http://www.c4i.org/contribute.html
    *==============================================================*
    
    
    
    
    



    This archive was generated by hypermail 2b30 : Thu Sep 04 2003 - 01:19:34 PDT