[ISN] The trouble with Rover is revealed

From: InfoSec News (isn@private)
Date: Mon Feb 23 2004 - 23:51:42 PST

    http://www.eetimes.com/story/OEG20040220S0046
    
    [While this is not security related, and it tips the nightly batch of
    mail to eight messages, I thought you all would be interested in NASA's
    little hack to get Spirit working again. Definitely click the URL above
    and print this one out for the office bulletin board!  - WK]
    
    
    By Ron Wilson
    EE Times
    February 20, 2004
    
    SAN MATEO, Calif. - When the Mars rover Spirit went dark on Jan. 21, a
    Jet Propulsion Laboratory team undertook to reprogram the craft's
    computer, only to find themselves introducing an unpredictable
    sequence of events.
    
    The trouble with the Mars rover Spirit started much earlier in the
    mission than the day the craft stopped communicating with ground
    controllers.
    
    "It was recognized just after [the June 2003] launch that there were
    some serious shortcomings in the code that had been put into the
    launch load of software," said JPL data management engineer Roger
    Klemm. "The code was reworked, and a complete new memory image was
    uploaded to the spacecraft and installed on the rover shortly after
    launch."
    
    That appeared to fix the problems that had been identified with the
    initial load. But what no one at JPL could have anticipated was that
    the new load also made possible a totally implausible sequence of
    events that would, many months later, silence Spirit.
    
    The Spirit rover has a radiation-hardened R6000 CPU from
    Lockheed-Martin Federal Systems at the heart of the system. The
    processor accesses 120 Mbytes of RAM and 256 Mbytes of flash. Mounted
    in a 6U VME chassis, the processor board also has access to custom
    cards that interface to systems on the rover.
    
    The operating system is Wind River Systems' VxWorks version 5.3.1,
    used with its flash file system extension. In operation, the real-time
    OS and all other executable code are RAM-resident.
    
    The flash memory stores executable images that are loaded into RAM at
    system boot. Separately, about 230 Mbytes are used to implement a
    flash file system that stores "data products," or data files that are
    created by the rover's subsystems and held for transmission to Earth.
    
    Among the data products are the images created by the rover's cameras.
    
    "Part of my responsibility in the data management team is to keep
    track of the data files that are created, transmitted and deleted on
    the rover during the mission," Klemm explained. "We recognized early
    in the planning process that the flash file system had a limited
    capacity for files. It is not just a limitation in the flash itself
    but also in the directory structure."
    
    Klemm explained that as data is collected by Spirit, files are created
    and stored in the flash file system until a communications window
    opens - an opportunity to transmit the data either directly to Earth
    or to one of the two orbiters circling the Red Planet. Then the files
    are transmitted. They are still held in the flash system until
    retrieved and error-corrected on Earth. If data is missing, requests
    are sent for retransmission. If the data is intact, a command is sent
    to delete the received files.
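    
    The life cycle Klemm describes can be sketched as a small state
    machine. This is only an illustration under stated assumptions, not
    JPL's flight software: the class, the state names and the file name
    are all hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    STORED = auto()        # written to flash, waiting for a communications window
    TRANSMITTED = auto()   # sent, but still held in flash until Earth confirms
    DELETED = auto()       # removed after the ground confirmed receipt


class DataProduct:
    """One file held in the flash file system (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.state = State.STORED
        self.sends = 0

    def transmit(self):
        # Sending does not free any flash space; the file stays on board.
        if self.state is State.DELETED:
            raise ValueError(f"{self.name} was already deleted")
        self.sends += 1
        self.state = State.TRANSMITTED

    def ground_report(self, intact):
        # Earth either requests a retransmission or commands a delete.
        if self.state is not State.TRANSMITTED:
            raise ValueError(f"{self.name} has not been transmitted")
        if intact:
            self.state = State.DELETED
        else:
            self.transmit()


image = DataProduct("example_image.dat")   # made-up file name
image.transmit()
image.ground_report(intact=False)   # data missing: retransmission requested
image.ground_report(intact=True)    # data intact: delete command sent
print(image.state.name, image.sends)
```

    The point of the model is that a file occupies a directory entry from
    creation until the delete command arrives, not merely until it is
    transmitted.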
    
    "But there were also directories of files already placed into the file
    system in the launch load," Klemm said. "When we uploaded a new image
    to the rover, we recognized that those files would have to be deleted,
    because they were being replaced by a new set using different
    directories."
    
    Accordingly, on Martian day 15 (or "sol 15") of rover operation, a
    utility was uploaded to the rover to find and delete the old
    directories.
    
    Murphy strikes on Mars
    
    But the transmission that uploaded the utility was a partial failure:  
    Only one of the utility program's two parts was received successfully.  
    The second part was not received, and so in accordance with the
    communications protocol it was scheduled for retransmission on sol 19.
    
    Thus was the fuse lit on a software hand grenade.
    
    The data management team's calculations had not made any provision for
    leftover directories from a previous load still sitting in the flash
    file system.
    
    As Murphy would have it, earlier on sol 19, Spirit attempted to allocate
    more files than the RAM-based directory structure could accommodate.  
    That caused an exception, which caused the task that had attempted the
    allocation to be suspended. That in turn led to a reboot, which
    attempted to mount the flash file system. But the utility software was
    unable to allocate enough memory for the directory structure in RAM,
    causing it to terminate, and so on.
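    
    The cycle described above can be pictured with a toy simulation. It
    is a sketch under loud assumptions: the capacity figure, the boot
    limit and every name below are invented, and the real limit was
    memory for the directory structure in RAM, which is approximated
    here by a simple entry count.

```python
MAX_DIRECTORY_ENTRIES = 1000   # made-up capacity of the RAM directory table
MAX_BOOTS = 5                  # stop the simulation after a few cycles


class AllocationFailure(Exception):
    """Raised when the directory table cannot hold another entry."""


def mount(files_in_flash):
    # Mounting builds an in-RAM directory entry for every file in flash.
    if files_in_flash > MAX_DIRECTORY_ENTRIES:
        raise AllocationFailure("directory table is full")
    return files_in_flash


def boot_until_mounted(files_in_flash):
    """Return (mounted, boots_attempted) for a given flash population."""
    boots = 0
    while boots < MAX_BOOTS:
        boots += 1
        try:
            mount(files_in_flash)
            return True, boots
        except AllocationFailure:
            # Task suspended -> reboot -> the same mount is tried again.
            # Nothing in flash has changed, so the outcome cannot change.
            continue
    return False, boots


print(boot_until_mounted(900))    # fits: mounts on the first boot
print(boot_until_mounted(1900))   # does not fit: every boot fails
```

    Because rebooting never changes what is stored in flash, no number
    of retries can succeed; something outside the loop has to intervene.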
    
    Spirit fell silent, alone in the emptiness of Mars, trying and trying
    to reboot. And its human handlers at JPL seemed at a loss to help,
    unable to diagnose a system they could not see.
    
    Luckily, early in the process of proposing failure scenarios, someone
    remembered the earlier failure to upload the second piece of the
    utility. The scenario was modeled, and it was discovered that a
    VxWorks flag that causes a task to be suspended on a memory allocation
    failure was set in the existing image.
    
    "The irony of it was that the operating system was doing exactly what
    we'd told it to do," Klemm lamented.
    
    Working on the theory that the rover was in fact listening and
    rebooting, the team commanded Spirit to reboot without mounting the
    flash file system.
    
    The team then uploaded a script of low-level file manipulation
    commands that worked directly on the flash memory without mounting the
    volume or building the directory table in RAM. Using the low-level
    commands, about a thousand files and their directories - the leftovers
    from the initial launch load - were removed.
    
    "At that point we mounted the flash file system and ran a checkdisk
    utility," Klemm said. To everyone's enormous relief, the mount was
    successful.
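    
    The recovery order matters: trim the flash contents first, and only
    then mount. A minimal sketch of that ordering follows. The directory
    names, file counts and capacity are hypothetical, and a Python list
    stands in for the raw flash contents.

```python
MAX_DIRECTORY_ENTRIES = 1000   # made-up capacity of the RAM directory table


def can_mount(flash):
    # The mount succeeds only if every file fits in the directory table.
    return len(flash) <= MAX_DIRECTORY_ENTRIES


def delete_leftovers(flash, old_prefix):
    """Remove old entries from the raw listing without mounting anything.

    Returns the trimmed listing and the number of entries removed.
    """
    kept = [path for path in flash if not path.startswith(old_prefix)]
    return kept, len(flash) - len(kept)


# About a thousand leftover files plus the current mission's files.
flash = [f"/old_load/f{i}" for i in range(1000)]
flash += [f"/current/f{i}" for i in range(600)]

mountable_before = can_mount(flash)
flash, removed = delete_leftovers(flash, "/old_load/")
mountable_after = can_mount(flash)
print(mountable_before, removed, mountable_after)
```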
    
    "As we had anticipated, there was some corruption from the event, so
    that was corrected," Klemm added. "In the process of going through the
    contents of the file system, we discovered a system log in which the
    problem was documented, step by step, right up to the allocation
    request that failed."
    
    Klemm said that with the leftover directories and their files removed,
    the system is now functioning well. But just in case, the team is
    working on an exception-handler routine that will more gracefully
    recover from an allocation failure.
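    
    The article does not say how the new handler will behave. One
    possible policy, shown purely as an assumption, is to refuse the new
    file, record the failure and keep the task running instead of
    suspending it. All names below are hypothetical.

```python
class DirectoryFull(Exception):
    """Raised when no directory entry is available for a new file."""


class DirectoryTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def allocate(self, name):
        if len(self.entries) >= self.capacity:
            raise DirectoryFull(name)
        self.entries.append(name)


def store_product(table, name, log):
    """Try to store a file; on failure, log it and carry on."""
    try:
        table.allocate(name)
        return True
    except DirectoryFull:
        # Assumed policy: drop the new product and report the problem,
        # rather than suspending the task that asked for the allocation.
        log.append(f"allocation failed for {name}; product dropped")
        return False


table = DirectoryTable(capacity=2)
log = []
results = [store_product(table, n, log) for n in ("a.dat", "b.dat", "c.dat")]
print(results)
print(log)
```

    The third request fails, but the caller gets a plain False and a log
    entry, so the system stays up and can report the condition to Earth.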
    
    As a postscript, Klemm noted that the other day he heard a car
    commercial on the radio that made reference to the Mars rover,
    comparing, for example, the car's speed over the ground to Spirit's.  
    In the process of touting the car's extended-warranty program, the ad
    noted that the Mars rover came with "interplanetary roadside
    assistance." "That phrase just stuck in my mind," Klemm said. "I love
    it."
    
    
    
    


