FC: Software problems apparently caused Osprey aircraft crash

From: Declan McCullagh (declanat_private)
Date: Wed May 09 2001 - 15:46:49 PDT


    [This is from Peter Neumann's RISKS Digest, Volume 21, Issue 38. Background 
    on Osprey crashes: 
    http://www.dallasnews.com/business/stories/359198_bell_06bus.ART.html
    http://www.newsobserver.com/thursday/news/editorials/Story/430922p-424076c.html 
    --DBM]
    
    ----------------------------------------------------------------------
    
    Date: Mon, 07 May 2001 17:22:56 +0200
    From: "Peter B. Ladkin" <ladkinat_private-bielefeld.de>
    Subject: Partial Causal Analysis of the December 2000 Osprey Accident
    
    Acknowledgement
    
    Credit for the following line of reasoning is due in large part to the New
    Scientist reporter Duncan Graham-Rowe. The formulation here is (obviously)
    mine.
    
    Disclaimer
    
    The interpretation and reasoning presented here are based entirely on the
    publicly available JAG briefing and Blue Ribbon Panel report documents,
    which I shall refer to as JAGB and BRPR respectively. Although written to
    be readable by non-specialists, this note employs methods found in formal
    failure analysis techniques such as WBA (Why-Because Analysis). There may
    be errors in the reasoning and analysis, although I am reasonably confident
    that all such errors are minor. Please bring any errors you notice to my
    immediate attention (email usually suffices). The focus of this note is
    causal analysis, and I have nothing to say here about social phenomena such
    as blame or responsibility.
    
    The Sequence of Events
    
    First, a brief review of what the JAG determined happened in the December
    crash. A hydraulic line ruptured in the left nacelle. This line was part of
    the primary flight control system (PFCS) hydraulics and activates the
    swashplate actuators. There are three such systems, in a partially
    redundant configuration. At the rupture point, the line was common to
    Systems 1 and 3. System 1 was fully disabled; System 3 was isolated in the
    left nacelle but continued to function in the right nacelle; System 2
    worked on both left and right.
    
    This event caused the nacelle transition to stop, and the PFCS reset button
    to illuminate in the cockpit. The aircrew pressed the reset button, as per
    procedure. The PFCS computer software then caused "rapid" pitch and thrust
    changes to be commanded and actuated. The rotors responded differentially in
    time, because the physical actuation authority in each nacelle was
    different: the right nacelle had two working hydraulic systems, and the left
    nacelle only one. The aircrew pressed the reset button "as many as eight to
    10 times [sic]" (JAGB) during the last 20 seconds of flight. The response
    asymmetry and the resulting flight behavior of the aircraft were directly
    responsible for the loss of control (LOC), and the aircraft impacted the
    ground in a LOC condition.
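
    For concreteness, the post-rupture redundancy state just described can be
    sketched as follows. This is an illustration only; the labels are mine and
    are not drawn from any V-22 documentation.

    # A minimal sketch (hypothetical labels) of the hydraulic redundancy state
    # after the rupture: System 1 fully disabled, System 3 isolated in the left
    # nacelle but still working on the right, System 2 working on both sides.
    working_systems = {
        "left nacelle":  {"System 2"},
        "right nacelle": {"System 2", "System 3"},
    }

    # The asymmetric actuation authority is the difference in the number of
    # working hydraulic systems available to each nacelle's swashplate actuators.
    for nacelle, systems in sorted(working_systems.items()):
        print(f"{nacelle}: {len(systems)} working system(s): {sorted(systems)}")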
    
    Proposed Failure Analysis based on JAGB
    
    JAGB says: "The published procedure for responding to [a hydraulic system
    failure as multiply indicated in the cockpit] is to press the primary flight
    control system reset button. When the primary flight control reset button
    was pressed, a software anomaly caused significant pitch and thrust changes
    in both prop rotors. Because of the dual hydraulic failure on the left side,
    the prop rotors were unable to respond at the same rate. This resulted in
    uncommanded aircraft pitch, roll and yaw motions, which eventually stalled
    the aircraft.
    
    "During the last 20 seconds of the flight, the primary flight control reset
    logic was energized as many as eight to 10 times. This, coupled with the
    dual hydraulic failure, caused large prop rotor changes. These changes
    resulted in decreased airspeed and altitude and a left yaw. The crew pressed
    the reset button in their attempt to reset the system and maintain control
    during the emergency."
    
    This clearly says
       (*) that the PFCS software caused the PFCS to command "significant
           pitch and thrust changes" and that this software behavior was
           anomalous;
       (**) that recycling the reset button "eight to 10 times" was a causal
            factor (in the WBA sense) of "large prop rotor changes", which
            were in turn a causal factor (we might wish to infer: the sole
            causal factor) in the LOC.
    
    What kind of "software anomaly" can this have been? There are, according to
    a common taxonomy of complex system failures, only a few possibilities. (I
    shall use the term "rotor excursion" for a one-time pitch and thrust change
    of the sort being talked about here, whatever that may consist in.)
    
    1. A bug: the code did not fulfill the design specification; or
    
    2. The software functioned as designed, but the design was incompatible
        with the overall PFCS control requirements (which implied for this
        situation that the PFCS should not command a rotor excursion); or
    
    3. The software functioned as designed, and the design was compatible
        with the overall PFCS control requirements (that is, the PFCS
        requirements allowed or even implied that a rotor excursion
        should take place in this situation), but rotor excursions
        in this situation were neither "expected" nor required by
            (a) the aircraft designers;
            (b) OPS manual;
            (c) crew; or
    
    4. Although a rotor excursion may have been anticipated by the
        designers, the effects of multiple cycling of the reset, namely
        multiple rotor excursions, were not anticipated by
            (a) aircraft engineers;
            (b) OPS manual;
            (c) crew.
    
    I shall say "software bug" for case 1, "software design error" for case 2,
    "requirements failure" for cases 3 and 4. I concluded in my Risks-21.33 note
    that the JAG had unequivocally indicated that a software bug or a software design
    error had occurred. I will give my precise reasoning forthwith and I believe
    that reasoning is correct. However, after consulting and analysing BRPR, I
    see reason now to doubt the truth of the conclusion.
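
    For concreteness, the taxonomy can be put as a small classification
    function. This is a sketch only; the predicate names are mine, not terms
    from JAGB or BRPR.

    # A minimal sketch (hypothetical names) of the taxonomy above: classify an
    # observed rotor excursion by whether the code met its design specification,
    # whether the design met the overall PFCS requirements, and whether the
    # behavior was anticipated by designers, manual, or crew.
    def classify(meets_design_spec: bool,
                 design_meets_pfcs_requirements: bool,
                 behavior_anticipated: bool) -> str:
        if not meets_design_spec:
            return "case 1: software bug"
        if not design_meets_pfcs_requirements:
            return "case 2: software design error"
        if not behavior_anticipated:
            return "cases 3/4: requirements failure"
        return "no anomaly"

    # Example: behavior that met the design spec, where the design conflicted
    # with the overall PFCS requirements, falls under case 2.
    print(classify(True, False, True))   # -> case 2: software design error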
    
    JAGB Quotes
    
    In the light of these possibilities, the following comments from JAGB are
    relevant. The briefer is Maj. Gen. Berndt, assisted by Lt. Col. Wainwright
    on aircraft technical matters. If unascribed, the quotes are from
    Maj. Gen. Berndt.
    
    A. "an anomaly in the control logic in the computer software control
        laws which caused rapid and significant changes to prop rotor pitch
        each time the primary flight control system reset logic was
        energized."
    
    B. "This anomaly rendered procedures outlined in the [...] NATOPS
        flight manual ineffective."
    
    C. "This mishap was not the result of human factors."
    
    D. [in response to a query: "what does "anomaly" mean exactly?"] "An
        anomaly [here] means that something happened that was not supposed
        to happen, and whether that's a fault of design or structure or
        composition, manufacture or installation, [Maj. Gen. Berndt] do[es]
        not know."
    
    E. [Lt. Col. Wainwright] "The question was what should have happened
        when the PFCS button was reset with the dual hydraulic failure. The
        short answer is absolutely nothing."
    
    F. "The recommendation has been to address the anomaly within the system
        that caused the aircraft to accelerate and decelerate with rapid
        pitch changes over a short period of time."
    
    G. [Wainwright] "The [reset] button is multipurpose. In this
        particular case, it should have done nothing. [...] Because of the
        logic, it lights up. But when you press it, other than putting the
        light out, it shouldn't have really done anything at all."
    
    Interpreting the Quotes: Reasoning to bug or software design error
    
    Quote A clearly says that the anomaly in the software-implemented control
    logic caused rotor excursions. It also says that these excursions happened
    upon each reset. Quote D says that something happened because of the
    software-implemented control logic that was not supposed to happen. One of
    these has the form "A caused B", the other "something with property P
    happened because of A". We may presume that the rotor excursions were the
    sole relevant causal consequence of the anomaly, so the "something" of
    Quote D must be the rotor excursions; I conclude that the rotor
    excursions happened and were not supposed to. It does not yet tell us what
    requirement is referred to by "not supposed to". It gives us a choice
    between 1, 2, and 3(a), but does not distinguish between them.
    
    Quote B entails that one of 3(b) or 4(b) was the case.
    
    Quote C appears to be inconsistent with the other information. (**) implies
    that recycling the button was a causal factor in the LOC. Now, either
    the pilots recycled the button because (i) it was NATOPS manual procedure to
    do so, or because (ii) it was their choice to do so. Either (i) or (ii),
    whichever is the case, is a causal factor in recycling the button, which
    itself is a causal ancestor of the LOC.  But both (i) and (ii) fall within
    the domain commonly termed "human factors". Hence it appears that human
    factors phenomena were causal ancestors of the LOC. That is in direct
    contradiction to Quote C.  The Marines' testimony appears to be
    inconsistent. (That may be because they are not speaking as precisely as I
    am trying to.)
    
    Quote G lets us distinguish somewhat between our choice of 1, 2, or 3(a). It
    says that pressing the button should have done "nothing", that is, it should
    not have caused a rotor excursion. That clearly suggests that the design was
    not compatible with control system requirements, and rules out 3(a). It was
    therefore a software bug or a software design error.
    
    This is the conclusion contained in my Risks-21.33 analysis.
    
    Quote F puts the anomaly "in the system". Design specification is not
    normally considered part of the system by most engineers (although I have
    argued elsewhere that this may be mistaken), so I take this quote to support
    (in the sense of giving extra credence to) the conclusion that there was a
    software bug or software design error.
    
    Software Reparatory Measures
    
    It should now be clear what reparatory measures would be recommended by a
    professional software engineer on the basis of this conclusion.  The control
    software can be regarded as providing a "service", a particular
    functionality, to the PFCS. In the case of a failure of type 1, the behavior
    did not provide the service specified in the software design. In the case of
    a failure of type 2, the software provided the function specified in the
    design, but this was not the service that the rest of the PFCS
    required. General prophylactic measures for these cases are:
    
    M.1) For software bugs. Inspect software against design specs; test
          software against design specs; remove bugs.
    
    M.2) For software design errors. Inspect software design against
          PFCS design; perform integrated PFCS bench tests; remove
          incompatibilities between software behavior and PFCS expected
          behavior.
    
    It is significant, therefore, that neither of these two standard
    prophylactic measures was recommended by the BRPR.
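
    For illustration only, a bench test in the spirit of M.1/M.2 could check
    the requirement stated in Quotes E and G: pressing the PFCS reset with a
    dual hydraulic failure present should command nothing beyond extinguishing
    the light. Every interface name in the sketch below is invented; none is
    taken from the actual PFCS software.

    # A hypothetical spec-based check (all names invented for illustration).
    import unittest

    class FakePFCS:
        """Stand-in for the flight control computer under test."""
        def __init__(self, hydraulic_failures):
            self.hydraulic_failures = hydraulic_failures
            self.commanded_pitch_change = 0.0
            self.commanded_thrust_change = 0.0
            self.reset_light_on = True

        def press_reset(self):
            # Required behavior per Quotes E and G: put the light out,
            # command nothing else.
            self.reset_light_on = False

    class ResetWithDualHydraulicFailure(unittest.TestCase):
        def test_reset_commands_no_rotor_excursion(self):
            pfcs = FakePFCS(hydraulic_failures={"System 1", "System 3 (left)"})
            for _ in range(10):      # repeated resets, as in the mishap
                pfcs.press_reset()
            self.assertEqual(pfcs.commanded_pitch_change, 0.0)
            self.assertEqual(pfcs.commanded_thrust_change, 0.0)
            self.assertFalse(pfcs.reset_light_on)

    if __name__ == "__main__":
        unittest.main()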
    
    Quote from the BRPR
    
    The BRPR section on software is short and worth quoting in full.
    
    [begin quote]
    
    The fly-by-wire flight control system is highly dependent on high-quality
    computer hardware and software. The logic that is the basis for the many
    flight control laws and algorithms must be consistent with the overall
    requirement for FO/FS. This implies that if the aircraft suffers any single
    failure in the electrical, mechanical or hydraulic parts of the system,
    there cannot be any software logic characteristic or failure that would
    result in an unsafe condition. The integrated flight control system must be
    designed, analyzed, and tested with these facts in mind.
    
    Boeing has the lead role in development and testing of the integrated flight
    control system. Their Philadelphia facility has the capability to conduct
    integrated hydraulics, flight loads, and software testing using the Flight
    Control System Integration Rig. Before the mishap, the facility had limited
    pilot-in-the-loop capability. During the downtime, and in response to the
    preliminary mishap investigation results, Boeing has upgraded the
    capabilities of the integrated simulation facilities and is in the process
    of validating a set of off-nominal and failure scenarios that had been
    checked only by analysis during the 1996 validation and verification of the
    flight software.  Boeing also has begun validating all flight control system
    emergency procedures with pilot-in-the-loop simulation runs. In addition,
    the company is holding an integrated flight control system review, with
    participation from "graybeard" experts from within and outside the company
    to review the requirement and the implementation of the requirements in the
    design.
    
    Conclusion: The North Carolina mishap identified limitations in the V-22
    Program's software development and testing. The complexity of the V-22
    flight control system demands a thorough risk analysis capability, including
    a highly integrated software/hardware/pilot-in-the-loop test capability.
    
    Recommendation: Conduct an independent flight control software development
    audit of the V-22 program with an emphasis on integrated system safety.
    
    Recommendation: Conduct a comprehensive flight control software risk
    assessment prior to return to flight.
    
    Recommendation: The V-22 Program should not return to flight until the
    flight procedure and flight control software test cases have been reviewed
    for adequacy and have been evaluated in the integrated test facilities.
    
    [end quote]
    
    Analysis of the BRPR Section on Software Reliability
    
    There is nothing in the commentary or recommendations that implies M.1 or
    M.2. This is remarkable. Instead, the report emphasises integrated system
    safety, and integrated test facilities (in which they appear to emphasise
    pilot-in-the-loop testing).
    
    Standard system-safety and risk assessment involve identification and
    analysis of hazards, including assessing the likelihood of a hazard
    condition, and identifying the likelihood that an accident will result from
    a specific hazard. The hydraulic failure is a specific hazard; they say from
    this hazard "there cannot be any software logic characteristic or failure
    that would result in an unsafe condition".  They do not say "result in an
    accident", or "result in an unsafe condition or accident". This suggests
    that they believe that more factors were involved in the accident than the
    software logic alone.
    
    From the JAGB, we concluded that there was a bug or a software design error
    that caused behavior that resulted, along with multiple resets and the
    asymmetric physical response of the rotors, in the LOC, which itself
    resulted in the accident. An informal WB-Graph of the accident according to
    the analysis of JAGB would contain the following chains of causal
    factors. (To obtain a partial graph from these chains, superimpose
    identically labelled features, e.g., "PFCS behavior" in the first three
    chains. I emphasise: the WB-Graph will be partial.)
    
    C1) HF -> multiple resets -> PFCS behavior -> dynamic behavior of AC ->
          -> LOC -> Accident
    
    C2) Physics of AC design and configuration -> dynamic behavior of AC
    
    C3) PFCS intentional design -> PFCS behavior
    
    C4) PFC anomalies -> PFCS behavior
    
    C5) Software subsystem design anomalies or bugs -> PFC anomalies
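
    As a sketch of the superimposition step, the chains can be assembled into
    the partial graph mechanically. The node labels below are those used in
    the chains ("HF" abbreviating human factors); nothing else is added.

    # Assemble the partial WB-Graph from chains C1-C5: each chain contributes
    # cause -> effect edges, and identically labelled nodes merge automatically.
    chains = [
        ["HF", "multiple resets", "PFCS behavior",
         "dynamic behavior of AC", "LOC", "Accident"],                          # C1
        ["physics of AC design and configuration", "dynamic behavior of AC"],   # C2
        ["PFCS intentional design", "PFCS behavior"],                           # C3
        ["PFC anomalies", "PFCS behavior"],                                     # C4
        ["software subsystem design anomalies or bugs", "PFC anomalies"],       # C5
    ]

    edges = set()
    for chain in chains:
        for cause, effect in zip(chain, chain[1:]):
            edges.add((cause, effect))

    for cause, effect in sorted(edges):
        print(f"{cause} -> {effect}")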
    
    Prophylactic measures are supposed to break the causal chains somewhere.
    Integrated testing, including pilot-in-the-loop testing, enables C1 to be
    broken by modifying the human behavior that led to the multiple resets, for
    example by changing procedures. C2 cannot be broken, because it is
    physically necessary, although the specific behavior of the aircraft can be
    changed by changing its design or configuration. No recommendation is made
    here to do this. Likewise C3 cannot be broken,
    because it represents physical necessity: PFCS design will causally result
    in behavior of the PFCS whenever the PFCS is activated (although of course
    it will not if the PFCS is never activated). PFCS design may be changed, to
    result in different behavior, of course, and this is what I take to be the
    purpose of risk assessment: how to mitigate the consequences of the hazard
    through PFCS change. C4 may be broken by removing anomalies; similarly C5
    may be broken by removing anomalies in the software subsystem.
    
    In a thorough safety audit, all chains that could be broken would be
    considered. But nowhere in the section quoted above does the BRPR speak of
    breaking C4 or C5, because nothing approaching M.1 or M.2 is suggested. The
    BRPR concentrates
    instead on C1 (integrated testing with pilot-in-the-loop) and modifications
    in C3. (We may presume that modifications concerning chain C2 are all
    considered in another section of the report.)
    
    Given the goals of the report and the qualifications of the panel members,
    this selection is comprehensible only on the hypothesis that chains C4 and
    C5 aren't in fact there. But the JAG report implied that they
    were.
    
    Well, are they or aren't they there? JAGB says yes, BRPR implies no.
    
    Suppose they are not there, and that the PFCS functioned as designed and
    expected by its engineers. What would the accident scenario look like,
    consistent with the other information provided by JAGB?  Considering the
    taxonomy 1-4, I think only three possibilities present themselves.
    
    One possibility, suggested by Peter Neumann, is that the basic behavior with
    a single reset was known, but the behavior with multiple resets was not
    considered either in the NATOPS flight manual procedure definition or by
    the designers. The effects of multiple resets were not known. The resulting
    behavior interacted badly with the asymmetric hardware response and, in
    this incident, resulted in the LOC.
    
    The second possibility is that the behavior of even a single reset was not
    considered at the level of the integrated control system. PFCS engineers knew
    that a rotor excursion would be commanded, but the physical characteristics
    of that rotor excursion, especially the asymmetrical rotor response, had not
    been determined.
    
    The third possibility, and I would imagine the least likely, is that the
    potential behavior in this situation was generally anticipated by engineers,
    but not known to the flight crew.
    
    I believe it is known whether one of these possibilities was the
    case. However, it is not inferable from the public information.  It is not
    my purpose to speculate. I shall stop here.
    
    Peter B. Ladkin <http://www.rvs.uni-bielefeld.de>  University of Bielefeld
    
    ------------------------------
    
    
    
    
    -------------------------------------------------------------------------
    POLITECH -- Declan McCullagh's politics and technology mailing list
    You may redistribute this message freely if it remains intact.
    To subscribe, visit http://www.politechbot.com/info/subscribe.html
    This message is archived at http://www.politechbot.com/
    -------------------------------------------------------------------------
    


