System Resilience

From SEBoK

According to the Oxford English Dictionary on Historical Principles (1973), resilience is “the act of rebounding or springing back.” This definition most directly fits materials that return to their original shape after deformation. For human-made systems, the definition can be extended to “the ability of a system to recover from a disruption.” The US government definition for infrastructure systems is the “ability of systems, infrastructures, government, business, communities, and individuals to resist, tolerate, absorb, recover from, prepare for, or adapt to an adverse occurrence that causes harm, destruction, or loss of national significance” (DHS 2010). The concept of creating a resilient human-made system, or resilience engineering, is discussed by Hollnagel, Woods, and Leveson (2006). The principles are elaborated by Jackson (2010).

Overview

The purpose of resilience engineering and architecting is to achieve full or partial recovery of a system following an encounter with a threat that disrupts the functionality of that system. Threats can be natural, such as earthquakes, hurricanes, tornadoes, or tsunamis. Threats can be internal and human-made such as reliability flaws and human error. Threats can be external and human-made, such as terrorist attacks. Often, a single incident is the result of multiple threats, such as a human error committed in the attempt to recover from another threat.

Figure 1 depicts the loss and recovery of the functionality of a system. System types include product systems of a technological nature and enterprise systems such as civil infrastructures. They can be either individual systems or systems of systems. A resilient system possesses four attributes (capacity, flexibility, tolerance, and cohesion) and thirteen top-level design principles through which to achieve these attributes. The four attributes are adapted from Hollnagel, Woods, and Leveson (2006), and the design principles are extracted from Hollnagel et al. and are elaborated based on Jackson (2010).

Figure 1. Disruption Diagram. (SEBoK Original)

The Capacity Attribute

Capacity is the attribute of a system that allows it to withstand a threat. Resilience allows for the possibility that the capacity of a system may be exceeded, forcing the system to rely on the remaining attributes to achieve recovery. The following design principles apply to the capacity attribute:

  • The absorption design principle calls for the system to be designed with adequate margin to withstand a design-level threat.
  • The physical redundancy design principle states that the resilience of a system is enhanced when critical components are physically redundant.
  • The functional redundancy design principle calls for critical functions to be duplicated using different means.
  • The layered defense design principle states that single points of failure should be avoided.

The absorption design principle requires the implementation of traditional specialties, such as Reliability and Safety.
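
As a minimal illustration of the functional redundancy design principle above, the following Python sketch tries several different means of performing the same critical function and uses the first one that succeeds. The function and variable names, and the navigation example with its coordinates, are hypothetical and are not taken from the cited sources.

```python
def perform_with_functional_redundancy(means):
    """Try each (name, action) pair in order and return the first success.

    `means` lists different ways of performing the same critical function.
    """
    failures = []
    for name, action in means:
        try:
            return name, action()
        except Exception as exc:  # a failed means triggers the next one
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all means failed: " + "; ".join(failures))


# Hypothetical example: obtain a position fix by two different means.
def satellite_fix():
    raise ConnectionError("signal lost")  # simulated disruption


def inertial_fix():
    return (48.85, 2.35)  # illustrative coordinates only


used, position = perform_with_functional_redundancy(
    [("satellite", satellite_fix), ("inertial", inertial_fix)]
)
print(used, position)
```

The two means rely on different mechanisms, so a threat that disables one is less likely to disable the other. Physical redundancy, by contrast, would duplicate the same mechanism.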

The Flexibility Attribute

Flexibility is the attribute of a system that allows it to restructure itself in the face of a threat. The following design principles apply to the flexibility attribute:

  • The reorganization design principle says that the system should be able to change its own architecture before, during, or after the encounter with a threat. This design principle is applicable particularly to human systems.
  • The human backup design principle requires that humans be involved to back up automated systems especially when unprecedented threats are involved.
  • The complexity avoidance design principle calls for the minimization of complex elements, such as software and humans, except where they are essential (see human backup design principle).
  • The drift correction design principle states that detected threats or conditions should be corrected before the encounter with the threat. The condition can be either immediate, for example the approach of a threat, or latent within the design or the organization.

The Tolerance Attribute

Tolerance is the attribute of a system that allows it to degrade gracefully following an encounter with a threat. The following design principles apply to the tolerance attribute:

  • The localized capacity design principle states that, when possible, the functionality of a system should be concentrated in individual nodes of the system and stay independent of the other nodes.
  • The loose coupling design principle states that cascading failures in systems should be checked by inserting pauses between the nodes. According to Perrow (1999), humans at these nodes have been found to be the most effective.
  • The neutral state design principle states that systems should be brought into a neutral state before actions are taken.
  • The reparability design principle states that systems should be reparable to bring the system back to full or partial functionality.

Most resilience design principles affect system design processes such as architecting. The reparability design principle affects the design of the sustainment system.
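
The graceful degradation that the tolerance attribute describes can be illustrated with a small Python sketch. It assumes, purely for illustration, that each node contributes an independent share of the system's functionality (the localized capacity design principle), so that losing a node removes only that node's share. The node names and capacity figures are hypothetical.

```python
def remaining_functionality(node_capacity, failed_nodes):
    """Return the fraction of total functionality left after some nodes fail.

    Assumes nodes are independent, so a failure does not cascade.
    """
    total = sum(node_capacity.values())
    if total <= 0:
        raise ValueError("total capacity must be positive")
    surviving = sum(
        capacity
        for node, capacity in node_capacity.items()
        if node not in failed_nodes
    )
    return surviving / total


# Hypothetical example: three independent substations, one lost to a threat.
substations = {"north": 40.0, "south": 35.0, "east": 25.0}
after_disruption = remaining_functionality(substations, {"east"})
print(after_disruption)
```

In a tightly coupled system the failure of one node could also bring down its neighbours, and this simple sum would overstate the remaining functionality.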

The Cohesion Attribute

Cohesion is the attribute of a system that allows it to operate before, during, and after an encounter with a threat. According to Hitchins (2009), cohesion is a basic characteristic of a system. The following global design principle applies to the cohesion attribute:

  • The inter-node interaction design principle requires that nodes (elements) of a system be capable of communicating, cooperating, and collaborating with each other. This design principle also calls for all nodes to understand the intent of all the other nodes, as described by Billings (1991).

The Resilience Process

Implementation of resilience in a system requires the execution of both analytic and holistic processes. In particular, the use of architecting with the associated heuristics is required. Inputs are the desired level of resilience and the characteristics of a threat or disruption. Outputs are the characteristics of the system, particularly the architectural characteristics and the nature of the elements (e.g., hardware, software, or humans).

Artifacts depend on the domain of the system. For technological systems, specification and architectural descriptions will result. For enterprise systems, enterprise plans will result.

Both analytic and holistic methods, including the principles of architecting, are required. Analytic methods determine required capacity. Holistic methods determine required flexibility, tolerance, and cohesion. The only aspect of resilience that is easily measurable is that of capacity. For the attributes of flexibility, tolerance, and cohesion, the measures are either Boolean (yes/no) or qualitative. Finally, as an overall measure of resilience, the four attributes (capacity, flexibility, tolerance, and cohesion) can be weighted to produce an overall resilience score.
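
One way the weighted overall score described above might be computed is sketched below in Python. The numeric scale for qualitative ratings, the weights, and the example ratings are all assumptions made for illustration; the source does not prescribe a particular scale or weighting.

```python
ATTRIBUTES = ("capacity", "flexibility", "tolerance", "cohesion")

# Hypothetical mapping from qualitative ratings to numbers in [0, 1].
QUALITATIVE_SCALE = {"low": 0.0, "medium": 0.5, "high": 1.0}


def to_number(rating):
    """Convert a Boolean, qualitative, or numeric rating to a value in [0, 1]."""
    if isinstance(rating, bool):  # checked first because bool is a kind of int
        return 1.0 if rating else 0.0
    if isinstance(rating, str):
        return QUALITATIVE_SCALE[rating.lower()]
    number = float(rating)
    if not 0.0 <= number <= 1.0:
        raise ValueError(f"rating {rating!r} is outside [0, 1]")
    return number


def resilience_score(ratings, weights):
    """Return the weighted average of the four attribute ratings."""
    total_weight = sum(weights[name] for name in ATTRIBUTES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * to_number(ratings[name]) for name in ATTRIBUTES)
    return weighted / total_weight


# Hypothetical ratings: capacity is measured, the others are Boolean or qualitative.
ratings = {"capacity": 0.8, "flexibility": True, "tolerance": "medium", "cohesion": False}
weights = {"capacity": 0.4, "flexibility": 0.2, "tolerance": 0.2, "cohesion": 0.2}
score = resilience_score(ratings, weights)
print(round(score, 2))
```

Dividing by the total weight keeps the score in the range 0 to 1 even when the weights do not sum to one. The result is only as meaningful as the chosen scale and weights, which would need to be agreed by the stakeholders of the system in question.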

The greatest pitfall is to ignore resilience and fall back on the assumption of protection. The Critical Thinking project (CIPP 2007) lays out the path from protection to resilience. Since resilience depends in large part on holistic analysis, it is a pitfall to resort to reductionist thinking and analysis. Another pitfall is failure to consider the systems of systems philosophy, especially in the analysis of infrastructure systems. Many examples show that systems are more resilient when they employ the cohesion attribute — the New York Power Restoration case study by Mendoca and Wallace (2006, 209-219) is one. The lesson is that every component system in a system of systems must recognize itself as such, and not as an independent system.

Practical Considerations

Resilience is difficult to achieve for infrastructure systems because the nodes (cities, counties, states, and private entities) are reluctant to cooperate with each other. Another barrier to resilience is cost. For example, achieving redundancy in dams and levees can be prohibitively expensive. Other aspects, such as communicating on common frequencies, can be low or moderate cost; even there, cultural barriers have to be overcome for implementation.


References

Works Cited

Billings, C. 1991. Aviation Automation: A Concept and Guidelines. Moffett Field, CA, USA: National Aeronautics and Space Administration (NASA).

CIPP. February 2007. Critical Thinking: Moving from Infrastructure Protection to Infrastructure Resilience, CIP Program Discussion Paper Series. Fairfax, VA, USA: Critical Infrastructure Protection (CIP) Program/School of Law/George Mason University (GMU).

DHS. 2010. Department of Homeland Security Risk Lexicon. Washington, DC, USA: US Department of Homeland Security, Risk Steering Committee. Available: http://www.dhs.gov/xlibrary/assets/dhs-risk-lexicon-2010.pdf.

Hitchins, D. 2009. "What Are The General Principles Applicable to Systems?" INCOSE Insight 12 (4): 59-63.

Hollnagel, E., D. Woods, and N. Leveson (eds). 2006. Resilience Engineering: Concepts and Precepts. Aldershot, UK: Ashgate Publishing Limited.

Jackson, S. 2010. Architecting Resilient Systems: Accident Avoidance and Survival and Recovery from Disruptions. Hoboken, NJ, USA: John Wiley & Sons.

Mendoca, D., and W. Wallace. 2006. "Adaptive Capacity: Electric Power Restoration in New York City Following the 11 September 2001 Attacks." Presented at 2nd Resilience Engineering Symposium, November 8-10, 2006, Juan-les-Pins, France.

Onions, C.T. (ed.). 1973. Oxford English Dictionary on Historical Principles, 3rd ed., s.v. "Resilience". Oxford, UK: Oxford University Press.

Perrow, C. 1999. Normal Accidents. Princeton, NJ, USA: Princeton University Press.

Primary References

DHS. 2010. Department of Homeland Security Risk Lexicon. Washington, DC, USA: US Department of Homeland Security, Risk Steering Committee. Available: http://www.dhs.gov/xlibrary/assets/dhs-risk-lexicon-2010.pdf.

Jackson, S. 2010. Architecting Resilient Systems: Accident Avoidance and Survival and Recovery from Disruptions. Hoboken, NJ, USA: John Wiley & Sons.

Additional References

Jackson, S. 2007. "A Multidisciplinary Framework for Resilience to Disasters and Disruptions." Journal of Design and Process Science. 11: 91-108.

Madni, A., and S. Jackson. 2009. "Towards A Conceptual Framework for Resilience Engineering." IEEE Systems Journal. 3 (2): 181-191.

MITRE. 2011. "Systems Engineering for Mission Assurance." System Engineering Guide. Accessed March 7, 2012. Available: http://www.mitre.org/work/systems_engineering/guide/enterprise_engineering/se_for_mission_assurance/.



SEBoK v. 1.9.1 released 30 September 2018
