Systems Thinking, Safety and Risk Management

A few weeks ago I participated in a roundtable breakfast on Protecting Financial Markets in the Age of the Cloud. The roundtable gathered a group consisting primarily of financial experts, with a sprinkling of technologists like me, to explore how best to regulate financial markets in our fast-changing, highly interconnected, complex, digital world.

Several participants cited the sharp rise in the volume and speed of high-frequency trading over the past decade as an example of a current practice that is potentially increasing volatility and systemic risk in financial markets. They mentioned the flash crash of May 6, 2010 as the kind of incident that is not well understood, and whose recurrence could compromise the stability and general trust needed for financial markets to function smoothly. At least one of the roundtable participants was convinced that another major event, similar to the global financial crisis of 2008, would inevitably take place in the next few years.

Toward the end of the roundtable, the moderator asked us to reflect on the kinds of controls needed to make financial systems more stable and less prone to another major crisis. Can you do so with incremental, evolutionary safeguards, or do you need major new regulations? Can the industry institute the needed safeguards on its own or do governments have to play a bigger role?

These are very complex and important questions for which there are no simple answers. Our answers often reflect where we stand on the conservative-liberal spectrum. My own point of view is shaped by my interest in complex engineering systems, in particular systems whose components are interconnected via digital networks, where software plays a major role in overall design and operations, and which involve people and information as well as technology.

Financial systems are an example of such complex engineering systems. Thus, I am hopeful that much of the research on the design and operation of such systems will shed light on how to better manage financial systems as well.

In the industrial economy of the past century, we learned how to build highly sophisticated physical objects like airplanes, cars, bridges, skyscrapers and microprocessors. Over time we have improved the overall quality and safety of these physical systems through a variety of engineering principles. These include the hierarchical decomposition of the system into relatively independent, functional modules and components; making the individual components and functional modules as reliable as possible; having well defined processes for the assembly and operation of the systems; extensive simulation of the behavior of the components, modules and overall system; and continuous quality measurements and improvements.

This approach to quality and safety works well for systems that are relatively static and deterministic, that is, whose future behavior can be generally calculated because the system will produce the same outputs from a given set of inputs. However, such an approach breaks down for complex systems composed of many different kinds of components, intricate organizations and highly different structures, all highly interconnected and interacting with each other. Such systems exhibit dynamic, unpredictable behaviors as a result of the interactions of their various components.

Most complex, software-intensive systems fall into this category. For the last few decades we have been using software-based digital systems in the design of complex machines, including airplanes and cars. More recently, increasingly sophisticated smart systems are being used in industry after industry, including energy management, transportation and urban planning.

Software has enabled us to design systems with seemingly unlimited capabilities. But, as a result, these systems exhibit a level of complexity that is often beyond our ability to understand and control. The very flexibility of software means that all the interactions between the various components of the system cannot be planned, anticipated or tested. That means that even if all the components are highly reliable, problems can still occur if a rare set of interactions arise that compromise the overall behavior and safety of the system.

Socio-technical systems are not only software intensive, but they involve people as well as technology. Such systems have to deal not only with hardware and software technical issues, but with the even more complex issues involved in human behaviors, business organizations and economies. We are increasingly developing such socio-technical systems in areas like health care, education, government and finance.

How can we best manage the risks involved in the design and operation of such complex, software intensive, socio-technical systems? How do we deal with a system that is working as designed but whose unintended consequences we do not like? How can we make these systems as safe as possible?

The best work I have seen on the subject is by Nancy Leveson, professor of aeronautics and astronautics and engineering systems at MIT. Professor Leveson is a leader in the field of safety engineering in highly complex systems. Earlier this year she published a new book, Engineering a Safer World: Systems Thinking Applied to Safety, which is available for download. In the Preface she writes:

“The world of engineering has experienced a technological revolution, while the basic engineering techniques applied in safety and reliability engineering, such as fault tree analysis (FTA) and failure modes and effects analysis (FMEA), have changed very little. Few systems are built without digital components, which operate very differently than the purely analog systems they replace. At the same time, the complexity of our systems and the world in which they operate has also increased enormously. The old safety engineering techniques, which were based on a much simpler, analog world, are diminishing in their effectiveness as the cause of accidents changes.”

“For twenty years I watched engineers in industry struggling to apply the old techniques to new software-intensive systems – expending much energy and having little success. At the same time, engineers can no longer focus only on technical issues and ignore the social, managerial, and even political factors that impact safety if we are to significantly reduce losses.”

The classic approaches to safety assumed that accidents are caused by component failures or by human error. On this view, making components very reliable, introducing fault-tolerance techniques and planning for component failures will help prevent accidents. Similarly, rewarding safe human behavior and punishing unsafe behavior will eliminate or significantly reduce accidents. These assumptions no longer hold for complex, socio-technical systems.

For example, an underlying assumption in engineering has been that: “Safety is increased by increasing system or component reliability. If components or systems do not fail, then accidents will not occur.” But, writes Leveson, “. . . it's not true. Safety and reliability are different properties. One does not imply the other.”

A complex system, whether physical or organizational, can be reliable but unsafe. As systems become increasingly complex, the interactions between their components dominate the overall design. Accidents can arise from unanticipated interactions among components that are all working according to their individual specifications.
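To make the "reliable but unsafe" idea concrete, here is a minimal Python sketch. It is a toy of my own construction, not a model from Leveson's book and not a model of any real market: the two components, their specifications and all the numbers (stop_loss_seller, market_maker, the trigger of 95, the quantities of 10) are hypothetical. Each component does exactly what its specification says at every step, yet a large enough shock makes their interaction drive the price down without limit.

```python
def stop_loss_seller(price, trigger=95):
    """Hypothetical spec: sell 10 units whenever the price is below the trigger."""
    return 10 if price < trigger else 0


def market_maker(price, prev_price, max_drop=3):
    """Hypothetical spec: absorb up to 10 units of selling, but step aside
    (absorb nothing) if the price fell by more than max_drop in one step."""
    return 0 if prev_price - price > max_drop else 10


def simulate(start_price, shock, steps=5):
    """Apply an initial price shock, then let the two components interact.

    Assumed toy rule: every unit of selling that nobody absorbs lowers the
    price by 1. Returns the list of prices, starting with the pre-shock price.
    """
    prev, price = start_price, start_price - shock
    history = [prev, price]
    for _ in range(steps):
        selling = stop_loss_seller(price)
        buying = market_maker(price, prev)
        unabsorbed = max(selling - buying, 0)
        prev, price = price, max(price - unabsorbed, 0)
        history.append(price)
    return history


small = simulate(100, shock=2)  # price settles right after the shock
large = simulate(100, shock=6)  # both components obey their specs, price keeps falling
print(small)
print(large)
```

Neither function ever fails or deviates from its specification in either run. The hazard exists only at the level of the whole system, which is why testing each component in isolation would not reveal it.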

Conversely, it is possible to have a safe system despite unreliable components if the system is properly designed and operated. This requires taking safety into account at the design stage of the system, rather than expecting the system to be safe because all the individual components are reliable. It requires modeling the overall behavior of the system, so one can identify ways to eliminate or reduce unsafe conditions.
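As a rough illustration of safety designed in at the system level, consider the following Python sketch. It is my own hypothetical example, not one taken from Leveson's book: three unreliable sensors feed a controller that enforces a single system-level constraint, and it assumes that shutting down is always a safe state, which is not true of every real system. The name safe_command and the limit of 100.0 are illustrative only.

```python
from statistics import median


def safe_command(readings, limit=100.0):
    """Decide RUN or SHUTDOWN from redundant, possibly faulty sensor readings.

    A reading of None means the sensor failed outright. Assumptions of this
    sketch: SHUTDOWN is always safe, and at most one sensor misbehaves at a time.
    """
    valid = [r for r in readings if r is not None]
    if len(valid) < 2:
        # Too little information to trust: fall back to the safe state.
        return "SHUTDOWN"
    # The median ignores a single wildly wrong reading among three.
    return "SHUTDOWN" if median(valid) > limit else "RUN"


print(safe_command([90.0, None, 92.0]))   # one sensor dead
print(safe_command([90.0, 500.0, 91.0]))  # one sensor reading nonsense
print(safe_command([None, None, 120.0]))  # two sensors dead
```

The individual sensors here are not reliable at all. What keeps the system safe is the design decision to vote across them and to default to the safe state whenever the evidence is too thin.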

Socio-technical systems generally involve a complex relationship between humans and technology. Increasingly, we can automate the simpler, more repetitive tasks in a system, thus freeing the humans to focus on higher-level decisions, including handling exceptional events. Humans are thus sharing the overall control of the system with the automation technologies. Such an approach will generally lead to significant improvements in productivity and quality. But, it can also lead to new kinds of unanticipated accidents that we often blame on operator error.

Whereas a well functioning, reliable machine is supposed to always perform the same, specified action, humans are different. In fact, decisions that can be precisely specified are generally automated, leaving to the humans those decisions that require judgement. People are supposed to gather all the available information, evaluate the overall physical and social environment, and then make the best possible decision. They will generally follow what they view as an effective procedure, rather than just follow a written, specified practice. But, if that decision results in an accident, it will be classified as operator error because the specified practice was not followed.

Leveson argues that further investigation of accidents originally blamed on operator error often uncovers other contributing factors. It is often very difficult to separate system design flaws from operator error, especially in highly automated systems where the operator is at the mercy of the system design and operational procedures. Moreover, the instrumentation of the system often does not provide the information an operator needs to recover effectively from a hazardous condition during a real-time crisis.

Research into complex engineering systems, a relatively young discipline, will help us develop increasingly capable and safer systems. I believe that this is the kind of research we need to help us better manage the risks associated with our increasingly complex financial systems. The more we understand the causes of systemic financial instabilities, the better we can put in place the proper safeguards to help reduce the likelihood of another major crisis.

One response to “Systems Thinking, Safety and Risk Management”

Hank Bennett

May 3, 2012

Irving – I tend to agree with the roundtable participant who predicted another crash in the next few years. History has shown that periods with a very high discrepancy between the salaries of the highest executives and the “grunts” of the corporations, coupled with absent or poor regulation of financial markets, are periods of enormous market volatility. Such was the case just before the Crash of 1929, which set off the Great Depression, and such was also the case just before the onset of the so-called “Great Recession.” It is STILL the case, and this alone is enough to make a fairly firm prediction that we will have another crash, perhaps much more severe than the recent one!


Irving Wladawsky-Berger
