Complex Systems and How They Can Fail

by in4maniac

Introduction

This blog post is inspired by a meetup I attended a few weeks ago, where the talk was about how complex systems behave and how they can fail (Papers We Love). While listening to the presentation, I couldn’t help but notice how closely our experiences at work mirror the points made in the talk. So I decided to write a post about our experience of dealing with complex systems of this nature.

Complex Systems

Complex systems can be defined as a collection of components, both physical and logical, working together as a unified whole. The individual components are tied together to deliver very powerful solutions, but the resulting system is delicate, and small changes can lead to catastrophic consequences. The term complex system is applied to systems that are difficult to fit into the conventional mechanistic concepts provided by science (Complex Systems). In computer science, systems have moved away from conventional waterfall-style software projects towards more resilient, reactive methodologies that produce solutions able to adapt to rapidly changing environments. With the rise of paradigms such as cloud computing, distributed computing and big data, systems are required to adapt and scale rapidly. These trends have introduced a whole new class of failure modes that have to be addressed when building systems. The introduction of microservices, constantly changing software and elastic hardware has led to genuinely complex systems in data centres.

Classic Examples

A classic example of a complex (mission-critical) system is a nuclear launch bunker. Such a system involves a series of software, hardware, physical and human components that have to work together smoothly. Although most of it is classified, this post gives a gist of the processes involved in handling a nuclear silo. There are strict protocols and multiple measures in place to make the system as fail-proof as possible. Another great example (maybe not so great, given what we have been hearing lately…) is the set of airline protocols that ensure the technical and functional safety of a journey. An aircraft has to be checked several times and signed off by multiple engineers before it is ready to take off, and the communication and navigation channels are rigorously checked. On top of all this, there are strict protocols of conduct for airline staff and passengers to ensure in-flight safety; having to switch off electronic and communication devices during take-off and landing is one of them. After the 9/11 attacks, lockable cockpit doors became essential to prevent hijackings. There is also hardware such as the black box, which records numerous sensor readings and cockpit audio so that any accident can be investigated rigorously. You can refer to the IATA Operational Safety Audit (IOSA) certification for full details.

At work…

Recently, I have been involved in some exciting projects that are critical to our company. One of them was a piece of software that produced customer-facing data, and its success would decide whether the prototyped service continued or not. Although this service cannot be considered mission-critical in comparison to the transportation, healthcare and defence systems around us, it is fair to treat a system of this nature as mission-critical within the context of a business organisation: its performance and delivery directly shape how the business strategy will evolve in the days to come. Working in a Research and Development team, we strive to build a Minimum Viable Product and start customer trials as soon as possible. At this stage, the system will:

  1. be “early adopter ready”
  2. give a sufficient representation of the end product

As we were dealing with terabytes of data, the system was built around Apache Spark. A pre-configured Linux cron job triggers the daily computation process at 2 am every day. The system uses a series of configuration files and Python scripts to do the following (a rough sketch of such a driver script follows the list):

  1. Setup the Spark Cluster in Amazon
  2. Setup the environment
  3. Clone and configure the relevant git repositories
  4. Run a series of spark jobs (a data pipeline)
  5. Do monitoring and reporting
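
To make the moving parts a bit more concrete, here is a minimal sketch of what such a driver script could look like. It is only an illustration under my own assumptions: the helper script names, the repository URL and the Spark job list are hypothetical placeholders, not the actual production code.

    #!/usr/bin/env python
    """Hypothetical daily driver: cluster setup, repo checkout, Spark jobs, reporting."""
    import subprocess
    import sys

    # Hypothetical placeholders -- not the real scripts, repositories or jobs.
    SETUP_SCRIPTS = ["./setup_cluster.sh", "./setup_environment.sh"]
    REPO_URL = "git@github.com:example-org/data-pipeline.git"
    SPARK_JOBS = ["jobs/extract.py", "jobs/transform.py", "jobs/aggregate.py"]

    def run(cmd):
        """Run a shell command and fail loudly if it breaks."""
        print("running:", " ".join(cmd))
        subprocess.check_call(cmd)

    def main():
        # 1 & 2: bring up the Spark cluster and prepare the environment
        for script in SETUP_SCRIPTS:
            run([script])
        # 3: clone and configure the relevant repository
        run(["git", "clone", "--branch", "production", REPO_URL, "pipeline"])
        # 4: run the Spark jobs one after another (a simple data pipeline)
        for job in SPARK_JOBS:
            run(["spark-submit", "--master", "yarn", "pipeline/" + job])
        # 5: monitoring and reporting would go here (push metrics, send an email, ...)
        print("pipeline finished successfully")

    if __name__ == "__main__":
        try:
            main()
        except subprocess.CalledProcessError as err:
            print("pipeline failed:", err, file=sys.stderr)
            sys.exit(1)

In our case, a crontab entry pointing at a script of this kind is what fires at 2 am; everything after that has to run unattended, which is exactly why the failure modes discussed below matter.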

This system interacts with different subsystems: relational database management systems, git, and Amazon Web Services such as Elastic Compute Cloud (EC2), Elastic MapReduce (EMR) and Simple Storage Service (S3). There were two main non-functional requirements for this system.

  1. We should be able to incrementally improve the algorithms we were using to generate the information
  2. The system should deliver results on a daily basis

As we change the algorithms (the code, obviously) systematically, we have to make sure the system does not fail because of it. Any change to the system therefore has to be made with extreme caution to avoid catastrophic failures.

Avoiding Failures

There are numerous ways in which complex systems can fail. A good list of reasons can be found in Dr. Richard I. Cook's paper "How Complex Systems Fail". In this section, I will outline some of the failure modes from this paper that we had to anticipate, and how we addressed them.

Complex systems contain changing mixtures of failures as they are not perfect

Complex systems have many moving pieces. Because the technology and the processes keep changing, latent failures can creep in unnoticed. Sometimes analysing an error is challenging simply because there are so many moving dependencies holding the system together.

Change introduces new forms of failure

As I mentioned earlier, our system shouldn't fail even though we change our algorithms from time to time. The problem with implementing a change is that so many pieces need different configurations when testing and deploying the code: we have cron jobs, cluster configurations and data source configs that all differ between testing and production. This demands a strict protocol for change management. One approach is to maintain standard checklists within the team so that a strict protocol is followed whenever a new change is deployed. Some example items from these checklists are:

  • Record the git hash of the last stable version of the code
  • Check the cluster configuration files (paying extra attention to the most critical parameters)
  • Check that the repository is on the correct branch
  • Verify the time and command parameters in crontab, and so on…

These checks allow us to make sure that vital areas of the unified system are validated before deploying the changes to production.
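
Some of these checks are easy to automate. Below is a rough sketch of a pre-deployment check written against a setup like ours; the branch name, config file path and crontab marker are assumptions made for illustration, not our actual values.

    #!/usr/bin/env python
    """Hypothetical pre-deployment checklist: record the stable git hash and
    verify the branch, cluster config and crontab before rolling out a change."""
    import subprocess

    EXPECTED_BRANCH = "production"        # assumed branch name
    CLUSTER_CONFIG = "conf/cluster.yaml"  # assumed cluster configuration file
    CRON_MARKER = "run_daily_pipeline"    # assumed string that must be in crontab

    def sh(cmd):
        return subprocess.check_output(cmd).decode().strip()

    # Record the git hash of the last stable version of the code
    stable_hash = sh(["git", "rev-parse", "HEAD"])
    with open("last_stable_hash.txt", "w") as fh:
        fh.write(stable_hash + "\n")
    print("recorded stable hash:", stable_hash)

    # Check that the repository is on the correct branch
    branch = sh(["git", "rev-parse", "--abbrev-ref", "HEAD"])
    assert branch == EXPECTED_BRANCH, "on %s, expected %s" % (branch, EXPECTED_BRANCH)

    # Check that the cluster configuration file has no uncommitted edits
    dirty = sh(["git", "status", "--porcelain", CLUSTER_CONFIG])
    assert dirty == "", "uncommitted changes in %s" % CLUSTER_CONFIG

    # Verify that the crontab still contains the trigger for the daily run
    assert CRON_MARKER in sh(["crontab", "-l"]), "daily trigger missing from crontab"

    print("all pre-deployment checks passed")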

Complex Systems are heavily defended against failures

Complex systems are inherently and unavoidably hazardous, so they are built to be as fail-proof as possible; the high impact of the consequences demands nothing less. Because of this, a lot of exception handling and good coding practice goes into building such systems. In the system in question, the project owner defines a common standard for how functions and data items should be introduced to the main system. There is a defined structure that must be followed when introducing

  • new data sources
  • new lookup tables
  • new machine learning classifiers
  • new data fields, and so on

This allows different people in the team to understand and review code written by other members, which is very important when several people are pushing different features that get merged into the same code base. It also helps immensely when we have to work with someone else's code while mitigating a failure quickly.
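
As an illustration of what such a defined structure might look like, here is a minimal sketch of a common interface for data sources. It is my own simplified example rather than the project's actual code; the class and method names are assumptions.

    """Hypothetical common interface that every new data source has to follow,
    so that any team member can read, review and debug someone else's source."""
    from abc import ABC, abstractmethod

    class DataSource(ABC):
        """Base class every new data source must extend."""

        name = None  # human-readable identifier, used in logs and reports

        @abstractmethod
        def load(self, date):
            """Return the raw records for the given date."""

        @abstractmethod
        def validate(self, records):
            """Raise if the loaded records look wrong (empty, malformed, ...)."""

    class CustomerEventsSource(DataSource):
        """Example concrete source; the data it returns is a placeholder."""
        name = "customer_events"

        def load(self, date):
            # The real implementation would read from S3 or a database; a
            # dummy record keeps the sketch self-contained.
            return [{"date": date, "event": "example"}]

        def validate(self, records):
            if not records:
                raise ValueError("no customer events loaded for this date")

The same idea applies to lookup tables, classifiers and new data fields: one agreed shape, so that review and firefighting never depend on a single person's memory.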

Catastrophe is always round the corner

In addition to coding practices, we keep backup scripts that run stable code to mitigate failures when they occur. For instance, we keep scripts that let us resume the system with a previous stable version of the code in case the newer version fails unexpectedly. Keeping them prepared (URGENTsystemX.sh) means we can trigger them immediately, without having to troubleshoot the existing system before reacting. This takes the pressure off when debugging the failure and lets us do a better job of fixing the actual problem. This approach may not work for every system, but where multiple versions of your system can produce the desired output, it is a great way to mitigate failures without causing a chain reaction.
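
A minimal sketch of that safety net is below. The script names are hypothetical, with URGENTsystemX.sh standing in for the prepared fallback mentioned above; the point is simply that the stable path is a single, pre-tested command away.

    #!/usr/bin/env python
    """Hypothetical wrapper: run the current pipeline and, if it fails, fall
    back to the prepared script that runs the last known stable version."""
    import subprocess
    import sys

    CURRENT_PIPELINE = ["./run_pipeline.sh"]   # assumed entry point for the new code
    STABLE_FALLBACK = ["./URGENTsystemX.sh"]   # prepared script running stable code

    try:
        subprocess.check_call(CURRENT_PIPELINE)
    except subprocess.CalledProcessError:
        print("current pipeline failed; switching to the stable fallback",
              file=sys.stderr)
        # The fallback keeps results flowing while the new version is debugged
        # without time pressure.
        subprocess.check_call(STABLE_FALLBACK)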

Multiple small failures lead to a catastrophic failure collectively

In the domain of complex systems, some failures are not large enough to be noticeable on their own, but when these small errors add up they can collectively trigger a catastrophic outcome. Our systems compute daily statistics, which we then use to compute weekly and monthly statistics for our data services. A minor technical error that caused incomplete daily statistical records to be saved ultimately led to misleading weekly figures: the glitches did not noticeably affect the results at the daily level, but they produced unacceptable errors once aggregated. One way to mitigate this type of failure is to validate a few vital statistics that would expose such errors. The number of records per day, the size of the output file and so on can be checked for anomalies, which lets us dig deeper into the problem whenever we observe values that are unlikely to occur (a very small number of records, fewer partitions in a file than expected, and so on).
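
For example, a handful of cheap sanity checks on each daily output can catch these silent partial failures before they feed into the weekly aggregates. The thresholds and file layout below are invented for illustration; real values would come from the historical behaviour of your own data.

    """Hypothetical sanity checks run on a daily output before it is allowed
    into the weekly and monthly aggregations."""
    import os

    # Thresholds invented for illustration.
    MIN_RECORDS_PER_DAY = 100000
    MIN_OUTPUT_BYTES = 50 * 1024 * 1024
    MIN_PARTITIONS = 8

    def validate_daily_output(records_written, output_dir):
        """Raise if the daily output looks anomalous (too few records,
        suspiciously small files, too few partitions)."""
        part_files = [f for f in os.listdir(output_dir) if f.startswith("part-")]
        total_bytes = sum(os.path.getsize(os.path.join(output_dir, f))
                          for f in part_files)

        if records_written < MIN_RECORDS_PER_DAY:
            raise ValueError("only %d records written today" % records_written)
        if total_bytes < MIN_OUTPUT_BYTES:
            raise ValueError("daily output is only %d bytes" % total_bytes)
        if len(part_files) < MIN_PARTITIONS:
            raise ValueError("only %d partitions in the output" % len(part_files))

    # Example call: validate_daily_output(250000, "/data/output/2016-07-01")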

Post-accident attribution to a root-cause is fundamentally wrong

A common mistake most of us make after an accident is trying to isolate a single root cause. Although this works for simple systems with a fairly straightforward structure, it seldom works for complex systems. Most of the time the outcome is produced by a chain of failures in different parts of the system; the failures can be dependent on or independent of each other, and no individual incident is sufficient to break the system. Let's look at an example. Imagine a system failing because it couldn't reach one or several components of a geographically distributed DBMS. The natural instinct of the engineers would be to find the people running the DBMS and push them to improve availability. By taking this action, we overlook the fact that our own system is not defended against DBMS service outages. There are multiple contributors to the failure, even though there is a starting point, and trying to single out one part of the system (and the person who built it) as the root cause only shows a lack of understanding of the system; it is mainly the human and cultural urge to dump the blame on an individual entity. A better approach would be to also assess the in-house system and implement some approximate querying mechanism that uses whichever sites are available. Which brings me to the next point.
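
To make that last idea concrete, here is a rough sketch of such a defence: the system queries whichever replicas of the distributed DBMS it can reach and carries on with an approximate answer instead of falling over. The site list and the query helper are placeholders I have made up for the example.

    """Hypothetical defence against partial DBMS outages: query every reachable
    site and continue with whatever subset of the data is available."""

    # Made-up replica endpoints of the geographically distributed DBMS.
    SITES = ["db-eu.example.com", "db-us.example.com", "db-ap.example.com"]

    def query_site(host, sql):
        """Placeholder for the real client call (e.g. a JDBC or psycopg2 query)."""
        if host.startswith("db-ap"):                  # simulate one region being down
            raise ConnectionError("%s unreachable" % host)
        return [{"site": host, "rows": 42}]           # fake result for the sketch

    def query_available_sites(sql):
        results, failed = [], []
        for host in SITES:
            try:
                results.extend(query_site(host, sql))
            except ConnectionError:
                failed.append(host)
        if not results:
            raise RuntimeError("no DBMS site reachable: %s" % ", ".join(failed))
        if failed:
            # Flag the answer as approximate instead of failing the whole run.
            print("warning: results exclude unreachable sites:", ", ".join(failed))
        return results

    # Example call: query_available_sites("SELECT count(*) FROM daily_stats")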

Views of ’cause’ limit effectiveness of defenses against future events

Following on from the earlier point, I cannot emphasise enough the importance of looking at the big picture when mitigating accidents in complex systems. I came across a perfect example a couple of days ago, when our Spark clusters started throwing a Java exception that blocked SSL handshake renegotiation. This error had never been thrown before and only appeared after we upgraded the Java version we were using with Apache Spark. When we figured out that the exception was only thrown on long-running Spark jobs, the immediate remedy was to keep our data processes short; alternatively, we could have downgraded our systems. But once we dug into the problem, we realised the error was a side effect of hardening the TLS protocol stack against the POODLE vulnerability, and that we could make it go away simply by downgrading the TLS version. In the context of this example, even downgrading the TLS version is not the wise solution, because it leaves the clusters vulnerable to attack.

Safety is a characteristic of the system, not of its components

The safety of a complex system is not represented individually by

  • quality of defense in individual software components
  • defense mechanisms used in integrating the components together
  • level of expertise people bring to the system
  • process integrity

but by the mix of all these parts together. The best way to defend a stable complex system against accidents is to introduce changes in a structured manner and stabilise the system as a whole.

Improve system components

  1. Having code reviews before pushing code to the repository
    1. Code review for logical correctness
    2. Peer review increases the chance that someone sees the bigger picture
    3. Simple features should be written simply; complex features should still be possible
    4. Reviewing source code for maintainability is a must. If no one wants to review your code, no one will want to deal with it when it fails. Messy, unreadable code is unacceptable!

Improve processes

  1. Having a source control strategy for R&D and production
    1. Put new features in new branches
    2. Review and merge them as soon as they are finished
    3. Keeping a separate production branch avoids a lot of merge accidents

Improve people

  1. Train for hazardous situations
    1. Exposure to hazardous situations helps a lot when a real one comes your way
    2. Netflix's Chaos Monkey is a great example
    3. Familiarise the team with different parts of the system by getting individuals to work on multiple components; this also gives the team multiple domain experts per component

Conclusions

Complex systems may not be easy, but they are possible. By carefully investigating and analysing a complex system as a unified entity, many of the problems that arise from its complexity can be managed. One of the primary lessons I have learned from this experience is that complex systems can be done right. The key to success is to keep the focus on the long-term solution rather than hiding the issue behind a temporary fix. A temporary fix that brings the system back up is very important, but it shouldn't be your stopping point: hold on to the issue until you have systematically mitigated the latent fault.

*** This is my first blog post in several years, so I am sure it is nowhere near perfect. I would really welcome your feedback or comments on this post so that I can shape it up and do a better job next time. Thanks a lot for taking the time to read it. I hope it was helpful. 🙂
