There is a saying in Latin that I use for my Cloud Migration clients:
Si vis Victoria, embrace Defectum
This translates into:
If you want success, embrace failure
This means that to succeed in any kind of endeavor, you should accept that failure is part of the nature of the work you want to do. You need to be ready to respond to failures, and to design and redesign your approach until you incrementally arrive at the stable and robust position you want to achieve.
The difference between on-premises deployment and Cloud
When you want to create a reliable application in the cloud, you need to take a different path than you would if it were hosted on the premises of the organization you work for.
When you are coding on the solid grounds of your organization, you build reliability and robustness by using more expensive (higher-end) infrastructure (mostly hardware) to respond to the expected rising demand on your applications. In the Cloud, you instead design for a lower baseline capacity that serves normal demand levels, and then “Scale-Out” when demand rises (for reasons such as users’ preferred time of day or day of week, or seasonal or occasional factors).
The magic behind the low cost of Cloud environments is cheap hardware (aka “Commodity” hardware). It is inexpensive enough that machines can simply be joined up when extra capacity is needed, unplugged when demand drops and, more importantly, tossed into the trash when they fail.
When you have access to such a level of hardware abundance (for a fee, of course), you should shift your architecture’s focus from avoiding failures and improving the mean time between failures towards improving the mean time to restore.
You need to be ready to embrace failures and orchestrate your resources for the fastest possible recovery from them.
Commodity hardware is one of many reasons why failure is inevitable, but it also means you can affordably get access to large amounts of it, enough to put together a fast recovery model and minimize the impact of failures.
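To see why recovery time matters so much, here is a minimal sketch based on the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The function name and the hour figures are made up for illustration and are not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers, for illustration only.
# High-end hardware: fails rarely, but takes hours to repair.
sturdy_but_slow = availability(mtbf_hours=2000, mttr_hours=4)
# Commodity hardware: fails ten times as often, but is replaced in 3 minutes.
flaky_but_fast = availability(mtbf_hours=200, mttr_hours=0.05)

print(f"rare failures, slow repair:     {sturdy_but_slow:.5f}")
print(f"frequent failures, fast repair: {flaky_but_fast:.5f}")
```

With these assumed figures, the setup that fails more often still comes out ahead, because each outage is so short.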
Interestingly enough, the reasons for (or sources of) failure cover a very wide range (from Nature to Humans):
- Natural Disasters: Your data-center can have outages due to natural disasters. In the worst case, you can lose it entirely to floods or wildfires.
- Sabotage or Terrorism: Your data-center can have outages due to someone’s intentional acts (including a terrorist attack).
- Failure of Commodity Hardware: Hardware can fail for hundreds of reasons, such as the loss of any key component (from a failing cooling fan that overheats components, to burned-out CPUs, hard disks, networking equipment …)
- Bad Coding: A Severity 1 error has made its way into your deployed Production code. It behaves in an unexpected manner that the application has no proper handling for, and it causes a cascading negative effect throughout the service, up to bringing it to a halt.
- Untimely Shutdown of Nodes: The Orchestrator that you are using (i.e. the IaaS or PaaS) decides it is time to reduce the capacity that your application is using (i.e. scale down) and drops the number of instances. The timing is unfortunate: the instance cannot shut down cleanly, and it ends up crashing the node and failing a service that a customer was using.
- Untimely Deployment of Updates: Sometimes deploying an update causes a crash in the receiving node. If the service was busy responding to a customer’s request at that time, this causes a service disruption (and if it is a complex process, like a financial transaction, you would need to do a cascading roll-back).
- Large Scale Failures (Black Outs): In larger-scale (and of course less frequent) scenarios, an entire AZ (availability zone) or Region may go dark due to a large power failure in the area.
Now that you have established a number of ways that failures can happen (and inevitably some of them will), you need to make sure your architecture takes them into account. Enhance your design for resiliency, and use redundancy to avoid creating a single point of failure for your services.
Avoiding single points of failure also means not relying on a single machine for a service (as that machine will eventually fail).
Cat or Cattle?
One recommendation for operational robustness is to treat your servers like cattle, and not like your cat (or pets in general).
That means your applications and services should not become too attached to a certain server. That level of attachment puts you at a high risk of losing your capacity to serve your customers if something happens to that server.
If your servers are treated as part of a herd, with almost no preference or discrimination among them, then when something happens to one of them you can switch to another one, or retire some and bring up others quickly and easily, instead of trying to fix them and bring them back online (as you would if your cat was sick and you needed to care for it until it came back to full health).
Design to Recover
When designing the applications (and services), you can benefit a lot from running a Failure Mode and Effects Analysis (FMEA).
Failure mode and effects analysis (FMEA), often written with “failure modes” in the plural, was one of the early structured approaches using “systematic techniques for failure analysis.”
FMEA is often used as the first stage of a system reliability study. It reviews as many components and systems as possible to raise the probability of identifying failure modes and rooting out their causes and effects. It can analyze a mix of Functional, Design, or Process aspects of the failures. Numerous variations of FMEA worksheets are used to capture information on each component, its failure modes, and their resulting effects on the rest of the system. FMEA can be a qualitative analysis, but it may be put on a quantitative basis when mathematical failure rate models are combined with a statistical failure mode ratio database.
Sometimes FMEA is extended to FMECA (failure mode, effects, and criticality analysis) to indicate that criticality analysis is performed too.
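One common way to rank the rows of an FMEA worksheet is the Risk Priority Number (RPN), the product of severity, occurrence, and detection scores, each usually rated from 1 to 10. The RPN is not described above; it is brought in here only as an illustration, and the components and scores below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    mode: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (almost certain)
    detection: int   # 1 (always detected) .. 10 (practically undetectable)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: higher means 'look at this first'."""
        return self.severity * self.occurrence * self.detection


# Hypothetical worksheet rows, for illustration only.
worksheet = [
    FailureMode("database", "primary node crash", 9, 3, 2),
    FailureMode("recommender", "timeout under load", 3, 6, 4),
    FailureMode("payment API", "silent data corruption", 10, 2, 8),
]

ranked = sorted(worksheet, key=lambda fm: fm.rpn, reverse=True)
for fm in ranked:
    print(f"{fm.rpn:4d}  {fm.component}: {fm.mode}")
```

A ranking like this is only a starting point for the discussion; the scores themselves are judgment calls made by the team.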
After you have identified the points of failure, you need to brainstorm and decide (and design) on how you should react when those failures happen:
- How do you get notified when each type of failure has happened (or is in progress)?
- How do you react, and what kind of mitigation or response will be initiated for each type of failure?
- How do you keep track of failures and monitor their life cycle?
Design for Auto-Recovery
When designing your applications, you should plan for their auto-recovery in case of failures. An Auto-Recovery design should survive a failure by detecting it, reacting to it, and keeping a record of the entire interaction. That record helps you assess the efficiency of the design and its mitigation measures, and improve them incrementally over time.
Here are some of the time-tested best practices:
Be Realistic in your Designs
Everyone has a blind spot toward their own errors! In many cases, you test for what you want to see the system do (aka Happy Paths), and do not look for failure paths.
Your application may be deployed and running in your cloud setting for quite a while before a failure path happens. Use techniques like Fault Injection to measure your application’s robustness. You can create intentional failures to see how your design will handle itself.
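As a rough illustration of Fault Injection, the sketch below wraps a backend call so that a configurable share of calls fails on purpose, then checks that the calling code survives. All names here (`inject_faults`, `fetch_price`, `price_or_fallback`) are hypothetical, and a real experiment would target real dependencies rather than an in-process stub.

```python
import random


class InjectedFault(Exception):
    """Raised on purpose by the fault injector."""


def inject_faults(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls fail deliberately."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item_id):
    # Stand-in for a real backend call (hypothetical).
    return {"item": item_id, "price": 42}


def price_or_fallback(fetch, item_id):
    """The code under test: it must survive a failing backend."""
    try:
        return fetch(item_id)
    except InjectedFault:
        return {"item": item_id, "price": None}  # degraded but valid answer


rng = random.Random(7)  # seeded so the experiment is repeatable
flaky_fetch = inject_faults(fetch_price, failure_rate=0.3, rng=rng)
results = [price_or_fallback(flaky_fetch, i) for i in range(1000)]
degraded = sum(1 for r in results if r["price"] is None)
print(f"{degraded} of {len(results)} calls exercised the failure path")
```

The point of the exercise is that the failure path runs hundreds of times in a test, instead of for the first time in Production.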
The Monkey behind the wheel
Chaos engineering is the practice of experimenting on a software system’s ability to withstand failure by putting it through turbulent and unexpected conditions.
In software development, a given software system’s ability to tolerate failures while still ensuring adequate quality of service — often generalized as resiliency — is typically specified as a requirement. However, development teams often fail to meet this requirement due to factors such as short deadlines or lack of knowledge in the field. Chaos engineering is a technique to meet the resilience requirement.
Chaos engineering can be used to achieve resilience against:
- Infrastructure failures
- Network failures
- Application failures
The Simian Army is a suite of tools developed by Netflix to test the reliability, security, or resiliency of its Amazon Web Services infrastructure and includes the following tools:
Netflix invented Chaos Monkey in 2011 to challenge the resilience of its IT infrastructure. It intentionally disables computing units in Netflix’s production network to see how the remaining computing units withstand and absorb the impact of the outage. Chaos Monkey later expanded into a larger suite of tools called the Simian Army, a much more sophisticated toolset designed to simulate and assess the responses to various system failures. Netflix released the code behind Chaos Monkey in 2012 under an Apache 2.0 license.
The name “Chaos Monkey” is explained in the book Chaos Monkeys by Antonio Garcia Martinez:
“Imagine a monkey entering a “data-center”, these “farms” of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices, and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.”
Sitting at the top of the Simian Army hierarchy, Chaos Gorilla simulates a blackout at a full Amazon “Availability Zone” (which means one or more data centers serving a sub-area of a geographical region would seem to have lost connection to our network).
Latency Monkey, as the name implies, introduces communication delays to simulate bandwidth degradation as a result of a major outage in a network.
Doctor Monkey is responsible for “Pulse Checking” of services, by monitoring performance metrics such as CPU load to detect unhealthy computing instances, and for root-cause analysis and eventual fixing or retirement of those instances.
Janitor Monkey finds and disposes of unused resources to cut costs and to avoid creating paths of failure on streams that are not even needed.
Conformity Monkey is a compliance monitoring tool that determines whether a computing instance is nonconforming by testing it against a set of rules.
Security Monkey is derived from the Conformity Monkey. It searches for vulnerable computing instances (due to bad setup or service issues) and disables them.
10-18 Monkey is used to identify localization and internationalization issues (known by the abbreviations “l10n” and “i18n”) in applications serving customers across diverse geographic regions.
Byte-Monkey is specific to running and testing failure scenarios in JVM applications. It works by instrumenting application code on the fly and creating faults such as exceptions and latency issues.
ChaosMachine works on bytecode at the JVM level and analyzes the error-handling ability of each exception handling block involved in the application.
Gremlin is a quite sophisticated “Failure-As-A-Service (FAAS)” platform. It offers companies a fully hosted solution to experiment on their complex systems and fish out points of failure and design weaknesses before they become production issues and affect the business.
Facebook Storm, also known as the “Storm Project”, is used by Facebook to test its capacity and readiness against the loss of a datacenter. Facebook uses it regularly to check the fault tolerance of its serving infrastructure in response to major events.
As the first Open Source tool to apply Chaos Engineering to cybersecurity, ChaoSlingr performs security experimentation on AWS Infrastructure to proactively discover system security weaknesses in complex distributed system environments. It was first published on GitHub in September 2017.
Another Open Source tool, Chaos Toolkit, is designed to simplify using Chaos Engineering concepts. It aims to show that the experimentation approach can be applied at different levels: infrastructure, platform, and application. The Chaos Toolkit is licensed under Apache 2 and was first published in October 2017.
Mangle is a good tool to help you run chaos engineering experiments against your applications and your infrastructure components to evaluate their resiliency and fault tolerance. It allows you to simulate faults with very little pre-configuration. It can work with a wide variety of infrastructure components including K8S, Docker, vCenter, or any Remote Machine with an SSH connection enabled. It also provides a plugin that allows for defining custom faults using templates and running them (with very little coding).
Set a Leader for Coordination
In fault-tolerant design, use a practice called “Leader Election” to designate a “Coordinator” without creating a single point of failure. This way, should the “Coordinator” instance fail, another one can be selected in its place without disrupting the workflow.
You can either implement one yourself (not really recommended!) or use a market-tested solution such as Apache ZooKeeper for this purpose.
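To make the idea concrete, here is a toy, single-process sketch that imitates the approach commonly used with ZooKeeper: every candidate registers with an increasing sequence number, and the lowest surviving number is the leader. It does not talk to ZooKeeper at all; the `Cluster` class and the node names are hypothetical, and a real deployment should rely on a proven coordination service.

```python
class Cluster:
    """In-memory stand-in for a coordination service (toy example)."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}  # sequence number -> node name

    def join(self, name):
        """Register a candidate and hand it the next sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = name
        return seq

    def fail(self, name):
        """A crashed node's registration disappears from the cluster."""
        self._members = {s: n for s, n in self._members.items() if n != name}

    def leader(self):
        """The surviving member with the lowest sequence number leads."""
        if not self._members:
            return None
        return self._members[min(self._members)]


cluster = Cluster()
for node in ("node-a", "node-b", "node-c"):
    cluster.join(node)

first = cluster.leader()
cluster.fail(first)        # the Coordinator crashes
second = cluster.leader()  # the next candidate takes over
print(first, "->", second)
```

The hard parts that this sketch skips (detecting crashes reliably, network partitions, two nodes both believing they lead) are exactly why implementing it yourself is not recommended.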
Fail a few times before raising a flag
The idea is that, since there are hundreds of different ways to fail, if you run into a failure you should try a few more times to access the service.
Most of the time the failure is just the result of a momentary loss of access to the network or a resource (like a database), and access may well be restored by the next time you try. Design your applications’ front-end so that it retries connecting to the back-end services a few times before throwing any error messages. There is of course a limit to the number of retries.
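A minimal sketch of such a retry loop is shown below, with a growing pause between attempts and a hard cap on the number of tries. The names `call_with_retries`, `TransientError`, and `flaky_backend` are hypothetical, and the `sleep` parameter exists only so the example can run without actually waiting.

```python
import time


class TransientError(Exception):
    """A failure that is expected to go away on its own."""


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of retries: raise the flag
            sleep(base_delay * 2 ** (attempt - 1))


attempts = []


def flaky_backend():
    # Hypothetical backend that fails twice, then recovers.
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("connection reset")
    return "ok"


delays = []  # record the pauses instead of really sleeping
result = call_with_retries(flaky_backend, max_attempts=4, sleep=delays.append)
print(result, len(attempts), delays)
```

Only errors known to be transient are retried here; anything else propagates immediately, since retrying a permanent error just wastes time.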
Do not fail too many times! (Fail-Fast-Enough to avoid Backing-up)
To avoid creating a backed-up queue of failed requests, decide how many retries make sense for your applications before they raise a flag. You need to strike a working balance between the two: retry enough to ride out momentary glitches, but fail fast enough that you detect the problem and stop hammering the failed service.
It is recommended to use the “Circuit Breaker” Cloud design pattern to handle failure. The Circuit Breaker pattern, popularized by Michael Nygard in his book Release It!, is designed to prevent an application from repeatedly trying to execute a failing activity. Once it determines that the fault is not going to be fixed immediately, it lets the application continue without waiting for the fix. A Circuit Breaker can also detect whether the fault has been resolved, and if so, the application tries to invoke the operation again.
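The sketch below shows one possible shape of a Circuit Breaker with the usual three states: closed (calls pass through), open (calls fail immediately), and half-open (one trial call is allowed after a cool-down). The class, thresholds, and injected clock are assumptions made for illustration; this is not the implementation from the book or from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a backend that is presumed down."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # cool-down is over: allow a trial call
        return "open"

    def call(self, operation):
        if self.state == "open":
            raise CircuitOpenError("failing fast; backend presumed down")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        self.failures = 0       # success: close the circuit again
        self.opened_at = None
        return result


now = [0.0]  # fake clock so the example does not need to wait
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def broken():
    raise ConnectionError("backend down")


for _ in range(2):
    try:
        breaker.call(broken)
    except ConnectionError:
        pass
state_after_failures = breaker.state

now[0] = 11.0  # the cool-down has passed
state_after_timeout = breaker.state
recovered = breaker.call(lambda: "ok")
print(state_after_failures, state_after_timeout, recovered, breaker.state)
```

While the circuit is open, callers get an immediate error instead of queuing up behind a dead service, which is the fail-fast behavior described above.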
Break the Domino Effect
Sometimes a failure in one area sets off a domino effect of failures across your entire structure, especially when it directly impacts the timing or capacity of resources that the rest of the application needs, causing overflows and blockades.
To stop the dominoes from falling, partition the system into isolated groups, so that failures stay within their own partition and do not overflow into the others!
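This kind of partitioning is often called the Bulkhead pattern. A minimal sketch, assuming each partition simply gets its own fixed number of concurrent slots, could look like the following; the `Bulkhead` class and the partition names are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a partition has no free capacity left."""


class Bulkhead:
    """Gives one partition its own fixed pool of concurrent slots."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, operation):
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} partition is saturated")
        try:
            return operation()
        finally:
            self._slots.release()


reports = Bulkhead("reports", max_concurrent=1)
checkout = Bulkhead("checkout", max_concurrent=1)


def slow_report():
    # While this call occupies the only "reports" slot,
    # further report requests are rejected...
    try:
        reports.run(lambda: "second report")
    except BulkheadFullError:
        rejected = True
    else:
        rejected = False
    # ...but the "checkout" partition still has its own capacity.
    return rejected, checkout.run(lambda: "order placed")


outcome = reports.run(slow_report)
print(outcome)
```

A saturated reporting feature can then, at worst, reject more reports; it cannot use up the capacity reserved for checkout.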
Balance the Load
Your applications will go through spikes (sudden demand for more resources) that can overwhelm services on the back-end. One good practice is to use the Queue-Based Load Leveling pattern and design the work items to run asynchronously. The queue serves as a buffer that absorbs the peaks in the load without clogging the traffic.
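The effect of the buffer can be sketched with a small tick-by-tick simulation: requests arrive in bursts, but the worker never handles more than its fixed capacity per tick. The function name, the arrival numbers, and the capacity are assumptions chosen for illustration; a real system would use a message queue service rather than an in-memory deque.

```python
from collections import deque


def simulate(arrivals_per_tick, worker_capacity):
    """Return (processed per tick, backlog per tick) for a bursty workload."""
    queue = deque()
    processed_per_tick = []
    backlog_per_tick = []
    next_id = 0
    for arriving in arrivals_per_tick:
        # Producers enqueue work as fast as it arrives.
        for _ in range(arriving):
            queue.append(next_id)
            next_id += 1
        # The consumer drains at its own steady pace.
        done = 0
        while queue and done < worker_capacity:
            queue.popleft()
            done += 1
        processed_per_tick.append(done)
        backlog_per_tick.append(len(queue))
    return processed_per_tick, backlog_per_tick


arrivals = [2, 10, 1, 0, 0, 0]  # a spike of 10 requests in the second tick
processed, backlog = simulate(arrivals, worker_capacity=3)
print("processed:", processed)
print("backlog:  ", backlog)
```

The spike shows up as a temporary backlog instead of as an overloaded back-end, at the price of some requests waiting longer.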
Do not Get Glued to an Instance!
Regardless of whether you are running a stateless or stateful service, do not get yourself glued to a failing instance. If you cannot reach it, your design should allow you to fail over to the next available instance, or immediately start a new one. If you can design the service to be Stateless, like a web server, then keep a good number of instances behind a load balancer or traffic manager to give your application a very soft fail-over.
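A bare-bones version of that fail-over logic is sketched below: try each instance in turn and return the first successful answer. The instance names and the `InstanceDown` exception are hypothetical, and in practice a load balancer with health checks would usually do this job for you.

```python
class InstanceDown(Exception):
    """Raised when an instance cannot be reached."""


def make_instance(name, healthy):
    """Build a hypothetical stateless handler for one instance."""
    def handle(request):
        if not healthy:
            raise InstanceDown(name)
        return f"{name} handled {request}"
    return handle


def call_with_failover(instances, request):
    """Try each instance in order; return the first successful response."""
    errors = []
    for handle in instances:
        try:
            return handle(request)
        except InstanceDown as exc:
            errors.append(str(exc))
    raise RuntimeError(f"all instances failed: {errors}")


pool = [
    make_instance("web-1", healthy=False),  # the instance you must not get glued to
    make_instance("web-2", healthy=True),
    make_instance("web-3", healthy=True),
]
response = call_with_failover(pool, "GET /cart")
print(response)
```

This only works smoothly because the handlers are stateless; any instance can answer any request.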
Do not Get Over-Distributed!
Unless it is absolutely necessary, avoid designing your workflows around a distributed transaction model. Distributed transactions impose considerable processing overhead, because they need constant, real-time coordination across services and resources, which creates too many failure points. Rolling back failed distributed transactions is another big design headache, as your design would need to consider too many failure scenarios to process them properly.
Keep Resource Vampires in Check!
You might run into scenarios where a few customers create heavy demand on resources and lower your availability to other customers. While you may offer different service availability tiers, you need to make sure each customer is served as per the agreement and that overages are not going to suck your services’ blood dry! The best practice is to use the Throttling pattern to keep each flow within its set bandwidth.
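One well-known way to throttle is a token bucket per customer: each request spends a token, and tokens refill at a fixed rate up to a burst limit. The sketch below is a simplified, single-threaded illustration; the rates, the customer names, and the injected clock are assumptions.

```python
import time


class TokenBucket:
    """Allows short bursts, but caps the sustained request rate."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = burst
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill according to the time elapsed, never above the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


now = [0.0]   # fake clock so the example runs instantly
buckets = {}  # one bucket per customer


def allow_request(customer):
    if customer not in buckets:
        buckets[customer] = TokenBucket(rate_per_second=1, burst=2, clock=lambda: now[0])
    return buckets[customer].allow()


heavy = [allow_request("heavy-user") for _ in range(5)]
light = allow_request("light-user")
now[0] = 1.0  # one second later, the heavy user has earned one more token
heavy_later = allow_request("heavy-user")
print(heavy, light, heavy_later)
```

Because every customer has a separate bucket, the heavy user runs out of tokens without affecting anyone else.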
Drop the Frills to keep the Core running
If a troubled software component or piece of the workflow is not vital to the main service being rendered, you should be able to drop it and still provide the main part of the service. For example, if your customer is leading their clients through a shopping cart to make a sale, then the cart and the financial transaction are the core functions you need to keep working while your broken recommender system or up-sell features are under repair.
Reduced functionality is better than no functionality while you are trying to fix the issue.
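Using the shopping cart example, a minimal sketch of this kind of degradation might look as follows. The `recommend` and `checkout` functions and the order fields are hypothetical; the point is only that a failure in the optional part is caught and the core result is still returned.

```python
def recommend(cart):
    # Simulates the broken, non-essential recommender (hypothetical).
    raise TimeoutError("recommender is down")


def checkout(cart, recommender=recommend):
    """Complete the sale even if the optional recommender fails."""
    # Core function: must always work.
    order = {"items": [name for name, _ in cart],
             "total": sum(price for _, price in cart)}
    # Frill: nice to have, safe to drop.
    try:
        order["suggestions"] = recommender(cart)
    except Exception:
        order["suggestions"] = []
        order["degraded"] = True  # recorded so the failure stays visible
    return order


cart = [("book", 12.5), ("pen", 1.5)]
order = checkout(cart)
print(order)
```

Flagging the order as degraded keeps the failure from going unnoticed while the customer still completes the purchase.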
Have Multiple Checkpoints for Long Processes
When you cannot use short processes or avoid distributed ones, rolling back a very long-running process and ramping it up again to continue the work is both resource-intensive and complex.
To better manage long or widely distributed processes, establish checkpoints. They make it less costly and much faster for the next computing instance (aka virtual machine) to ramp up, take over the failed work, find where the process stopped, and continue from that point.
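A simple way to picture checkpointing is a job that records its progress after each work item, so that a replacement instance can resume from the last record. The sketch below stores the checkpoint in a local JSON file purely for illustration; the function, file name, and simulated crash are assumptions, and a real job would write to durable shared storage.

```python
import json
import os
import tempfile


def process_items(items, checkpoint_path, crash_after=None):
    """Sum items, saving progress so that a restart can resume mid-way."""
    start, total = 0, 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
        start, total = state["next_index"], state["total"]
    for i in range(start, len(items)):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("simulated instance failure")
        total += items[i]
        # Write to a temp file, then rename, so a crash in the middle of
        # a write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_index": i + 1, "total": total}, f)
        os.replace(tmp_path, checkpoint_path)
    return total, start


items = list(range(1, 11))
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "job.checkpoint")
    try:
        process_items(items, path, crash_after=6)  # first instance dies mid-way
    except RuntimeError:
        pass
    total, resumed_from = process_items(items, path)  # replacement takes over
print(total, resumed_from)
```

Checkpointing after every item is the simplest choice for a demo; a real job would balance checkpoint frequency against the cost of writing them.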
In general, to achieve success in Cloud solution delivery, you need to be ready to embrace Failures and have a plan to address them based on their complexity and their area of impact.
This sits at the core of the DevOps team’s Agility: you should have an incremental plan for improving the robustness and recovery of your solutions through ongoing improvement of your designs and of your measures for avoiding and recovering from Failures.
Failing Fast, as a key driver of DevOps teams’ success, provides the Quick Feedback Loop that DevOps teams thrive on and incorporate in their ever-improving delivery pipeline.
Written by Arman Kamran, CTO of Prima Recon and enterprise scaled Agile Transition coach.