How to master disaster recovery in a DevOps world

Organizations that adopt DevOps methodologies are seeing real benefits in their disaster recovery planning as a result.


According to a 2015 survey by IT Revolution Press in conjunction with Puppet Labs, organizations using DevOps deploy code 30 times more frequently than others, often multiple times per day. Moreover, their change failure rate is cut in half, and services are restored up to 168 times faster than at non-DevOps organizations.

DevOps: Failing more quickly, and recovering faster

Let’s focus on those last two points for a moment. One thing is certain: Embracing DevOps also pays off from a disaster recovery standpoint, because the tools and procedures you use to move applications from development to testing to production and back again can also be applied to failing over and recovering from disasters and service interruptions. The same tools that automate the entire DevOps life cycle can also help you make the most of the resources you already have for recovery purposes.

There are plenty of open-source tools to help with this automation, such as Chef and Puppet, which create, launch and configure new virtual machine instances in an automated, repeatable way. They even work across security boundaries, deploying on your private laptop, in your own data center, or in the public cloud — Amazon Web Services and Microsoft Azure are two major public cloud providers that support Chef and Puppet. By using these tools, you can not only automate the deployment of new code your developers write, using carbon copies of their development machine environment and configuration; you can also orchestrate and launch a backup environment in the cloud in a matter of minutes.
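
To make that concrete, here is a minimal sketch of what "launch a backup environment in the cloud in minutes" might look like, assuming AWS via boto3 with a Puppet agent bootstrapped through instance user data. The AMI ID, key pair, security group and Puppet server hostname are placeholders, not values from the article.

```python
# Minimal sketch: launch a recovery/test instance on AWS and have it
# bootstrap itself with Puppet via user data, so it converges to the
# same configuration as the production role it mirrors.
# All identifiers below are placeholders for illustration only.
import boto3

USER_DATA = """#!/bin/bash
# Install the Puppet agent and pull the same configuration the
# production nodes use.
apt-get update && apt-get install -y puppet-agent
/opt/puppetlabs/bin/puppet agent --server puppet.example.internal --onetime --no-daemonize
"""

def launch_recovery_instance(ami_id="ami-0123456789abcdef0", count=1):
    """Start EC2 instances tagged as disaster recovery capacity."""
    ec2 = boto3.client("ec2", region_name="us-east-1")
    response = ec2.run_instances(
        ImageId=ami_id,
        InstanceType="m5.large",
        MinCount=count,
        MaxCount=count,
        KeyName="dr-keypair",                       # placeholder key pair
        SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder security group
        UserData=USER_DATA,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "purpose", "Value": "disaster-recovery"}],
        }],
    )
    return [i["InstanceId"] for i in response["Instances"]]

if __name__ == "__main__":
    print(launch_recovery_instance())
```

The same script, pointed at a different region or account, works just as well for spinning up a DevOps test environment as it does for standing up recovery capacity.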

If you combine Puppet and Chef with a tool like Oracle Ravello (if your public cloud is Amazon Web Services or Google) for your VMware and KVM virtual deployments, you can literally nest hypervisors and run your virtual environments in the public cloud as they are — with networking, addressing and more — without any changes, deployed in an automated way. These are powerful tools from both a DevOps and a disaster recovery perspective. You can quickly build software, test it, deploy it, mitigate bugs and provide robust failover and recovery solutions using these same tools. Continuous deployment becomes continuous disaster recovery and failover.


The prime tenet of DevOps is that developers should be writing code and testing their apps in a copy of the production environment in which their code will run. This is often nearly achieved using virtual machines and container solutions like Docker running on individual laptops or desktops. That is much better, of course, than writing code blindly in Xcode or Visual Studio and then shipping packages to system administrators to deploy, but I said “nearly” because even this type of virtualization does not fully simulate the capacity constraints of a real-world production environment. It’s difficult to test a realistic load against a containerized microservice running on an Apple MacBook Air, for example, whereas load testing against a full Azure Stack service deployment produces far more realistic, actionable results.
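
As a rough illustration of the kind of load a laptop-hosted container will struggle with, here is a minimal concurrency smoke test, assuming Python with the requests library. The endpoint URL and request volumes are placeholders; a real load-testing exercise would use a dedicated tool, but even this simple sketch behaves very differently against a MacBook-hosted container than against a production-scale deployment.

```python
# Minimal sketch of a concurrency smoke test against a deployed service.
# The URL, request count and concurrency level are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

TARGET = "https://staging.example.internal/api/health"  # placeholder endpoint

def hit(_):
    """Issue one request and record its latency."""
    start = time.perf_counter()
    resp = requests.get(TARGET, timeout=5)
    return resp.status_code, time.perf_counter() - start

def run_load(total_requests=500, concurrency=50):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(hit, range(total_requests)))
    latencies = sorted(t for _, t in results)
    errors = sum(1 for status, _ in results if status >= 500)
    p95 = latencies[int(len(latencies) * 0.95)]
    print(f"p95 latency: {p95:.3f}s, server errors: {errors}")

if __name__ == "__main__":
    run_load()
```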

Disaster recovery environments as potential DevOps workspaces

Some companies find that, as they embrace DevOps more deeply, their developers constantly ask for access to the hot-spare disaster recovery environments to mitigate that “single laptop or desktop” constraint. Generally, midsize and larger organizations have significant investments in backup infrastructure that is a near carbon copy of the environment running critical production workloads, standing by in case those workloads need to be ported over in the event of a service disruption or a disaster.

Today, they are mostly standing idle except for a few recovery exercises each year. But think about it: That’s a lot of unused capacity sitting there waiting for an event that may happen for just a few hours a year. Is there a way to take advantage of virtualization technologies, storage area networking and software-defined networking so that the hot spare disaster recovery capacity can be used as a DevOps “workspace” for testing and planning while also remaining available for its original purpose — to accept failed-over workloads?

The answer is yes. Hypervisors and virtual machines can be spun up and down in a matter of seconds, and software-defined networks can be rerouted and shunted to different connections and endpoints via scripts in an automated fashion. In fact, in most cases the data is already accessible to the environment, so regular application and operations testing can actually work with real data without long copying times. Could you ask for a better real world performance testing environment than to work in essentially a carbon copy of your production infrastructure?
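
Here is a minimal sketch of what "spinning up a workspace" on that idle capacity could look like, assuming the DR hosts run KVM or Xen managed through libvirt. The connection URI and guest names are placeholders, and the guests are assumed to be already defined but powered off.

```python
# Minimal sketch: boot a set of predefined "workspace" guests on the
# idle hot-spare hypervisors so developers can test at production scale.
# URI and domain names are placeholders for illustration only.
import libvirt

WORKSPACE_DOMAINS = ["devtest-web-01", "devtest-app-01", "devtest-db-01"]

def start_workspace(uri="qemu+ssh://dr-host-01.example.internal/system"):
    conn = libvirt.open(uri)
    try:
        for name in WORKSPACE_DOMAINS:
            dom = conn.lookupByName(name)
            if not dom.isActive():
                dom.create()  # boot the defined-but-stopped guest
                print(f"started {name}")
    finally:
        conn.close()

if __name__ == "__main__":
    start_workspace()
```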

When failover is actually needed, you can have your monitoring system fire those scripts, change the storage endpoints and shut down the virtual machines — and then you have your hot spares back. At the conclusion of the “disaster,” you can manually restore services by reverting all of the changes those automated scripts made. PowerShell is a great way to do this in Windows, Hyper-V and VMware environments, and Bash scripts work well for Xen and other hypervisors, too. Of course, Puppet, Chef and Ravello can help with this sort of orchestration as well.
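
To show the shape of that failover hook, here is a minimal sketch of an entry point a monitoring system could call. The helper functions are hypothetical stand-ins for the PowerShell, Bash, Puppet or Chef automation described above, not a real API.

```python
# Minimal sketch of a failover hook fired by a monitoring system.
# repoint_storage, stop_workspace_vms and start_recovery_vms are
# hypothetical placeholders for site-specific automation.
import logging

log = logging.getLogger("failover")

def stop_workspace_vms():
    """Placeholder: shut down dev/test VMs borrowing the hot-spare capacity."""
    log.info("Stopping DevOps workspace VMs to free hot-spare capacity")

def repoint_storage(target="dr-site"):
    """Placeholder: switch storage endpoints over to the DR environment."""
    log.info("Re-pointing storage endpoints to %s", target)

def start_recovery_vms():
    """Placeholder: boot the recovery copies of the production workloads."""
    log.info("Starting recovery virtual machines")

def on_production_failure(event):
    """Entry point the monitoring system fires when production fails."""
    log.warning("Failover triggered by event: %s", event)
    stop_workspace_vms()   # give the hot spares back
    repoint_storage()      # move storage endpoints to the DR environment
    start_recovery_vms()   # bring up the failed-over workloads

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    on_production_failure({"source": "monitoring", "check": "production-heartbeat"})
```

Reverting these same steps in reverse order, once the disaster is over, hands the capacity back to the developers.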

The idea here is to get some of that unused capacity doing something useful without undermining the reason it exists in the first place. Developers need access to that big iron to do real testing and tease out performance problems at higher capacities and loads than their development machines alone can support. Having hot standby infrastructure doing nothing but being hot and standing by is perhaps the polar opposite of embracing DevOps; by reimagining this sort of application, it’s possible to have your cake and eat it, too, if you will.

Questions to ask

As you begin thinking more about continuous disaster recovery, here are some points to consider with your team.


How do we “tabletop” our disaster recovery procedures? Who owns the checklist of procedures to follow? Who fires the scripts or is responsible for the automation? How can we simulate failure of a single application, an entire workload, and the infrastructure itself? What types of scenarios would cause failures in each of those key elements?

What natural employee strengths can I emphasize around disaster recovery? While DevOps tends to blend the roles of developers and operations folks, there will naturally be employees with stronger backgrounds and experience in operations who should be empowered to deal with the accountability that failover requires. On the development side, coders should be held accountable if their code causes a failover event, and those who gravitate toward being superior debuggers can pick up the slack here.

How can I better utilize the disaster recovery sites and existing infrastructure I’ve already paid for? Is that environment set up for easy virtualization buildup and tear-down? And if it isn’t, what do I need to do to get to that ready state?
