Disaster recovery in a DevOps world

Organizations that adopt DevOps methodologies are seeing real benefits from that approach.


Today, hot spare disaster recovery sites mostly stand idle except for a few recovery exercises each year. But think about it: That’s a lot of unused capacity sitting there waiting for an event that may happen for just a few hours a year. Is there a way to take advantage of virtualization technologies, storage area networking and software-defined networking so that the hot spare disaster recovery capacity can be used as a DevOps “workspace” for testing and planning while also remaining available for its original purpose, which is to accept failed-over workloads?

The answer is yes. Hypervisors and virtual machines can be spun up and down in a matter of seconds, and software-defined networks can be rerouted to different connections and endpoints by automated scripts. In fact, in most cases the data is already accessible to the environment, so regular application and operations testing can work with real data without long copying times. Could you ask for a better real-world performance testing environment than what is essentially a carbon copy of your production infrastructure? When failover is needed, you can have your monitoring system fire those scripts, change the storage endpoints and shut down the test virtual machines, and then you have your hot spares back. At the conclusion of the “disaster,” you can manually restore services by reverting all of the changes those automated scripts made. PowerShell is a great way to do this in Windows, Hyper-V and VMware environments, and Bash scripts are decent for Xen and other hypervisors, too. Of course, Puppet, Chef and Ravello can help out with this sort of orchestration as well.
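As a rough illustration of the "fire the scripts, then revert them" idea, here is a minimal Python sketch of a reversible failover runbook. Everything in it is an assumption made for illustration: the `Step` and `FailoverRunbook` names, the `build_demo_runbook` helper and the three steps are hypothetical, and the actions only append messages to a log. In a real environment each action would call whatever hypervisor, storage or network tooling you already use.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One reversible action in a (hypothetical) failover runbook."""
    name: str
    apply: Callable[[], None]   # what to do when failing over
    revert: Callable[[], None]  # how to undo it after the "disaster"


@dataclass
class FailoverRunbook:
    steps: List[Step]
    applied: List[Step] = field(default_factory=list)

    def fail_over(self) -> None:
        """Run each step in order and remember what was actually done."""
        for step in self.steps:
            step.apply()
            self.applied.append(step)

    def restore(self) -> None:
        """Undo the applied steps in reverse order."""
        while self.applied:
            self.applied.pop().revert()


def build_demo_runbook(log: List[str]) -> FailoverRunbook:
    """Build a demo runbook whose actions only record messages in `log`.

    The messages stand in for real calls to hypervisor, storage and
    network tooling; they are placeholders, not real commands.
    """
    def record(message: str) -> Callable[[], None]:
        return lambda: log.append(message)

    return FailoverRunbook(steps=[
        Step("test-vms",
             record("shut down test VMs"),
             record("start test VMs")),
        Step("storage",
             record("point storage at production replicas"),
             record("point storage at test volumes")),
        Step("network",
             record("route production traffic to DR site"),
             record("route production traffic to primary site")),
    ])


if __name__ == "__main__":
    events: List[str] = []
    runbook = build_demo_runbook(events)
    runbook.fail_over()   # what the monitoring system would trigger
    runbook.restore()     # what an operator would run afterward
    for line in events:
        print(line)
```

Recording each applied step is what makes the later restore mechanical: the revert runs in the opposite order, so traffic moves back to the primary site before the test environment is brought back up.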

The idea here is to get some of that unused capacity doing something useful without losing sight of why it exists in the first place. Developers need access to that big iron to do actual testing and to tease out performance problems at higher capacities and loads than their development machines alone can support. Having hot standby infrastructure doing nothing but being hot and standing by is perhaps the polar opposite of embracing DevOps; by reimagining how that infrastructure is used, it’s possible to have your cake and eat it, too.

Questions to ask

As you begin thinking more about continuous disaster recovery, here are some points to consider with your team.

How do we “tabletop” our disaster recovery procedures? Who owns the checklist of procedures to follow? Who fires the scripts or is responsible for the automation? How can we simulate failure of a single application, an entire workload, and the infrastructure itself? What types of scenarios would cause failures in each of those key elements?

What natural employee strengths can I emphasize around disaster recovery? While DevOps tends to blend the roles of developers and operations folks, there will naturally be employees with stronger tendencies and experiences in operations who should be empowered to deal with the accountability that failover requires. On the development side, coders should be held accountable if their code is responsible for causing a failover event, and those coders who gravitate toward being superior debuggers might want to pick up some slack here.

How can I better utilize the disaster recovery sites and existing infrastructure I’ve already paid for? Is that environment set up for easy virtualization buildup and tear-down? And if it isn’t, what do I need to do to get to that ready state?