The Cloudy Truth about Service Level Agreements

The most important issue is this: how should you think about application outages, and what are your options for improving uptime?

As a starting point, keep in mind Voltaire's observation: "Le mieux est l'ennemi du bien." Loosely translated, that means "The best is the enemy of the good." Applied to cloud computing, the lesson is this: don't avoid adopting a cloud provider because it can't guarantee 99.999 percent uptime when your own data centers fall far short of acceptable uptime.

If adopting cloud computing improves uptime significantly, it's the right thing to do. If there are no actual statistics on the uptime of one's own computing environment, that's a telling sign that moving to a cloud provider is a step in the right direction. It may not be perfect, but it's far better than an environment that can't even track its own uptime. Believe me, there are many, many IT organizations with nothing more than earnest assurances about their uptime performance.

Here are some steps you can take to improve your application uptime:

1. Architect your application for resource failure. Perhaps the greatest single step you can take to improve your application's uptime is to architect it so that it can continue performing in the face of individual resource failure (e.g., server failure). Redundancy of application servers ensures the application will continue working even if a server outage kills a virtual machine. Likewise, having replicated database servers means an application won't grind to a halt if one server hangs. Using an application management framework that starts new instances to replace failed ones ensures redundant topologies will be maintained in the event of an outage.
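The failover half of this step can be sketched in a few lines. The sketch below is illustrative only: the replica endpoints and the `fake_query` helper are hypothetical names, standing in for real replicated database servers behind your application.

```python
import random

# Hypothetical replica endpoints -- illustrative names, not a real service.
REPLICAS = [
    "db-primary.internal:5432",
    "db-replica-1.internal:5432",
    "db-replica-2.internal:5432",
]

class AllReplicasDown(Exception):
    pass

def query_with_failover(run_query, replicas=REPLICAS):
    """Try each replica in a shuffled order; fail over on per-node errors."""
    for endpoint in random.sample(replicas, len(replicas)):
        try:
            return run_query(endpoint)
        except ConnectionError:
            continue  # this replica is down; try the next one
    raise AllReplicasDown("no database replica responded")

# Usage: simulate an outage of the primary and watch the query succeed anyway.
def fake_query(endpoint):
    if endpoint == "db-primary.internal:5432":
        raise ConnectionError("primary is down")
    return f"rows from {endpoint}"

print(query_with_failover(fake_query))
```

The same pattern applies at the application-server tier: a load balancer or management framework that removes failed instances and starts replacements is doing this loop for you, continuously.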

2. Architect your topology for infrastructure failure. While judicious design can protect application availability in the event of an application hardware element failure, it can't help you if the application environment fails. If the entire data center that one's application runs in goes down, use of redundant application designs is futile. The answer in this case is to implement application geographic distribution so that even if a portion of one's application becomes unavailable due to a provider's large-scale outage, the application can continue to operate. This makes application design more complex, of course, but it provides a larger measure of downtime protection.
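A minimal sketch of the routing decision behind geographic distribution, assuming a hypothetical set of regions with health and latency data (in production these would come from live health checks, not a static table):

```python
# Hypothetical regions; health flags and latencies are illustrative.
REGIONS = {
    "ap-southeast": {"healthy": True,  "latency_ms": 30},
    "us-west":      {"healthy": True,  "latency_ms": 150},
    "eu-central":   {"healthy": False, "latency_ms": 280},  # simulated outage
}

def pick_region(regions=REGIONS):
    """Prefer the lowest-latency region that is currently healthy."""
    healthy = {name: r for name, r in regions.items() if r["healthy"]}
    if not healthy:
        raise RuntimeError("all regions are down")
    return min(healthy, key=lambda name: healthy[name]["latency_ms"])

print(pick_region())  # -> ap-southeast
```

If the nearest region suffers a large-scale outage, its health flag flips and traffic drains to the next-best region automatically. The hard part, which this sketch omits, is replicating application state across regions so the surviving region can actually serve the traffic.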

3. Architect your deployment for provider failure. Of course, it is possible for a cloud provider's complete infrastructure to go offline. Even though the circumstances under which this might occur are quite rare, it is within the realm of possibility. For example, the provider's entire network infrastructure could go down, or the cloud provider might abruptly shut down. Far-fetched, perhaps, but both scenarios have happened with online services in the past. The solution is to extend your application's architecture across multiple providers. Despite what many vendors will proclaim, doing so is extremely challenging, because providers' semantics vary, making it difficult to design an application that can accommodate their differing functionality. Nevertheless, it is possible to implement this application architecture with sufficient planning and careful design.
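One common way to tame those differing provider semantics is to code against a provider-neutral interface and hide each provider's quirks in an adapter. The sketch below is an assumption-laden illustration, not any real provider SDK: `InMemoryStore` stands in for a per-provider adapter, and `MirroredStore` writes to every provider while reading from the first one that answers.

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Provider-neutral storage interface; each adapter hides one
    provider's naming, consistency, and error-type quirks."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(ObjectStore):
    # Stand-in adapter; a real one would wrap a provider's SDK.
    def __init__(self):
        self._blobs = {}
    def put(self, key, data):
        self._blobs[key] = data
    def get(self, key):
        return self._blobs[key]

class MirroredStore(ObjectStore):
    """Write to every provider; read from the first that responds."""
    def __init__(self, providers):
        self.providers = providers
    def put(self, key, data):
        for p in self.providers:
            p.put(key, data)
    def get(self, key):
        last_error = None
        for p in self.providers:
            try:
                return p.get(key)
            except Exception as e:
                last_error = e  # this provider failed; try the next
        raise last_error

store = MirroredStore([InMemoryStore(), InMemoryStore()])
store.put("report.pdf", b"quarterly figures")
print(store.get("report.pdf"))
```

The design cost is exactly what the article warns about: the neutral interface can only expose functionality that every provider supports, so you give up provider-specific features in exchange for survivability.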

What should be obvious from this discussion is that higher levels of uptime certainty require increased levels of technical complexity, which translates into increased levels of investment.
