The recent storm-related outage at Amazon Web Services (AWS), which left Web sites including Netflix, Pinterest, and Instagram inaccessible, is just the latest in a string of costly cloud failures. Since 2007, a total of 568 hours of downtime at 13 major cloud services providers has had an economic impact of $71.7 million, according to the International Working Group on Cloud Computing Resiliency (IWGCR). Average downtime has been 7.5 hours per year, according to IWGCR, which works out to an availability rate of about 99.9 percent, well below the reliability required for mission-critical systems. "Cheap cloud services can be expensive," says Kevin C. Taylor, partner in the business services department of law firm Schnader, Harrison, Segal & Lewis.
While the typical cloud contract contains uptime clauses and credits for missed service levels, it often fails to adequately protect the enterprise customer. Service-level agreement (SLA) credits, typically capped at a proportion of monthly service fees, "do not compensate for business losses associated with the downtime of a production application," explains Taylor. "Even in an extreme case of sustained and severe outages the credit amounts will be derisory--[say,] $20,000--in comparison to the business impact to the customer, which could potentially be in the millions."
But there are questions the intelligent customer can ask to make sure they are sheltered from potential storms in the cloud.
1. Does your baseline uptime SLA meet my business needs? Buyers used to five nines (99.999 percent uptime) will be disappointed by the 99.9 percent uptime SLAs of cloud providers. "It's one of the first terms they should ask their prospective provider about to see if they can do better," says Jim Slaby, research director of sourcing security and risk strategies for outsourcing analyst firm HfS Research. "Buyers should also negotiate well-defined recovery point and recovery time objectives for each service in their contract." You will have to pay more. The typical active-active service configuration required to deliver 99.999 percent reliability can add as much as 50 percent to monthly service costs, Slaby says.
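To make the gap between these uptime figures concrete, the following minimal sketch converts between annual downtime and availability. It assumes a 365-day year and treats availability as simple elapsed time; the function names are illustrative and do not come from any provider's SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (an assumption)


def availability_percent(downtime_hours_per_year: float) -> float:
    """Convert annual downtime in hours into an availability percentage."""
    return 100.0 * (1.0 - downtime_hours_per_year / HOURS_PER_YEAR)


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the minutes of downtime per year that an availability target allows."""
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR * 60.0


if __name__ == "__main__":
    # The 7.5-hour average downtime figure is the IWGCR number cited in the article.
    print(f"7.5 h/year of downtime = {availability_percent(7.5):.3f}% availability")
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, a 99.9 percent SLA permits roughly 8.8 hours of downtime a year, while five nines permits only a little over five minutes.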
2. How do you define "uptime" and "downtime"? "Sophisticated customers will clearly spell out exactly what is considered downtime," says Todd A. Fisher, partner with law firm K&L Gates. "Does it mean five percent of the end users are affected? Or 25 percent? Or 50 percent? What if the system is technically working, but is running so slowly that end users can't do their jobs effectively?"
Many cloud providers also include a list of exclusions to uptime guarantees, including emergency situations, outages of 15 minutes or less, and availability during certain times of day. "Buyers should expect small, but well-defined exclusion windows that don't count against downtime so the provider can perform routine maintenance," says Slaby.
Beware of overly broad exclusions. For example, telecom outages are typically excluded with no distinction between services purchased by the customer itself (a legitimate exclusion) and those of the provider, which should supply a redundant telecom architecture to prevent a single point of failure, says Dr. Jonathan Shaw, principal with outsourcing consultancy Pace Harmon. "You may see exclusions for 'emergency maintenance' with no constraints on when the provider can call a maintenance emergency, effectively giving them a 'get out of jail free' card for any outage," Shaw says.
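How much the definition and its exclusions matter can be shown with a small sketch that scores the same hour of monitoring data under two sets of contract terms. Everything here is hypothetical: the `Sample` and `DowntimeDefinition` structures, the thresholds, and the sample data are assumptions for illustration, and the code assumes one measurement per consecutive minute.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Sample:
    """One per-minute measurement (hypothetical monitoring record)."""
    minute: int
    users_affected_pct: float
    p95_latency_ms: float


@dataclass
class DowntimeDefinition:
    """Hypothetical contract terms; every threshold here is an assumption."""
    users_affected_threshold_pct: float
    latency_threshold_ms: float
    excluded_outage_minutes: int = 0  # outages this short or shorter do not count
    maintenance_windows: List[Tuple[int, int]] = field(default_factory=list)


def is_down(sample: Sample, terms: DowntimeDefinition) -> bool:
    """A minute is down if enough users are affected or the service is too slow."""
    if any(start <= sample.minute < end for start, end in terms.maintenance_windows):
        return False  # excluded maintenance window (end minute is exclusive)
    too_many_users = sample.users_affected_pct >= terms.users_affected_threshold_pct
    too_slow = sample.p95_latency_ms >= terms.latency_threshold_ms
    return too_many_users or too_slow


def downtime_minutes(samples: List[Sample], terms: DowntimeDefinition) -> int:
    """Sum consecutive down minutes, skipping runs short enough to be excluded."""
    total, run = 0, 0
    for sample in sorted(samples, key=lambda s: s.minute):
        if is_down(sample, terms):
            run += 1
            continue
        if run > terms.excluded_outage_minutes:
            total += run
        run = 0
    if run > terms.excluded_outage_minutes:
        total += run
    return total


def build_hour() -> List[Sample]:
    """Invented data: a partial outage, a slowdown, and a full outage."""
    samples = []
    for minute in range(60):
        affected, latency = 0.0, 200.0
        if minute < 10:
            affected = 30.0      # 30 percent of users affected for 10 minutes
        elif 20 <= minute < 40:
            latency = 5000.0     # service up but very slow for 20 minutes
        elif minute >= 50:
            affected = 100.0     # full outage for 10 minutes
        samples.append(Sample(minute, affected, latency))
    return samples


if __name__ == "__main__":
    strict = DowntimeDefinition(25.0, 2000.0)
    loose = DowntimeDefinition(50.0, 2000.0, excluded_outage_minutes=15,
                               maintenance_windows=[(50, 60)])
    hour = build_hour()
    print("customer-friendly terms:", downtime_minutes(hour, strict), "minutes")
    print("provider-friendly terms:", downtime_minutes(hour, loose), "minutes")
```

With this invented data, the customer-friendly terms count 40 minutes of downtime and the provider-friendly terms count 20. Both the partial outage and the full outage disappear once a higher user threshold and a broad "maintenance" window are written into the contract.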
3. What about "Acts of God"? Most cloud contracts also exclude force majeure events--those outside the reasonable control of the vendor, such as natural disasters, war, and labor strikes. "An event of force majeure can allow a vendor to get out of commitments, including SLAs," says Schnader's Taylor. "Customers should negotiate a narrow definition of force majeure."
4. How stable is your cloud environment? Shaw of Pace Harmon advises clients to perform technical due diligence on any cloud solution to estimate the risk of major outages. Look at how the cloud is structured. Is there a data center on an earthquake fault line or in a country prone to political instability?
5. What's your disaster recovery plan? "Buyers should really be digging into this," says Slaby of HfS Research. Request site visits and audits to estimate the vendor's achievable recovery time and recovery point, and use those estimates to calculate the impact of a potential failure on your business. "This analysis may simply preclude a cloud solution [if] it is not possible to recover the cloud application sufficiently quickly to avoid a business-jeopardizing event," says Shaw.
6. How often do you test that plan? "Having a disaster recovery and business continuity plan in place does not ensure that downtime will be minimized in the event of a disaster," says Fisher of K&L Gates. "Unfortunately, some providers don't regularly test their plan so they can't be sure it will be effective in the event of a disaster." Smart shoppers will include a contractual requirement for semi-annual disaster recovery testing, compelling the provider to disclose the results to the customer and correct any deficiencies uncovered.
7. What are the best options for deployment? The most recent outage at AWS occurred only at specific facilities on the East Coast. As a result, some customers were left in the dark, while others experienced no effects. "A significant factor in this difference was how the customers had deployed their cloud applications," says Shaw. AWS allows customers to deploy cloud components (processing, storage, databases) in different availability zones and to use elastic load balancers (ELBs) to route traffic. "Customers can even eliminate the single point of failure by deploying multiple ELBs in different availability zones and using domain name system lookups to provide failover," says Shaw.
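The failover idea Shaw describes can be illustrated with a minimal sketch of the selection logic: given one load balancer endpoint per availability zone, return the first one that passes a health check. This is a simulation, not AWS's implementation or API; the zone labels, the `example.com` hostnames, and the `is_healthy` callback are all hypothetical, and a real deployment would more likely rely on a DNS service's own health checks.

```python
from typing import Callable, Dict, List


def resolve_with_failover(
    endpoints_by_zone: Dict[str, str],
    zone_priority: List[str],
    is_healthy: Callable[[str], bool],
) -> str:
    """Return the first healthy endpoint, trying zones in priority order."""
    for zone in zone_priority:
        endpoint = endpoints_by_zone[zone]
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint in any zone")


if __name__ == "__main__":
    # Hypothetical load balancer hostnames, one per availability zone.
    endpoints = {
        "zone-a": "lb-a.example.com",
        "zone-b": "lb-b.example.com",
    }
    # Simulate an outage that takes down everything in zone-a.
    failed = {"lb-a.example.com"}
    chosen = resolve_with_failover(
        endpoints, ["zone-a", "zone-b"], lambda host: host not in failed
    )
    print("traffic routed to:", chosen)
```

The design choice being illustrated is that no single zone, and no single load balancer, sits on the only path to the application. If every zone fails, the sketch raises an error rather than returning a dead endpoint.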
8. If something goes wrong, can I jump to the front of the line? In the event of a disaster, a vendor will be struggling to bring hundreds, if not thousands, of customers back online. "If it's really important to be in the front of the line, you should consider paying extra for preferential treatment," says Fisher.
9. Can I walk away if I'm not satisfied? Customers should insist on a clause giving them the right to terminate without penalty if the provider cannot restore service within a predetermined period of time, regardless of the cause of the downtime, says Slaby. "The absolute worst position to be in is to have a multi-year commitment to pay for a service that is not being delivered," says Shaw.
10. Can I look at your books? Acts of God, software bugs, and heavy traffic aren't the only risks to reliable cloud service. The business itself can also fail. "If the cloud provider goes bankrupt and simply stops providing the service, the SLA gets you nothing," says Shaw. "So financial due diligence and analysis of the business is also advisable."