Back-up power fault to blame for AWS outage

Amazon Web Services apologises for ‘unexpected and unusual’ power redundancy failure

A failure in the back-up power system at AWS facilities in Sydney was to blame for the outage of its service at the weekend.

The company said its power redundancy system ‘didn’t work as designed’, meaning power was lost to a significant number of Elastic Compute Cloud (EC2) instances and Elastic Block Store (EBS) volumes in one of its Sydney Availability Zones.

A number of popular web properties went down during Sunday’s storms, causing chaos for AWS clients, including Foxtel Play, Channel Nine, Presto and Stan, and their customers.

AWS said in a statement that it would “do everything we can to learn from this event and use it to drive improvement across our services”.

The company also said a ‘latent bug’ in its instance management software slowed the recovery of some instances.

Back-up power fault

Severe weather in NSW led to the loss of power at a regional substation, causing a black-out at multiple AWS facilities. A back-up system failure meant power couldn’t be returned to EC2 and EBS instances.

Each facility’s power supply has two back-ups: diesel rotary uninterruptible power supplies (DRUPS) and generators.

A DRUPS uses mains power to spin a flywheel, which stores energy and continues to spin if power is interrupted. This stored energy should keep the data centre powered while the generators fire up.

AWS said an unusually long voltage sag in the utility power supply during the blackout meant breakers in the system failed to open quickly enough. Rather than the DRUPS’ reserve of energy powering the data centre, it drained into the grid. Emptied of energy, the DRUPS shut down, so they weren’t able to connect the generators to the data centre racks.

“DRUPS shutting down this rapidly and in this fashion is unusual and required some inspection,” the company said. “Once our on-site technicians were able to determine it was safe to manually re-engage the power line-ups, power was restored.”

Latent bug

After power returned to the facility, AWS’s automated systems were able to bring more than 80 per cent of impacted customer instances and volumes back online.

However, the company said a bug in its instance management software “led to a slower than expected recovery of the remaining instances”. These had to be recovered manually.

A small number of storage servers suffered failed hard drives during the power event, leading to the loss of some data. EBS keeps two replicas of each volume on separate storage servers, so a volume can be restored automatically as long as one replica survives.

“In cases where both of the replicas were hosted on failed servers, we were unable to automatically restore the volume,” the company said.

“After the initial wave of automated recovery, the EBS team focused on manually recovering as many damaged storage servers as possible. This is a slow process, which is why some volumes took much longer to return to service.”
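For affected customers, the practical question during an event like this is which of their volumes are still impaired. As a rough illustration only (not AWS’s own tooling), the sketch below assumes boto3 with credentials for the Sydney region and polls EC2’s volume-status API for anything not reporting ‘ok’:

    import boto3

    # Sketch: list EBS volumes in ap-southeast-2 whose status is not "ok"
    # (e.g. "impaired" or "insufficient-data"). Assumes boto3 and valid credentials.
    ec2 = boto3.client("ec2", region_name="ap-southeast-2")

    paginator = ec2.get_paginator("describe_volume_status")
    for page in paginator.paginate():
        for status in page["VolumeStatuses"]:
            state = status["VolumeStatus"]["Status"]
            if state != "ok":
                print(f"{status['VolumeId']}: {state} ({status['AvailabilityZone']})")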

AWS promised to improve the power redundancy system with the addition of extra breakers and to fix the latent bug. The fix, it said, was being tested and was due to be deployed this week.

It recommended customers run their applications across multiple Availability Zones in the region.
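In practice that can be as simple as pinning copies of the same workload to subnets in different zones. A minimal sketch, assuming a boto3-based deployment script in which the AMI and subnet IDs are placeholders, launches one instance in each of two Sydney Availability Zones:

    import boto3

    # Sketch: launch identical instances in two Availability Zones of
    # ap-southeast-2 (Sydney). The AMI and subnet IDs below are placeholders.
    ec2 = boto3.resource("ec2", region_name="ap-southeast-2")

    subnets_by_az = {
        "ap-southeast-2a": "subnet-aaaa1111",  # placeholder subnet in zone 2a
        "ap-southeast-2b": "subnet-bbbb2222",  # placeholder subnet in zone 2b
    }

    for az, subnet_id in subnets_by_az.items():
        instance = ec2.create_instances(
            ImageId="ami-0123456789abcdef0",   # placeholder AMI
            InstanceType="t2.micro",
            MinCount=1,
            MaxCount=1,
            SubnetId=subnet_id,                # ties the instance to that zone's subnet
        )[0]
        print(f"Launched {instance.id} in {az}")

Fronted by a load balancer that spans both zones, a deployment along these lines keeps serving traffic if a single zone loses power, which is the failure mode AWS described here.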

“We apologise for any inconvenience this event caused,” AWS said. “We know how critical our services are to our customers’ businesses. We are never satisfied with operational performance that is anything less than perfect.”
