Amazon explains its cloud storage outage

updated 11:25 am EDT, Fri April 29, 2011

Issues credit to customers, promises changes

Amazon has offered an apology and a detailed explanation for the disruption of its cloud storage service last week. During an operation to increase network capacity, traffic to the Elastic Block Store (EBS) system was incorrectly rerouted to a slower secondary backup network. The service outage affected a number of websites, including Foursquare and Reddit.

Normally, data is synchronized across several nodes at once. If a node detects that its partner isn't responding, it assumes a failure and asks the server to spawn another. This process is automatic and usually happens so quickly that human intervention isn't needed. However, the shift to the slower network overwhelmed the peer-to-peer synchronization system, resulting in a massive bottleneck beginning at 12:47 AM PDT on April 21.
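To illustrate why a slower network can trigger this kind of pile-up, here is a minimal Python sketch of the failure detection described above. It is a toy model, not Amazon's implementation: the `Node` class, the latency figures, and the 100 ms timeout are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    partner_latency_ms: float  # hypothetical round-trip time to the replica partner


def needs_remirror(node: Node, timeout_ms: float) -> bool:
    """A node treats a partner that does not answer within the timeout as failed."""
    return node.partner_latency_ms > timeout_ms


def remirror_requests(nodes, timeout_ms):
    """Return the names of nodes that would ask for a new replica."""
    return [n.name for n in nodes if needs_remirror(n, timeout_ms)]


TIMEOUT_MS = 100.0  # assumed value, for illustration only

# Primary network: only one partner is genuinely unresponsive.
primary = [Node("a", 5), Node("b", 8), Node("c", 500)]
# Slower secondary network: every partner looks unresponsive at once.
secondary = [Node("a", 250), Node("b", 400), Node("c", 500)]

print(remirror_requests(primary, TIMEOUT_MS))    # ['c']
print(remirror_requests(secondary, TIMEOUT_MS))  # ['a', 'b', 'c']
```

In this sketch, a single real failure produces a single request for a new replica, while a network-wide slowdown makes every node ask for one at the same moment, which is the sort of simultaneous demand that can exhaust spare capacity.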

Engineers quickly diagnosed the nature of the problem, but restoring the balance of network traffic manually is a delicate and involved task. Within two hours, Amazon engineers were able to tamp down the network traffic without affecting other functions, and by 11:30 AM had resolved the problem for all but 13 percent of the EBS volumes. Finding extra capacity to fully restore service required physically moving new servers into the affected data storage clusters. That operation didn't begin until 2:00 AM the following day, April 22.

Rebalancing the network traffic manually took most of the next two days, but by 6:50 PM on April 23, normal system operation was finally restored. However, 2.2 percent of the "stuck" volumes would have to be recovered manually. Eventually, all but 0.07 percent of the data was recovered.

The company is reviewing its procedures for making changes to its network, stating that it will "increase automation" to prevent similar human errors in the future. Amazon is also promising a list of other improvements, such as having more capacity available for recovery.

Amazon also issued an apology to its EC2 customers, stating:

Last, but certainly not least, we want to apologize. We know how critical our services are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes.

Customers are getting a 10-day credit, whether their service was affected or not. [via All Things Digital]

by MacNN Staff


