updated 11:25 am EDT, Fri April 29, 2011
Issues credit to customers, promises changes
Amazon has offered an apology and detailed explanation for the disruption of its cloud storage service last week. During an operation to increase network capacity, traffic to the Elastic Block Store (EBS) system was incorrectly rerouted to a slower secondary backup network. The service outage affected a number of websites, including Foursquare and Reddit.
Normally, data is kept synchronized across several nodes. If a node detects that its partner isn't responding, it assumes a failure and asks the system for space to create a new copy. This process is automatic and usually happens so quickly that human intervention isn't needed. However, the shift to the slower network overwhelmed the peer-to-peer synchronization system, resulting in a massive bottleneck beginning at 12:47 AM PDT on April 21.
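The failure mode described above can be illustrated with a toy sketch. This is not Amazon's implementation: the `Volume` class, the `remirror` function, and the notion of `free_slots` are hypothetical names chosen for illustration, and the model assumes only that each volume needing a new copy consumes one unit of spare capacity.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    replica_reachable: bool  # can this volume still reach its partner copy?


def remirror(volumes, free_slots):
    """Give each volume that lost its partner a new copy while spare capacity lasts.

    Returns two lists of volume names: those that found space for a new
    copy, and those left waiting ("stuck") once spare capacity ran out.
    """
    remirrored, stuck = [], []
    for vol in volumes:
        if vol.replica_reachable:
            continue  # healthy volume, nothing to do
        if free_slots > 0:
            free_slots -= 1
            remirrored.append(vol.name)
        else:
            stuck.append(vol.name)
    return remirrored, stuck


# Isolated failure: one volume loses its partner and is re-mirrored at once.
healthy_day = [Volume("vol-0", False), Volume("vol-1", True), Volume("vol-2", True)]
ok_remirrored, ok_stuck = remirror(healthy_day, free_slots=2)

# Network-wide event: every volume loses its partner at the same moment,
# so demand for spare capacity exceeds supply and most volumes are stuck.
outage = [Volume(f"vol-{i}", False) for i in range(5)]
bad_remirrored, bad_stuck = remirror(outage, free_slots=2)
```

In the isolated case the spare capacity easily absorbs the single failure. In the second case, where all volumes look for a new partner simultaneously, the spare capacity is exhausted almost immediately and the remainder cannot proceed, which is the kind of bottleneck the article describes.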
Engineers quickly diagnosed the nature of the problem, but restoring the balance of network traffic manually is a delicate and involved task. Within two hours, Amazon engineers were able to tamp down the network traffic without affecting other functions, and by 11:30 AM had resolved the problem for all but 13 percent of the EBS volumes. Finding extra capacity to fully restore service required physically moving new servers into the affected data storage clusters. That operation didn't begin until 2:00 AM the following day, April 22.
Rebalancing the network traffic manually took most of the next two days, but by 6:50 PM on April 23, normal system operation was finally restored. However, 2.2 percent of the "stuck" volumes would have to be recovered manually. Eventually, all but 0.07 percent of the data was recovered.
The company is reviewing its procedures for making changes to its network, stating that it will "increase automation" to prevent similar human errors in the future. Amazon is also promising a list of other improvements, such as having more capacity available for recovery.
Amazon also issued an apology to its EC2 customers, stating:
Last, but certainly not least, we want to apologize. We know how critical our services are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes.
Customers are getting a 10-day credit, whether their service was affected or not. [via All Things Digital]