Amazon Web Services has improved the way it responds to service issues, and users are responding positively. AWS suffered a no-fooling-around stroke of bad luck on April 1, with a three-hour partial outage at its Northern Virginia data center.
Access to the application programming interface (API) that lets users communicate with and control their AWS services went down for three hours in the early morning. In a move hailed by users, AWS posted a lengthy, frank and very detailed explanation of the problem, including missteps the company made in initially diagnosing the issue.
"While our deployment safeguards should have prevented this issue entirely, it also took our team too long to diagnose the root cause and recover. This issue should have been significantly easier for our technical teams to understand and resolve," the AWS blog statement read in part.
The AWS status monitoring page explained that a software upgrade to the control layer of Amazon's EC2 service was not properly tested and went into production, subsequently causing the issue.
Amazon said virtual machines already running weren't affected, but users were not able to control their environments during the blackout and were unable to turn servers on or off. Amazon touts fast scalability and elasticity as a primary virtue, and many of the Web-centric businesses that use it count on that feature to handle spikes in traffic.
Amazon's outage aftermath
Overall, the three-hour hiccup probably didn't have too much fallout in business terms for AWS users. However, users are calling Amazon's response spot-on and a welcome change from past silences on service problems. Developer and consultant Mitch Garnaat specializes in AWS and called the company's handling of the issue a relief.
"This response hit the mark for me," he said in an email.
Garnaat said he expects periodic updates during a problem, a prompt and detailed post-mortem examination and a message about how similar problems can be avoided.
Garnaat said that some of Amazon's past communication problems around AWS arise from its history as a retailer first and IT services provider second. He said that Amazon was, like all retailers, justifiably paranoid about giving out business information and protective of customer data, such as credit card details. In the retail world, consumers don't want to know the process; they just want to get busy.
"On the retail side, the only people this level of information would help would be competitors, so why bother?" he said.
Garnaat said that was a very different place to start from than traditional IT providers, which have learned that more disclosure is better than less.
"AWS is a different set of customers with a different set of requirements. I think AWS is trying hard to respond appropriately," he added.
"It's exactly what I'd want in an event write-up," added John Kinsella, founder of Protected Industries. Kinsella said he needs to know as much as possible about anything that goes wrong because he, in turn, is on the hot seat when services don't work. The more he knows, the better able he is to handle unhappy end users.
"Information reduction is easy -- it's when a vendor doesn't give me enough information and people start second-guessing that problems arise," he said. Kinsella was hit by an outage last week in Terremark's vCloud Express service (in beta testing) that left him in the dark for eight hours.
Kinsella said he uses EC2 and AWS services frequently but was not affected by this outage.
Carl Brooks is the Technology Writer for SearchCloudComputing.com. Contact him at email@example.com.