Now that Moonlight (not La La Land, despite the botched on-stage announcement) has won the Academy Award for best picture, this is as good a time as any to look back at some screw-ups in the world of cloud computing. May we all learn from our mistakes.
The Force is not with you: Take a trip back to May 9, 2016, less than a year ago. It was on that day that the Silicon Valley NA14 instance of Salesforce.com went offline, a condition colloquially known as Total Inability To Support Usual Performance (I’m not going anywhere near the acronym). Customers lost several hours of data, and the outage dragged on for nearly 24 hours. CEO Marc Benioff took to his Twitter account to ask for forgiveness. Shortly after, Salesforce moved some of its workloads to Amazon Web Services.
AWS giveth, AWS taketh away: Though transferring workloads to AWS helped Salesforce recover lost customer confidence (though not lost data), the opposite was true for Netflix. On Christmas Eve 2012, at a time when kids might be watching back-to-back-to-back showings of A Christmas Story, problems with AWS’s Elastic Load Balancing service caused Netflix to go down. This Grinch stole Christmas not just from little Cindy Lou Who, but from millions of paying subscribers waiting to see if Ralphie gets his dreamed-about Red Ryder BB rifle. Lessons were learned. Two years later, during a massive AWS EC2 update, Netflix rebooted 218 of its 2,700 production nodes. Alarmingly, 22 failed to reboot, but the Netflix service never went offline. At the opposite end of the spectrum, Dropbox went old school in March 2016, dumping AWS and moving its entire operation onto its own newly built, enormous infrastructure.
Those darn updates’ll getcha every time: Amid verdant woodlands, beneath pure azure skies, protected by mountains, our cloud service lies. That bucolic portrait of the Pacific Northwest (or New Hampshire, perhaps) mattered little to Microsoft on Nov. 18, 2014, when the Azure Storage Service suffered a widespread outage traced back to the tiered rollout of software updates intended to improve performance. “We discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected…” was the blogged explanation. Another major outage occurred in Dec. 2015.
Eat in, Dyn out: The Oct. 21, 2016 wave of coordinated distributed denial-of-service attacks targeting Domain Name System provider Dyn affected dozens of high-profile businesses to varying degrees, including Airbnb, Twitter, Amazon, Ancestry, Netflix, PayPal, and a long list of others. Dyn’s own detailed post-mortem of the attack makes for fascinating reading. If you think it’s impossible for millions of geographically far-flung, seemingly unrelated IoT devices to attack in a coordinated manner, think again.
You’ve heard of Office 360? Sure you have. The name is favored among cynics who joke that Microsoft’s cloud-based productivity software should be called that because it is offline five days out of every year. Office 365’s e-mail service was down for many users for about 12 hours on June 30, 2016. That follows other outages in various geographies on Dec. 3, 2015; Dec. 18, 2015; Jan. 18, 2016; and Feb. 22, 2016.
Got healthcare? We all know the stories about how healthcare.gov kept crashing due to poor design, inadequate compute resources, demand that vastly exceeded expectations, and so on. Enough said.
What’s that one cloud disaster story you’ve been dying to share? Now’s your chance. Tell us all about it; we’d like to hear from you.