Rackspace followed Amazon’s lead, albeit giving customers shorter notice in some cases, by rebooting its cloud systems due to an apparent Xen hypervisor bug.
Rackspace notified customers via email late last week of the need to do "infrastructure maintenance" on its global public cloud system, saying it would spend three days shutting down and fixing the problem region by region.
At Sprout Social, a Chicago-based Rackspace customer with a complicated infrastructure and 166 affected servers, the initial reactions were shock, concern and frustration, CTO Aaron Rankin said. Sprout Social, which makes social media tools for businesses, works to provide customers 100% uptime, and there was little that could be done to ensure it could maintain that during the Rackspace fix.
"No matter how prepared you are, having [servers] all rebooted at a random time in a concentrated window is asking for things to break," Rankin said.
Rackspace should have acted faster and given more advance notice, Rankin added. The advisory email arrived at midnight his time and said that Sprout Social’s cloud environment would be rebooted in 18 hours.
"I scrambled our team to work through the weekend to prepare," Rankin said. "Had we known a couple days earlier, we could have dealt with it more calmly."
"Rackspace has identified a bug in the software for our Cloud Servers (Standard, Performance 1 and Performance 2) and we are running a global maintenance, in which reboots are occurring," a company statement said. "All reboots are expected to be completed in the next 72 hours."
Company officials were not made available for comment.
Customers were advised to verify that all necessary services are configured to start on server boot, ensure critical data is backed up and confirm any unsaved changes -- including firewall rules and application configurations -- are saved.
Customers can go to the company’s status page for the most current state of the system.
Measuring the Rackspace reboot aftermath
It's unclear what percentage of Rackspace customers the bug would affect, as the email said it could have affected a "portion of the Public Cloud environment." Amazon said its reboot would affect only 10% of customers.
Sprout Social had 20 minutes of full downtime. Certain features were directly affected as the corresponding infrastructure was taken down, leaving them fully unavailable, partially unavailable or running with degraded performance. Despite the impact, the incident hasn't given Rankin pause about staying with Rackspace.
"It's just one of those things," Rankin said. "If we ran our own infrastructure we would also want to apply the same patches. We'd just have control over when."
It's possible Rackspace first tried to patch the problem internally and treated the reboot as a worst-case scenario, according to Jillian Mirandi, an analyst with Technology Business Research, Inc., based in Hampton, N.H. And while these situations are always difficult to navigate, the short notice could have been handled better, she added.
It's been a rough couple of months for Rackspace, which took some hits on Wall Street and was at the center of acquisition rumors throughout the year. But if this latest news is at all detrimental, it's more likely to hit public cloud computing in general, including the likes of Amazon, Google and, to a lesser degree, Microsoft, Mirandi said. It also could provide an opening for companies like IBM and HP that are pitching hybrid IT as a way to avoid such issues by bursting to another cloud or temporarily migrating to a different data center.
"I don't think it's really going to hurt Rackspace any more than the rumors about being sold did," Mirandi said. "It's the uncertainty that hurt."
Update: On Oct. 1, Rackspace CEO Taylor Rhodes apologized for a "few dropped balls" in communicating with customers and executing the Xen reboot. The issue has since been resolved and there were no reports of compromised data, Rhodes wrote in an email to customers.
Rackspace couldn't provide specifics on the security vulnerability until it was fixed, so as not to "ring a dinner bell for the world's cyber criminals," Rhodes wrote. The bug could have allowed individuals to follow a series of memory commands to read customers' data or crash the host server.
The reboot affected approximately 50,000 Rackspace customers. Rhodes acknowledged the company made mistakes and that steps are being taken to correct them going forward.
"Some of our reboots, for example, took much longer than they should," Rhodes wrote. "And some of our notifications were not as clear as they should have been."
Trevor Jones is the news writer for SearchCloudComputing. You can reach him at firstname.lastname@example.org.