Public cloud reboots came back into focus this month when a Xen vulnerability resulted in some providers and customers working overtime on the hypervisor for the second time since September.
Cloud providers had varying degrees of success in avoiding downtime during the Xen reboots, but it's clear that tolerance for service interruptions is waning as the always-on view of cloud is pushing platforms and customers to improve.
"That's going to be the price of entry now," said David Linthicum, senior vice president of Cloud Technology Partners, a cloud consulting firm in Boston. "People gave the reboot thing one pass, but live migration is really the standard. And if [cloud providers] can't support that and are rebooting every year, people are going to get pretty frustrated pretty quickly."
Xen reboot postmortem still fuzzy
Industry leader Amazon garnered the most attention for its handling of the open-source hypervisor vulnerability. Last fall, its Xen reboots impacted around 10% of Amazon Web Services (AWS) customers, but this time the company managed to address the problem while affecting only 0.1% of EC2 customers.
Despite Amazon's dramatic improvement between the Xen reboots, the results are not as straightforward for other cloud vendors dependent on the hypervisor. Linode LLC, which was mostly unaffected by the vulnerability last fall, said a "large percentage" of its fleet was impacted this time. Rackspace, which had roughly a quarter of its 200,000 customers impacted by the last reboots, did not disclose figures for this latest effort.
However, Rackspace said it made a number of changes from the reboot last fall. Most notably, customers were provided with two-hour reboot windows per region and per hypervisor -- a change that was put into place after hearing directly from customers about the previous reboot windows of eight hours per data center.
But the effort was more than just an exercise in hypervisor patches, because the lessons about working at scale on a compressed timeline can be applied in other areas, said Paul Voccio, senior director of product software engineering at Rackspace.
"How do we incorporate this into other products and make sure it's a repeatable process and fully supported?" Voccio said. "We learned a lot from this event."
Patches aren't one-size-fits-all
Each cloud platform is different enough to rely on its own methods and preferences for patches and reboots, but the best fix is the one the customers don't notice, said Carl Brooks, an analyst with 451 Research LLC in New York.
"In theory, they should do it seamlessly and no one should notice and live migration should be a thing, but sometimes you do simply have to reboot what your infrastructure is on," Brooks said.
"It's one of those things where it's a fun party trick, but its total value to the infrastructure as a service experience is not fully justified," Brooks said.
The uses for live migration are generally limited to applications that have more legacy architecture, said James Staten, an analyst* with Forrester Research, Inc., based in Cambridge, Mass. But someone using that type of application likely wouldn't use it without a managed service contract anyway.
Amazon hasn't revealed how it sidestepped a major reboot, but it's believed the company used hot patching. It's a practice that works in this scenario, but wouldn't necessarily be right in other situations, Voccio said. Rackspace is exploring hot patching and used live migration this month, but different vulnerabilities will require different tools to address them, he added.
"One thing isn't the end all be all," Voccio said. "There's going to be a combination of tools each time."
Communication above all else
Reboots have been part of cloud computing since its inception. In the early days of AWS, Amazon rebooted its unmanaged VMs without advance notice, and Microsoft Azure, at one point, had scheduled weekly rolling reboots. As recently as this year, Verizon made a design decision during the implementation of its cloud that required a fix while in production.
The Verizon reboot, which happened over a weekend in January, caused more ire over how it was communicated than how it was actually implemented, Brooks said.
"You have to be transparent with customers about what's going on," Brooks said. "It's all part of the tradeoff if you're going to automate stuff the IT guys are used to controlling themselves. You need to explain that in sufficient detail so that they feel comfortable."
IBM, which didn't provide estimates on affected customers for either vulnerability, notified customers of planned patches and reboots to a portion of servers that host portal-provisioned virtual server instances before March 10. The work was set for staggered windows across data centers to support customers with multi-site deployments, and affected customers would receive specific times and dates once the schedule was set, the notice read.
Communication is a key reason why Munzee, a scavenger hunt game based in McKinney, Texas, uses Linode, said Scott Foster, Munzee vice president of technology. Munzee users were given advance notice about when the work would take place, and the internal team worked overtime to prepare for the two-hour window during which Linode said it might work on their servers. The work took a total of 20 minutes and the website was able to stay live, albeit somewhat slower, because Munzee has two web servers.
"It's one of those things where if they have to go down for a reason, there's nothing they can do," Foster said. "But the pros [of cloud computing] outweigh the cons immensely."
Foster saw reports of Amazon being able to maintain uptime while working on Xen and said that if Linode was doing this all the time, he would consider some of the larger cloud vendors. But, as is, he prefers the managed services.
"It comes with the territory," Foster said. "It's computers, and sometimes stuff happens."
Cloud users step up their game, too
Cloud users have different views when it comes to reboots. Those using managed hosting platforms such as Rackspace or Datapipe are often more inclined to let the vendor do the heavy lifting. However, developer-centric IT shops may want greater control of clusters and restarts, out of fear that a standard reboot procedure could break their applications, Staten said.
"Whether it's the hypervisor or underlying operating system, customers always have to be prepared for some sort of patch that has to be applied that would require a restart," Staten said. "If they're smart about their deployments, they're always running redundant instances anyway."
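Staten's point about redundant instances can be illustrated with a minimal sketch. It assumes a customer knows which zones its instances run in and has received a provider notice listing a reboot window for each zone. The zone names, dates and the `survives_schedule` helper are hypothetical and do not reflect any provider's API; only the two-hour window length echoes the notices described in this story.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Window(NamedTuple):
    zone: str          # hypothetical zone or data center label
    start: datetime    # window opens (inclusive)
    end: datetime      # window closes (exclusive)


def zones_down_at(windows, moment):
    """Zones whose reboot window covers the given moment."""
    return {w.zone for w in windows if w.start <= moment < w.end}


def survives_schedule(instance_zones, windows):
    """True if at least one deployed zone is outside every window at all times.

    The set of zones under maintenance is largest at the moment some window
    opens, so checking each window's start time is sufficient.
    """
    deployed = set(instance_zones)
    if not deployed:
        return False
    for w in windows:
        if deployed <= zones_down_at(windows, w.start):
            return False
    return True


# Hypothetical schedule: two-hour windows, staggered versus overlapping.
base = datetime(2015, 3, 7, 0, 0)
two_hours = timedelta(hours=2)
staggered = [
    Window("zone-a", base, base + two_hours),
    Window("zone-b", base + two_hours, base + 2 * two_hours),
]
overlapping = [
    Window("zone-a", base, base + two_hours),
    Window("zone-b", base + timedelta(hours=1), base + timedelta(hours=3)),
]
```

Under these assumptions, a deployment spread across both zones passes the check against the staggered schedule but fails against the overlapping one, and a single-zone deployment fails either way. A worst-case check like this treats a zone as fully down for its entire window, even though the actual work may take far less time.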
Without explicit support agreements, the best plan for IT shops is to protect themselves, Brooks said.
"Back up your own stuff," Brooks said. "Have a plan, practice it frequently; you are on your own, fundamentally."
How prepared customers are for downtime depends on how they utilize cloud, said Tamara Budec, vice president of portfolio operations for Digital Realty, a San Francisco-based colocation and data center provider.
Companies that look at cloud as a means to simply reduce costs often have to learn the hard way about the need to build resiliency for any applications that have obvious impacts on operations, Budec said. Companies with more sophisticated strategies plan for the inevitability of downtime, she added.
"It's becoming more apparent, but there are always those new entrants to the cloud environment who are looking for bottom-line savings," Budec said.
*After being interviewed for this story, Staten took a job as chief strategist for Microsoft's cloud and enterprise group.
Trevor Jones is the news writer for SearchCloudComputing. You can reach him at firstname.lastname@example.org.