
Google Compute Engine experiences global cloud outage

An apparent network connection failure led to a two-hour, cross-region cloud outage for Google Compute Engine customers this week.

Google this week became the latest public cloud vendor to experience a worldwide cloud outage.

Google Compute Engine (GCE), the company’s infrastructure-as-a-service offering, went down on Wednesday just before 11 p.m. PT across all its zones, Google’s term for what is more commonly referred to in cloud computing as regions. The system came back online two hours later. The outage came three months to the day after Microsoft Azure's five-hour global cloud outage.

From what’s been made publicly available, it was a "substantial outage," according to Jason Read, founder of CloudHarmony, a Laguna Beach, Calif.-based company that conducts independent, third-party monitoring of cloud vendor uptime.

"It’s the worst type of outage because often you design for cloud failure with failover and load balancing between regions," Read said. "If all regions go down that negates the ability to get through this type of outage."

The only way to get through this type of scenario is to use a multi-cloud deployment that allows a customer to fail over to another vendor’s platform, Read said.
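Read's point can be illustrated with a minimal sketch, assuming a client that walks a priority-ordered list of endpoints and takes the first one whose health check passes. The provider names, region names and the `pick_endpoint` helper are hypothetical and do not reflect any vendor's API.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each endpoint is (provider, region, health_check). All names are hypothetical.
Endpoint = Tuple[str, str, Callable[[], bool]]


def pick_endpoint(endpoints: Sequence[Endpoint]) -> Optional[Tuple[str, str]]:
    """Return the first (provider, region) whose health check passes, in priority order."""
    for provider, region, is_healthy in endpoints:
        try:
            if is_healthy():
                return provider, region
        except Exception:
            # A failing or unreachable health check counts as unhealthy.
            continue
    return None


def always(value: bool) -> Callable[[], bool]:
    """Build a stub health check that always reports the given state."""
    return lambda: value


# Simulate a provider-wide outage: every region of provider "A" is down.
single_cloud = [("A", "region-1", always(False)), ("A", "region-2", always(False))]
multi_cloud = single_cloud + [("B", "region-1", always(True))]

print(pick_endpoint(single_cloud))  # None
print(pick_endpoint(multi_cloud))   # ('B', 'region-1')
```

With only one provider in the list, cross-region failover has nowhere to go once every region is unhealthy. Adding a second provider gives the same logic a working target.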

The cloud outage was apparently tied to a problem with network connectivity, Google said on its status page once service had been restored.

"We are sorry for any issues this may have caused to you or your users and thank you for your patience and continued support," Google's post read. "Please rest assured that system reliability is a top priority at Google, and we are constantly working to improve the reliability of our systems."

Workiva, an Ames, Iowa-based Google cloud customer and financial reporting software developer, wasn't affected by the outage, said Dave Tucker, senior director of platform development. The lack of disruption could be because Workiva’s user traffic doesn’t typically go directly to GCE or because its services on GCE are not yet heavily used, especially at off hours, but the company still took notice.

"Issues like this are always a concern for a company like ours," Tucker said. "This is part of the reason we are focusing on containers for our deployment strategy and working across both Google and Amazon so that we can fail over to another environment if necessary."

Google plans to provide a detailed analysis of the incident following its internal investigation.

A company official did not immediately respond to a request for comment.*

Cloud outages are more common among less-mature platforms, and Google deserves credit for standing by its promise to provide updates on its status page every half hour, Read said. He wasn’t aware of any instances of global Elastic Compute Cloud outages for industry leader Amazon Web Services, but the company went through its share of problems as it matured, too, he added.

"Google Compute Engine is a much newer service," Read said. "I’m sure they’re learning and improving on whatever caused this particular outage so it won’t occur again and they’ll probably provide a pretty good postmortem."

If anything, Google was lucky that the outage occurred in the middle of the night, unlike Azure’s, which occurred in the middle of the day and generated a considerable amount of negative PR for Microsoft, Read said.

*Update: The total time of external traffic loss for network connectivity to instances on Google Compute Engine was two hours and 40 minutes, while the peak impact lasted 40 minutes and affected 70% of all instances, Google said on its status page late Thursday. The internal software system for GCE's virtual network for VM egress traffic stopped issuing updated routing information, and traffic was dropped as cached routes expired. The root cause of the outage remains under investigation, Google said.

Engineers forced a reload to fix the problem roughly an hour after it was first identified and before all entries expired, Google said. The expiration lifespan of routing entries has been extended from hours to weeks in case a similar problem occurs again, and more in-depth changes to the system are expected in the coming days.
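The failure mode Google describes, cached routes expiring while the system that refreshes them is stalled, can be pictured with a toy model. The sketch below is illustrative only and is not Google's implementation: the `RouteCache` class, the evenly staggered refresh times, and the 4-hour and two-week (336-hour) lifetimes are all assumptions, chosen only to contrast "hours" with "weeks."

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RouteCache:
    """Toy cache of routes, each valid for `ttl` hours after its last refresh."""
    ttl: float
    refreshed_at: Dict[str, float] = field(default_factory=dict)

    def refresh(self, destination: str, now: float) -> None:
        self.refreshed_at[destination] = now

    def can_forward(self, destination: str, now: float) -> bool:
        last = self.refreshed_at.get(destination)
        return last is not None and now - last < self.ttl


def reachable_fraction(ttl_hours: float, stall_hours: float, instances: int = 10) -> float:
    """Fraction of destinations still routable `stall_hours` after refreshes stop."""
    cache = RouteCache(ttl=ttl_hours)
    # Assumption: the last refreshes were spread evenly over one TTL window
    # before the controller stalled at time 0.
    for i in range(instances):
        cache.refresh(f"vm-{i}", now=-ttl_hours * i / instances)
    alive = sum(cache.can_forward(f"vm-{i}", now=stall_hours) for i in range(instances))
    return alive / instances


print(reachable_fraction(ttl_hours=4, stall_hours=1))    # 0.8
print(reachable_fraction(ttl_hours=4, stall_hours=4))    # 0.0
print(reachable_fraction(ttl_hours=336, stall_hours=1))  # 1.0
```

In this model a short lifetime lets routes lapse steadily once refreshes stop, while a much longer lifetime leaves every route valid during a stall of the same length. The trade-off, which the sketch does not model, is that longer-lived entries can also keep stale routes in use for longer.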

A company official declined to comment further.

Trevor Jones is the news writer for SearchCloudComputing.
