Google broke the mold on how to build and run data centers, as well as how to orchestrate the virtual machines that run the public cloud. And as the search giant combines megascale in the cloud with huge buying power, it continues to break new ground. While running your cloud environment like Google may seem out of reach, there is a trickle-down effect that brings benefits to enterprises. To understand the opportunities, it's necessary to look at how Google operates and how enterprises can emulate that.
Google cloud teardown
Before you can even begin to build your cloud like Google, you need to know all the components that go into its infrastructure. That includes servers, storage and orchestration tools. It's also important to know how to deal with server maintenance and repair.
Google specifies its own servers, working closely with its original design manufacturers (ODMs) in China to manufacture the units. In fact, all the large cloud service providers buy from these ODMs. Fundamentally, these are x64 designs and don't deviate much from the reference designs that Intel creates. They differ in the structures built around them.
Google is tight-lipped about its designs, but servers are mounted two or four to a 1U space, with two CPUs per server. Although this means nonstandard rack depths, Google's scale makes that affordable. The cloud giant is also silent on the configurations of its search engines, which may use GPUs or FPGA assist logic to speed up searches.
Storage and networking
Clouds use one of two storage models: networked storage that allows servers to be stateless, or locally attached drives that maximize performance. In a general-purpose cloud, data is stored on networked storage. For Google's search cloud, a distributed storage approach that puts disks in each server makes sense.
Automation and orchestration
The automation software used to manage the cloud is the secret sauce of Google's cloud operation. The scale of the company's data centers is too large for manual management operations. Google's orchestration suite is mainly homegrown. Building one from scratch isn't a serious option for most enterprises, nor is it necessary, because builders can choose from a variety of commercial and open source tools. Google also offers several tools, such as autoscaling and open source options, to help enterprises manage the scale of a cloud environment.
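To make the autoscaling idea concrete, here is a minimal sketch of a proportional scaling rule: size the group so that average utilization lands near a target. This is a generic illustration, not Google's implementation. The function name, parameters and limits are all hypothetical.

```python
def desired_instances(current: int, observed_pct: int, target_pct: int,
                      min_instances: int = 1, max_instances: int = 100) -> int:
    """Return the instance count that would bring average utilization near target.

    Utilization is given in whole percent to keep the arithmetic exact.
    All names and defaults here are illustrative assumptions.
    """
    if current < 1 or target_pct <= 0 or observed_pct < 0:
        raise ValueError("current >= 1, target_pct > 0 and observed_pct >= 0 required")
    # Ceiling division: round up so the group is never undersized.
    wanted = (current * observed_pct + target_pct - 1) // target_pct
    # Clamp to the configured floor and ceiling.
    return max(min_instances, min(max_instances, wanted))


# Ten instances running at 90% against a 60% target should grow the group.
print(desired_instances(10, 90, 60))
```

Rounding up and clamping are deliberate choices: the first avoids undersizing the group, and the second keeps a bad metric reading from scaling the group to zero or without bound.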
Server maintenance and other considerations
Server repair takes on a totally different approach in mega-clouds like Google versus enterprise cloud environments. Google studied failure rates over product lifecycles, especially for its own services and offerings. Not repairing a failed unit is tenable because it typically has a three- to four-year planned lifecycle.
With preconfigured and pretested modular data centers based on containers, Google's approach is wheel in, turn on and don't repair. Early-life failure rates are low, warranties can be converted into spare units, and orchestration automatically moves instances off failed units. Together, these mean most failed units can simply be powered down and left sitting. If IT teams do choose to refurbish servers, the work can be batched and done all at one time.
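The capacity side of a no-repair policy can be estimated with a few lines of arithmetic. The sketch below assumes a constant, independent annual failure rate, which is a simplification, and the 5% rate and four-year lifecycle in the example are hypothetical figures, not Google's data.

```python
import math


def surviving_fraction(annual_failure_rate: float, years: int) -> float:
    """Fraction of servers still running after `years` with no repairs.

    Assumes a constant annual failure rate and independent failures.
    """
    if not 0.0 <= annual_failure_rate < 1.0 or years < 0:
        raise ValueError("rate must be in [0, 1) and years must be >= 0")
    return (1.0 - annual_failure_rate) ** years


def servers_to_provision(required: int, annual_failure_rate: float, years: int) -> int:
    """Servers to install on day one so `required` are still alive at end of life."""
    return math.ceil(required / surviving_fraction(annual_failure_rate, years))


# Hypothetical inputs: 1,000 servers needed at year four, 5% annual failure rate.
print(surviving_fraction(0.05, 4))
print(servers_to_provision(1000, 0.05, 4))
```

The gap between the required count and the provisioned count is the spare capacity that stands in for a repair operation.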
Admins should still repair switch failures. The number of units a failed switch takes offline is too high to ignore, and running servers with some of their connections down causes significant loss. Failed disk drives are likely not replaced until a refurb cycle or end of life.
Google also handles cabling differently from enterprise IT. Backbone connections to the containers (or to racks, in older data centers) are designed for long-term use and are always fiber. The distance across the data center makes recabling a major expense and a downtime risk. Containers have a four-year lifecycle, and connections from racks to the end-of-container routers change with each container.
How to build the enterprise version of Google's cloud
Google is blessed with enormous scale on everything it does. It has buying power and the ability to drive supplier design. But it's not all perfect. The speed with which problems surface would make most enterprise IT pros' heads spin. The upside is that the same scale flushes out latent design problems early in a product's lifecycle, before they turn into major problems.
Still, most enterprises can emulate some part of the Google cloud environment. Many of Google's methods carry over into smaller data centers and provide a basis for cost saving and operational efficiency. One method is to buy hardware pre-integrated to the rack level. These modular configurations allow nonstandard packaging and rack-level power and cooling, introducing lower-cost servers and more efficient power use.
In addition, ODMs are moving to sell units in the U.S., which should drive down server prices. These products range from traditional servers to stripped-down systems akin to those of cloud service providers. Facebook made some of its designs open source through the Open Compute Project, allowing a number of manufacturers to produce interchangeable units. These provide an intermediate and commercially standard way to move toward stripped-down machines without the hassle of learning hardware design.
And copying Google's no-repair approach simply comes down to a business analysis. Deciding to replace failed drives is a common compromise, but it also means relying on a more traditional server with a drive caddy. High server costs point toward repair, but inexpensive cloud-style servers procured at much lower prices and coupled with drives at distributor prices can change the economics and make it more affordable for even the most budget-conscious enterprise.
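The business analysis above can be reduced to a simple comparison: repair a failed server only when the repair costs less than the value of the service life it has left. The sketch below uses straight-line depreciation, and every price, rate and function name in it is a hypothetical placeholder to be replaced with your own figures.

```python
def repair_cost(part_cost: float, labor_hours: float, labor_rate: float) -> float:
    """Total cost of one repair: the replacement part plus technician time."""
    return part_cost + labor_hours * labor_rate


def remaining_value(server_price: float, lifecycle_years: float, age_years: float) -> float:
    """Straight-line value of the service life a failed server still has left."""
    years_left = max(0.0, lifecycle_years - age_years)
    return server_price * years_left / lifecycle_years


def should_repair(server_price: float, lifecycle_years: float, age_years: float,
                  part_cost: float, labor_hours: float, labor_rate: float) -> bool:
    """Repair only if it costs less than the value it recovers."""
    cost = repair_cost(part_cost, labor_hours, labor_rate)
    return cost < remaining_value(server_price, lifecycle_years, age_years)


# Hypothetical traditional server: $8,000, one year into a four-year lifecycle.
print(should_repair(8000, 4, 1, part_cost=150, labor_hours=2, labor_rate=100))
# Hypothetical cloud-style server: $1,200, three years into a four-year lifecycle.
print(should_repair(1200, 4, 3, part_cost=150, labor_hours=2, labor_rate=100))
```

A fuller model would also account for downtime, the cost of spare capacity and the staff needed to run a repair operation, but the same comparison sits at its core.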
About the author:
Jim O'Reilly was vice president of engineering at Germane Systems, where he created ruggedized servers and storage for the U.S. submarine fleet. He has also held senior management positions at SGI/Rackable and Verari; was CEO at startups Scalant and CDS; headed operations at PC Brand and Metalithic; and led major divisions of Memorex-Telex and NCR, where his team developed the first SCSI ASIC, now in the Smithsonian. Jim is currently a consultant focused on storage and cloud computing.