Amazon does not oversubscribe

TechTarget

Amazon CTO Werner Vogels said on Twitter that AWS does not oversubscribe its services.

“If you launch an instance type you get the performance you ask (and pay) for, period. No oversubscription,” he wrote. An earlier message said that CPU performance is fixed for each instance, and that customers are granted access to the full amount of virtual CPU, which Amazon designates in Elastic Compute Units (ECUs).

Why is this important? For one thing, it’s another data point about AWS operations dribbled out: the company is famously tight-lipped on even completely innocuous matters, let alone operational details. This allows some more inferences to be made about what AWS actually is.

Second, EVERYBODY oversubscribes, unless they explicitly say they don’t.

Oversubscription is:
Oversubscription, in the IT world, originates with having a fixed amount of bandwidth and a user base that is greater than one. It stems from the idea that you have a total capacity of resources that a single user will rarely, if ever, approach. You tell the pool of users they all have a theoretical maximum amount of bandwidth, 1 Gbps on the office LAN, for instance.

Your average user consumes much less than that (less than 10 Mbps, say), so you are pretty safe if you say that 50 users can all use 10 Mbps at the same time. This is oversubscription. Clearly, not everybody can have 1 Gbps at once, but some can have it sometimes. Mostly, network management takes care of making sure nobody hogs all the bandwidth, or when congestion becomes an issue, more resources are ready. Why do this?

Oversubscription is a lot easier than having conversations that go like this:

Admin to office manager: “Well, yes, these wires CAN carry 1 Gbps of data. But you only need about 10 Mbps, so what we do is set up rules, so that…
What? No, you DO have the capacity. Listen, we can either hard limit EVERYONE to 10 Mbps like it was, or we can let usage be elastic…What?
OK, fine. Everyone has a 1 Gbps connection. Goodbye now.”

Problems with this model only arise when the provider does not have enough overhead on hand to comfortably manage surges in demand, i.e., they are lying about their capacity. Comcast and AT&T do this and get rightfully pilloried for fraud from time to time, as do airlines, and so on. That’s wrong.

Fundamentally, though, this is a business practice based on statistically sound math. It makes zero sense to give everyone 1000 feet of rope when 98% only ever need 36 inches.
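To put rough numbers on the office LAN example, here is a minimal Python sketch. It reads the figures above as a 1 Gbps link, 50 users and about 10 Mbps of typical demand each. The function names are made up for illustration, and the "everyone at typical demand at once" case is a simplification, not a model of real traffic.

```python
# Illustrative only: figures come from the office LAN example,
# read as a 1 Gbps link and 10 Mbps of typical per-user demand.

def oversubscription_ratio(users: int, promised_mbps: float, capacity_mbps: float) -> float:
    """Total bandwidth promised to users divided by what the link can carry."""
    return users * promised_mbps / capacity_mbps


def typical_utilization(users: int, typical_mbps: float, capacity_mbps: float) -> float:
    """Fraction of the link used if every user draws typical demand at once."""
    return users * typical_mbps / capacity_mbps


LINK_MBPS = 1000.0      # 1 Gbps office LAN
USERS = 50
PROMISED_MBPS = 1000.0  # each user is told they have the full 1 Gbps
TYPICAL_MBPS = 10.0     # what the average user actually consumes

ratio = oversubscription_ratio(USERS, PROMISED_MBPS, LINK_MBPS)
load = typical_utilization(USERS, TYPICAL_MBPS, LINK_MBPS)

print(f"oversubscription ratio: {ratio:.0f}:1")
print(f"typical load: {load:.0%} of the link")
```

On these assumptions the link is promised 50 times over and still sits half empty, which is the whole point of the practice.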

And everybody does it
It’s also par for the course in the world of hosting. Bear in mind that a service provider is not lying if it promises you a single CPU core and 1 GB RAM, and then puts 100 customers on a box with 16 cores and 36 GB RAM. It is counting on the fact that most people’s servers and applications can comfortably run on a pocket calculator these days. When demand spikes, the service provider turns on another server and adds more power.

“Problem” customers, who use the advertised resources, go to dedicated boxes if needed, and everyone is happy. The provider thus realizes the vaunted economy of scale, and the customer is content. Service providers often don’t oversubscribe their more expensive offerings, either as a marketing bullet point or to meet customer wishes and provide high-touch customer service. It’s a premium to get your own box.

Which means…
The fact that Amazon does not oversubscribe is indicative of a few things: first, it hasn’t altered its core Xen hypervisor that much, nor are users that far from the base infrastructure. Xen does not allow oversubscription per se, but of course Amazon could show customers whatever it wanted. (This is also largely true of VPS hosters, whose ‘slice’ offerings are often comparable to Amazon’s in price: ~$70/mo for a low end virtual server instance.)

This allows us to make a much better guess about the size of Amazon’s Elastic Compute Cloud (EC2) infrastructure. Every EC2 instance gets a ‘virtual core,’ posited to be about the equivalent of a 1.2 GHz Intel or AMD CPU. Virtual cores are, by convention, no more than half a real CPU core. A dual core CPU equals four virtual cores, or four server instances. AWS servers are quad CPU, quad core, for the most part (this nugget is courtesy of Morphlabs’ Winston Damarillo, who built an AWS clone and studied their environment in detail). So, 16 cores and 32 virtual cores per server.

Guy Rosen, who runs the Jack of all Clouds blog, regularly estimates AWS usage. In September 2010, AWS was home to 3,259 websites. In September-October 2009, Rosen came up with a novel way to count how many servers (each of which had at minimum one virtual core, or half a real CPU) Amazon provisions each day.

He said that AWS’s US-EAST region (one data center with 4 Availability Zones in it) launched 50,212 servers a day. At that time, AWS overall served 1,763 websites. Assume instance launches grew in step with websites (3,259 against 1,763, or about 1.85 times as many), and Amazon is now serving roughly 85% more instances. Let’s say 93,000 server requests a day at US-EAST.

Physical infrastructure thus has to consist of at least 50,000 CPU cores at this point (93,000 instances at half a core each is about 46,500 cores, rounded up), although this is an inductive figure, not a true calculation. It is also quite conservative. Growth at AWS might have been better than double. At 16 cores per server, that’s 3,125 actual servers to run those 50,000 cores and 93,000 virtual machine instances.
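For anyone who wants to check the back-of-the-envelope math, here is a rough Python sketch of the estimate. Every input is a figure quoted in this post. The two big assumptions, that launches grew in step with hosted websites and that each launch occupies one virtual core at the same time, are the post's guesses and not anything Amazon has confirmed. The variable names are invented for illustration.

```python
import math

# Figures quoted in the post.
SITES_2009 = 1_763               # websites hosted on AWS, fall 2009
SITES_2010 = 3_259               # websites hosted on AWS, September 2010
LAUNCHES_PER_DAY_2009 = 50_212   # US-EAST server launches per day, 2009

# Server shape described in the post: quad CPU, quad core,
# with two virtual cores per physical core.
CORES_PER_SERVER = 4 * 4
VCORES_PER_CORE = 2

# Assumption: launches grew at the same rate as hosted websites.
growth = SITES_2010 / SITES_2009
launches_now = LAUNCHES_PER_DAY_2009 * growth

# Assumption: every launch holds one virtual core, all at once.
cores_needed = launches_now / VCORES_PER_CORE
servers_unrounded = math.ceil(cores_needed / CORES_PER_SERVER)

# The post rounds the core count up to 50,000 before dividing.
servers_rounded = 50_000 // CORES_PER_SERVER

print(f"growth factor: {growth:.2f}x")
print(f"launches per day now: {launches_now:,.0f}")
print(f"physical cores needed: {cores_needed:,.0f}")
print(f"servers (unrounded cores): {servers_unrounded:,}")
print(f"servers (50,000 cores): {servers_rounded:,}")
```

The unrounded path lands a bit under the rounded figure, so the 3,125 number already has some slack baked in.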

Amazon’s cloud in Virginia runs on 3,125 servers?

What? No Way.
Let’s be generous, and take into account the new HPC instances, all the overhead they must keep around, and factor in the use of large and extra large EC2 instances. We’ll give them 4,000 servers, 128,000 virtual CPUs.

US-EAST runs on 4,000 servers, or 100 racks. That could fit in 10,000 sq ft of data center, if someone really knew what they were doing. Equinix’s (just picking that name out of thin air) flagship DC/Virginia facilities operate 155,000 sq ft of Tier 3 space. If I’m even remotely in the ballpark, US-EAST, including cages and crash carts, could fit on one wall.
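The generous version of the estimate works out like this, again as a rough Python sketch using only the numbers in this post. The 32 virtual cores per server comes from the quad CPU, quad core assumption earlier. The rack count and floor area are guesses made here, not measurements.

```python
# Figures from the post; none of them are confirmed by Amazon.
SERVERS = 4_000
RACKS = 100
VCORES_PER_SERVER = 32     # 16 physical cores, two virtual cores each
US_EAST_SQFT = 10_000      # guessed footprint for the whole region
FACILITY_SQFT = 155_000    # Tier 3 space quoted for the colo facility

servers_per_rack = SERVERS // RACKS
virtual_cpus = SERVERS * VCORES_PER_SERVER
sqft_per_rack = US_EAST_SQFT / RACKS
floor_share = US_EAST_SQFT / FACILITY_SQFT

print(f"servers per rack: {servers_per_rack}")
print(f"virtual CPUs: {virtual_cpus:,}")
print(f"floor space per rack: {sqft_per_rack:.0f} sq ft")
print(f"share of facility floor: {floor_share:.1%}")
```

That is 40 servers to a rack and well under a tenth of the quoted floor space.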

AWS cut prices again on Tuesday, by the way.

What was that about economies of scale again?