Researchers at Brookhaven National Laboratory found themselves under the gun earlier this year, and used the open source Nimbus Project for a dramatic demonstration of the flexibility of cloud computing.
Experiments from the Relativistic Heavy Ion Collider (RHIC) were due to be presented at the Quark Matter conference, but researchers on the Solenoidal Tracker at RHIC (STAR) experiment found themselves without readily available computer time to run a last-minute project.
So they used Nimbus, a suite of virtual machine management and provisioning software, to move their Linux-based in-house testing platform into Amazon's Elastic Compute Cloud (EC2) and process more than 1 million "events" over about ten days on a 300-node cluster of virtual servers.
Dr. Jerome Lauret, head of computing resources at STAR, called the job ideal for EC2: it had comparatively small I/O needs and no storage requirements. The team ran a massively parallel server cluster, feeding data up to EC2 and pulling results back down. Despite moving nearly 2 TB of data, bandwidth wasn't a concern; the computations took long enough that data could transfer while the calculations ran.
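That overlap pattern — transfer time hiding behind computation time — can be sketched in a few lines. This is a toy illustration, not STAR's actual code: the bounded queue, worker count, and squaring "computation" are all stand-ins.

```python
# Sketch: a downloader thread keeps feeding event data while worker threads
# compute, so slow transfers proceed in parallel with long calculations.
import queue
import threading

work = queue.Queue(maxsize=8)   # bounded buffer between transfer and compute
results = []
lock = threading.Lock()
DONE = object()                 # sentinel marking the end of the event stream

def downloader(events):
    # Stands in for pulling event files from remote storage into the cluster.
    for ev in events:
        work.put(ev)            # blocks when workers fall behind
    work.put(DONE)

def worker():
    # Stands in for the (much slower) physics computation on each event.
    while True:
        ev = work.get()
        if ev is DONE:
            work.put(DONE)      # re-post the sentinel for sibling workers
            break
        with lock:
            results.append(ev * ev)

threads = [threading.Thread(target=downloader, args=(range(100),))]
threads += [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The bounded queue is the key design choice: it lets transfer and compute run concurrently without ever buffering more than a handful of events in memory.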
Lauret said that it was a last-minute idea that researchers brought to him, and he had to make a fast decision. "Do I shoot this project down," he said, "or look at the easy availability of computing power to make it work?" He was assisted in the relatively novel project by STAR production manager Lidia Didenko and technology analyst Levente Hajdu.
It was a learning experience on several fronts. Kate Keahey, creator of the Nimbus project and a researcher at the Argonne National Laboratory in Illinois, said the cluster started "as 100 nodes on [EC2's] 10-cent [per hour] servers and it ran 4 times slower" than it would have on a comparable cluster at Brookhaven. STAR began the project over a weekend, and Lauret said that on Monday they realized they would miss their target "by a factor of two, and the panic started to set in."
The price, at least, was right: Lauret said that Amazon did not charge him for the compute time. Amazon has sponsored other research projects in the cloud, but it's not clear what the criteria are.
Keahey said that STAR was an ideal match for Nimbus and EC2 since its computing needs are infrequent. STAR has "the scientific equivalent of the post-Thanksgiving Day retail rush" when processing data, but long periods when there is little computing work to be done. Nimbus provides a frontend to Amazon's cloud that allowed STAR to use its already-validated clustering software to deploy and take down as many virtual machine images as needed with very little effort. Nimbus supports the Amazon APIs and plans to add support for major grid and cloud providers, with the goal of eventually becoming an "invisible layer" between public and private clouds and grid providers.
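The elastic pattern Keahey describes — request N identical VMs from a pre-validated image, run the job, tear everything down — can be sketched as follows. This is an illustrative mock, not Nimbus's or Amazon's actual client API; a real deployment would point a standard EC2 client at a Nimbus endpoint instead of the `FakeEC2` stand-in and hypothetical image name used here.

```python
# Sketch of the deploy/compute/teardown lifecycle behind an EC2-style API.
class FakeEC2:
    """Stand-in for an EC2-compatible endpoint such as the one Nimbus exposes."""
    def __init__(self):
        self.instances = set()
        self._next = 0

    def run_instances(self, image_id, count):
        # Boot `count` VMs from one validated image; return their ids.
        ids = [f"i-{self._next + k:08d}" for k in range(count)]
        self._next += count
        self.instances.update(ids)
        return ids

    def terminate_instances(self, ids):
        self.instances.difference_update(ids)

def run_campaign(ec2, image_id, nodes, job):
    ids = ec2.run_instances(image_id, nodes)   # deploy the whole cluster
    try:
        return job(ids)                        # farm events out to the VMs
    finally:
        ec2.terminate_instances(ids)           # always take the images down

ec2 = FakeEC2()
result = run_campaign(ec2, "ami-star-image", 300,
                      job=lambda ids: f"processed on {len(ids)} nodes")
```

Because every node boots from the same validated image, scaling from 100 nodes to 300 is a parameter change rather than an administration project — which is the point of the "post-Thanksgiving rush" workload.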
Traditional grid model wouldn't cut it
Lauret said the traditional grid model posed problems. "When you submit a job on a grid, you don't know, a priori," what kind of computing platform it will end up on, so you have to either prepare for every possible situation, a "giant undertaking," or build jobs on the fly and adjust your software based on the results you get back. Because STAR's modeling software is so complex, Lauret said, most well-known models of grid interoperability don't necessarily apply: "If you build on the fly...it is not always possible for a complex framework with external dependencies."
Virtualization solved these issues for STAR and let Lauret run experiments very quickly on already-vetted software. In scientific simulation modeling, verifying the integrity and quality of the software used to run the simulation is critical, and the process of creating rigorously tested software normally takes months of painstaking work. Being able to use already designed and tested software, Lauret said, gave him "confidence [in results] that is absolutely unique" for a simulation run on such short notice.
Both Keahey and Lauret point to cost concerns. "Would I actually pay for the next one? Yes, but the price must be right," Lauret said. He added that the various scientific communities that need lots of computing power should take a long, hard look at their needs before jumping into cloud computing; moving and storing data would be the primary cost drivers for research groups. "If we are doing simple [computing] jobs, we will not compete with Amazon," he added. Keahey said that in her circles, "the favorite after-dinner pastime right now" is to "pull out napkins and calculate their cluster cost on Amazon."
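That napkin calculation is easy to reproduce for the STAR run using figures from this article; note the per-GB transfer fee below is an illustrative assumption, not Amazon's quoted price.

```python
# Napkin math in the spirit Keahey describes, using the article's figures:
# 300 virtual servers for about ten days at EC2's 10-cent-per-hour rate.
nodes = 300
hours = 10 * 24          # ~ten days of wall-clock time
rate = 0.10              # USD per node-hour (the 2009 small-instance price)

compute_cost = nodes * hours * rate
print(f"Compute: ${compute_cost:,.0f}")    # → Compute: $7,200

# Data transfer: ~2 TB moved; assume an illustrative $0.10/GB transfer fee.
transfer_cost = 2000 * 0.10
print(f"Transfer: ${transfer_cost:,.0f}")  # → Transfer: $200
```

Even on rough assumptions, transfer is a rounding error next to compute for this workload — which matches Lauret's point that data movement, not CPU time, is what research groups should scrutinize before committing.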
The scientists also said that the appetite for scientific computing is nearly unlimited if the price is right. Asked how fast she thought researchers could use up the current capacity of cloud and grid providers, Keahey said, "tomorrow." By way of comparison, Lauret said, each of Google's new floating data centers equals the compute capacity of Brookhaven. "So give me just one," he joked.