
Nimbus cloud project saves brainiacs' bacon

Virtualization and an open source cloud manager made an Amazon EC2 project timely for a heavy ion collider at Brookhaven National Laboratory.

Researchers at Brookhaven National Laboratory found themselves under the gun earlier this year, and used the open source Nimbus Project for a dramatic demonstration of the flexibility of cloud computing.

Experiments from the Relativistic Heavy Ion Collider (RHIC) were due to be presented at the Quark Matter conference, but researchers at the Solenoid Tracker at RHIC (STAR) found themselves without readily available computer time to run a last-minute project.

So they used Nimbus, a suite of virtual machine management and provisioning software, to move their Linux-based in-house testing platform into Amazon's Elastic Compute Cloud (EC2) and process more than 1 million "events" over about ten days on a 300-node cluster of virtual servers.

STAR's calculations involved simulated event reconstruction: researchers use data gathered from the digital detectors buried in the RHIC to simulate every possible position and state of the particles smashed together by the machine, and each such simulation is an "event." Each event took around 90 seconds to process, and this experiment consisted of 1.2 million events.
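As a back-of-envelope consistency check of those figures (assuming perfect parallelism across a 300-node cluster, which no real run achieves):

```python
# Sanity-check the article's figures: 1.2 million events at ~90 seconds
# each, spread across 300 virtual servers (idealized, perfectly parallel).
events = 1_200_000
seconds_per_event = 90
nodes = 300

total_cpu_seconds = events * seconds_per_event      # 108,000,000 CPU-seconds
wall_clock_days = total_cpu_seconds / nodes / 86400
print(f"{wall_clock_days:.1f} days")                # about 4.2 days of pure compute
```

The quoted "about ten days" is plausible against that idealized floor once the slower initial cluster and data-transfer overhead described below are factored in.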

In this case, Dr. Jerome Lauret, head of computing resources at STAR, called it ideal for EC2, since the work had comparatively small I/O needs and no storage requirements. They ran a massively parallel server cluster, feeding data up to EC2 and receiving results back down. Despite consuming nearly 2 TB of data, bandwidth wasn't a concern: the computation took so long that data transfers simply overlapped with the calculations.

Lauret said that it was a last-minute idea that researchers brought to him, and he had to make a fast decision. "Do I shoot this project down," he said, "or look at the easy availability of computing power to make it work?" He was assisted in the relatively novel project by STAR production manager Lidia Didenko and technology analyst Levente Hajdu.

It was a learning experience on several fronts. Kate Keahey, creator of the Nimbus project and a researcher at the Argonne National Laboratory in Illinois, said the cluster started "as 100 nodes on [EC2's] 10-cent [per hour] servers and it ran 4 times slower" than it would have on a comparable cluster at Brookhaven. STAR began the project over a weekend, and Lauret said that on Monday they realized they would miss their target "by a factor of two, and the panic started to set in."

Keahey said the solution was simple. "Within hours, we had a pow-wow and stood up another cluster" of 300 nodes. It took 24 hours for the data already uploaded into their first cluster to "drain" into the more powerful server cluster.

Additionally, the price was right: Lauret said that Amazon did not charge him for the compute time. Amazon has sponsored other research projects in the cloud, but it's not clear what the criteria are.

Keahey said that STAR was an ideal match for Nimbus and EC2 since its needs are infrequent. The group faces "the scientific equivalent of the post-Thanksgiving Day retail rush" when processing data, but long periods when there is little computing work to be done. Nimbus provides a frontend to Amazon's cloud that allowed STAR to use its already-validated clustering software to deploy and tear down as many images as needed with very little effort. Nimbus supports the Amazon APIs and plans to add support for major grid and cloud providers, eventually becoming an "invisible layer" between public and private clouds and grid providers.

Traditional grid model wouldn't cut it

Lauret said that the traditional grid model posed problems. "When you submit a job on a grid, you don't know, a priori," what kind of computing platform it will end up on, he explained, so you either prepare for every possible situation, a "giant undertaking," or you build jobs on the fly and adjust your software based on the results you get back. Because STAR's modeling software is so complex, Lauret said, most well-known models of grid interoperability don't necessarily apply, and building on the fly "is not always possible for a complex framework with external dependencies."

Virtualization solved these issues for STAR and let Lauret run experiments very quickly on already vetted software. In scientific simulation modeling, verifying the integrity and quality of the software used to run the simulation is critical, and creating rigorously tested software normally takes months of painstaking work. Lauret said being able to use already designed and tested software gave him "confidence [in results] that is absolutely unique" for a simulation run on such short notice.

Both Keahey and Lauret point to cost concerns. "Would I actually pay for the next one? Yes, but the price must be right," Lauret said. He added that the various scientific communities that need lots of computing power should take a long, hard look at their needs before jumping into cloud computing; moving and storing data, he said, would be the primary cost drivers for research groups. "If we are doing simple [computing] jobs, we will not compete with Amazon," he added. Keahey said that in her circles, "The favorite after-dinner pastime right now" is to "pull out napkins and calculate their cluster cost on Amazon."
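That napkin math is straightforward. Here is a rough sketch using the figures quoted in this article (300 nodes, 10-cent-per-hour instances, a roughly ten-day run); these are assumptions for illustration, since Amazon ultimately did not charge STAR at all:

```python
# Napkin estimate of what STAR's run might have cost on EC2, using
# the figures quoted in this article (assumptions; the bill was waived).
nodes = 300                 # size of the second, larger cluster
rate_per_node_hour = 0.10   # EC2's 10-cent-per-hour instance price
run_days = 10               # "about ten days" of processing

cost = nodes * rate_per_node_hour * run_days * 24
print(f"${cost:,.0f}")      # roughly $7,200 for the whole run
```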

The scientists also said that the appetite for scientific computing is nearly unlimited if the price is right. Asked how fast she thought researchers could use up the current capacity of cloud and grid providers, Keahey said, "tomorrow." Lauret said that, by way of comparison, each of Google's new floating data centers equaled the compute capacity of Brookhaven. "So give me just one," he joked.

Carl Brooks is the Technology Writer. Check out our Troposphere blog.
