While the most powerful supercomputers get most of the press, a good amount of research uses midrange computers with around 500 nodes running in parallel, he said. NERSC will be deploying a 720-node cluster with 65 teraflops of performance, which could support many of these types of applications simultaneously.
Scheduling cloud experiments
NERSC will be experimenting with traditional batch scheduling and a dynamic virtual private cluster manager. In a batch environment, a scientist might need to wait three hours or three days. "That is not conducive to debugging, where you run [a program], find a problem and make changes. They want a rapid turn-around, and with a virtual private cluster, they can have complete and exclusive access to a modest number of nodes for an extended period of time," Broughton said.
The virtual private cluster will allow users to schedule time in advance for debugging or for interactive development. When a developer stops to analyze their results and load in a new job, the unused computer capacity can temporarily be reassigned to other users. This approach might make the platform more accessible to a larger audience, said Broughton, but it also may end up being an inefficient way of working with HPC.
In other cases, a scientist may want to get feedback, try a different set of parameters after receiving a result, or incorporate real-time data into the loop, such as where to point a telescope.
NERSC is also exploring allowing researchers to create dynamically selectable images of their cluster environment. This will give researchers options for the software environment, such as the operating system (OS), application version, compilers and libraries.
Linking facilities at breathtaking speed
The NERSC and ALCF facilities will be linked at a groundbreaking 100 gigabits per second, which will facilitate rapid transfer of data between geographically dispersed clouds and enable scientists to use available computing resources regardless of location. The two teams want to build a large-scale parallel file system, sufficient for large scientific applications, that operates across this link.
Argonne will be leading efforts to provision Eucalyptus, an open source cloud infrastructure similar to Amazon's.
Pete Beckman, director of ALCF, said that Argonne has already started doing some research with EC2, and applications will be ready to test on the new cloud when the machines go live.
"By building a cloud, we can figure out where the commercial offerings might make sense and where our own might make sense," he said.
One of their goals is to improve the Message Passing Interface (MPI) technology at the heart of most parallel applications, Beckman said. Almost all scientific applications are built on MPI, and the programs using that interface are sometimes limited in how fast they can send data through it. Argonne has developed MPICH, the most popular implementation of MPI, and it hopes to tweak the technology and make it more amenable to cloud applications.
Scaling out with virtual machines
As part of its research, Argonne will focus on using virtual machines (VMs) to improve the management of HPC applications. In general, it will run the Xen hypervisor on each node.
"We have looked at VMs in the lab, and this will be the first time for Argonne to deploy them as a production resource for users," Beckman said. "In the past, we would create a machine, install the software and when done, say 'This is what you get access to.'"
VMs could make backing up HPC applications at the system level easier by providing an alternative to the process of checkpointing. Many HPC applications are handcrafted and run for extended periods. The process of backing up the state of an application is called checkpointing and typically needs to be done by the programmer at the application level.
But virtualization can be used to encapsulate an application, its data, and its state at the OS level, so that checkpointing can be done automatically at the VM level. A downside, however, is that the VM layer has no knowledge of which data the application actually needs to preserve, so it ends up storing considerably more than application-level checkpointing would.
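To make the contrast concrete, here is a minimal sketch of application-level checkpointing in Python. It is an illustration only, not code from NERSC or Argonne: the function names, the toy "simulation" and the checkpoint file format (a pickled dictionary) are all assumptions. The point is that the programmer decides exactly which state to save, which keeps the checkpoint small but has to be written by hand.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # a crash mid-write never leaves a torn checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_simulation(path, total_steps, checkpoint_every=100, stop_after=None):
    """Hypothetical iterative job that resumes from its last checkpoint.

    stop_after simulates an interruption after that many steps in this run.
    """
    # Only the two values the job needs are saved, not the whole process image.
    state = load_checkpoint(path) or {"step": 0, "value": 0.0}
    steps_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return state  # simulated failure: work since the last checkpoint is lost
        state["value"] += 0.5 * state["step"]  # stand-in for real computation
        state["step"] += 1
        steps_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

A VM-level snapshot would need none of this code, but it would capture the entire memory and disk image of the guest rather than the two values saved here.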
The biggest benefit of using a VM, said Beckman, is that the user can define his own software. Most users may still want to use the standard setup, but those who want to customize this environment or build a Web services infrastructure that needs to be installed or configured when the application runs can do so. The VM approach allows them to make a complete custom setup with their own compilers and message libraries and then ask for 100 instances of their specialized environment.
The main concern with using VMs in HPC is overhead. A VM can increase the time it takes to move data to and from memory, the computer hardware and the network.
Scientists eager to squeeze every ounce of computing horsepower out of their hardware are skittish about slowing their calculations by adding another layer. Beckman said that if they can keep the VM overhead below 1% to 2%, it would not end up being significant. The researchers are also curious to discover which kinds of applications incur the biggest performance penalties from a VM architecture.
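As a rough illustration of how such an overhead figure could be computed, the sketch below compares the run time of the same job on bare metal and inside a VM. The function names and the 2% default budget are assumptions chosen to match the range Beckman mentions; any timings passed in would come from a team's own measurements.

```python
def overhead_percent(bare_metal_seconds, vm_seconds):
    """Relative slowdown of the VM run versus the bare-metal run, in percent."""
    if bare_metal_seconds <= 0:
        raise ValueError("bare-metal time must be positive")
    return 100.0 * (vm_seconds - bare_metal_seconds) / bare_metal_seconds


def within_budget(bare_metal_seconds, vm_seconds, budget_percent=2.0):
    """True if the VM slowdown stays at or below the chosen budget (assumed 2%)."""
    return overhead_percent(bare_metal_seconds, vm_seconds) <= budget_percent
```

Running this comparison per application would also show which workloads pay the largest penalty, since the same budget can be applied to each pair of timings.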
Working on bare metal
Argonne is also working with technology that provisions applications in a bare metal environment. It hopes to make this same infrastructure available as part of the cloud testing environment, as many scientists want to be able to provision new applications on raw hardware to test out new device drivers or find hardware and software bottlenecks.
But this is no easy task. "This is not as convenient as if you had a hidden version of Linux and a VM Manager like Xen," Beckman said. "VMs provide a well-constructed environment that protects you and other users. That is great for some, but not all, research. Sometimes you need to experiment right on the hardware."
At this stage, there are still more questions than answers. "It is clear that cloud computing will have a leading role in future scientific discovery," Beckman said. "In the end, we will know which scientific application domains demonstrate the best performance and what software and processes are necessary for those applications to take advantage of cloud services."
George Lawton is a contributor to SearchCloudComputing.com.