LAS VEGAS - Carl Westphal, IT director for the Translational Genomics Research Institute (TGen), has tried squeezing large data sets through public pipes, and it's no fun, he said. Genomic sequencing at TGen routinely produces multi-terabyte image sets, which are processed on Arizona State University's "Saguaro 2" supercomputer.
Westphal said his researchers chafe at transferring data from TGen facilities to Saguaro 2, even over a dedicated gigabit Ethernet link. TGen's raw images are 3-4 terabytes apiece, and processing each one returns at least a terabyte of results to be stored and analyzed by researchers.
At that scale, a GigE link can seem like a very narrow pipe. Consequently, said Westphal, TGen is rolling out its own virtualized clusters built from off-the-shelf hardware, to speed up results and feed an insatiable demand for compute resources. In his sector, Westphal said, "I can't see how public cloud is possible until they fix the data management issue."
Speaking on a panel on data management architectures at Interop, Michelle Munson, CEO of file transfer optimizer Aspera, pinpointed bandwidth as the biggest bottleneck facing cloud computing to date. She said that bandwidth will determine whether large consumers can move into the cloud cost-effectively, and that for many, it simply won't be worth it. Data strategies should focus on reducing transmission in and out of the cloud as much as possible, according to Munson.
Data that originates in the cloud, such as e-commerce and web analytics data, can stay in the cloud and be processed cost-effectively there, but moving extremely large data sets in and out of the cloud for processing services like Amazon's Elastic MapReduce would be a boondoggle.
"Nodes on [Amazon] AWS are good for 250 Mb/sec to disk; we've measured it," Munson said. That means that even if a user has the bandwidth to support those speeds, the best they could do would be about 80 GB per hour. "There are cases," Munson added, "where it's still cheaper to ship drives" with data on them than to upload it.
Amazon is trying to combat this bottleneck for consumers with a new data loading service, in which drives are mailed to AWS and their contents uploaded into a customer's S3 account. This is not useful, of course, for users who need to get all that data back out again or who only need it there for a short period.
"The math is fairly simple," Munson said. To determine whether public cloud resources are cost-effective, users should measure their bandwidth against the size of their data and determine how long the transfer is going to take. For a lot of companies, the bandwidth they have is plenty, and many IT shops don't need data crunched; they just need it available.
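The calculation Munson describes can be sketched in a few lines of Python. This is only an illustration: the function name `transfer_hours` is made up for this example, and it assumes decimal units (1 TB = 10^12 bytes), a fully saturated link and no protocol overhead, so real transfers would take longer. The inputs are the 3-4 terabyte image sizes and the gigabit link Westphal mentions.

```python
def transfer_hours(data_terabytes: float, link_megabits_per_sec: float) -> float:
    """Ideal transfer time in hours, ignoring protocol overhead and contention."""
    bits = data_terabytes * 1e12 * 8                 # decimal terabytes -> bits
    seconds = bits / (link_megabits_per_sec * 1e6)   # megabits/sec -> bits/sec
    return seconds / 3600


# TGen's raw images are 3-4 TB apiece; the link is gigabit Ethernet (1,000 Mb/sec).
for size_tb in (3, 4):
    hours = transfer_hours(size_tb, 1000)
    print(f"{size_tb} TB over 1 Gb/sec: {hours:.1f} hours")
```

Under these idealized assumptions, a single 3 TB image needs roughly 6.7 hours on the gigabit link and a 4 TB image roughly 8.9 hours, before any results are sent back.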
However, one of the fastest-growing consumers of computing power is the research community, where medical imaging, genome sequencing and star surveys involve astounding amounts of data both before and after analysis.
Munson thinks that research facilities with these extreme computing needs will find it far more cost-effective to set up private clouds using open source tools than to go to a public cloud like Amazon or purchase private cloud infrastructure from commercial vendors. She discussed an Aspera customer who routinely transfers terabyte data sets between Cambridge, UK, and the National Institutes of Health in the US. It takes 12 hours a shot, and they wouldn't do it if they didn't need to share their work, she said. Munson sees a future for research data making it into the cloud after the fact, but says that for now computing power has to go where it's needed, not the other way around.