Insiders offer advice on big data in the cloud

Moving big data into the cloud can be complex and challenging. To ease your journey, experts offer advice for evaluating cloud platforms, services and tools.

Are you ready to move big data into the cloud? Here's what to think about as you evaluate cloud platforms, services, tools and more, based on advice from systems integrators, channel partners, service providers and analysts.

Know where you're going: Many big data efforts in the cloud fail to meet their cost and revenue goals and can be a source of huge corporate frustration. Success takes serious planning, said Mike Maciag, COO of Altiscale. "You are going to have to plan to make sure you have the right hardware in place, the right software, and the third piece is managing the operations side of things," he said. "You probably don't have the expertise to run things in-house, so that means you really need a plan in place to make sure you can find it."

No single answer: There really is no single "killer app" for big data in the cloud, said Qubole co-founder and CEO Ashish Thusoo. "There is no uber engine out there that is going to solve all of your big data needs," he said. "The best way to think through this problem is to realize that the better approach is probably going to be some kind of integration."

Stay where you are: Big data is complicated enough without reinventing the wheel. "It all comes down to data gravity," said Mike Gualtieri, an analyst with Forrester Research. "If your website is hosted on AWS, stay there and use the AWS tools. If you've got an affinity with Azure, you can do big data there. There really isn't an enormous amount of difference between the Hadoop offerings. What's compelling is that your company has developed an affinity with an existing cloud ecosystem. So, if you're there, stay there. It will make everything easier."

A need to make predictions: Once you have a plan and have decided where to settle your big data in the cloud, the next question is what you need to do with your data. If your company needs to use past and present customer behavioral data to make predictions, then you'll need a big data solution that supports predictive analytics. The advice: Look for a Hadoop service or Hadoop data processing platform.

Narrow big data needs: Perhaps your big data needs are narrower, with line-of-business groups needing quick but not necessarily detailed access. A data scientist's slice of big data can be an easier task to solve in the cloud -- think fast self-service -- and in this case, Forrester's Gualtieri suggested picking a data analytics tool you like and then seeing which clouds it runs on. He mentioned RapidMiner (which is on AWS), but noted that SAS, IBM and others also have strong offerings.

Programming language R: If you have in-house expertise in the very popular open source programming language R, the hands-down best choice is Azure. The Microsoft platform uses R for its back-end analytics (in fact, the company just acquired Revolution Analytics, which developed a commercial version of R), and the company's machine learning services are designed with R in mind. "This is the best possible cloud platform for running R," Gualtieri said.

Reporting analytics: If your big data needs involve sifting through lots of historical data and the traditional data center is too slow, Gualtieri suggested Redshift (for AWS), Azure SQL Server for analysts or Google's BigQuery. All, he said, are better, faster choices than doing it the old way.

Impala vs. Spark vs. Hive: Which Hadoop iteration is the right one for you? And when might you need to leave Hadoop in the dust? Thusoo offered recommendations. For SQL use cases, his pick is Impala, Cloudera's open source, massively parallel processing tool for Hadoop. For more complex SQL tasks, such as data pipelines, he recommends Hive (part of the Hortonworks Data Platform) because of its built-in fault tolerance. And for machine learning, Thusoo's pick is Spark, an open source cluster computing option.
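
For teams that want to capture this rule of thumb somewhere more durable than a slide, the three recommendations can be written down as a simple lookup. The Python sketch below is illustrative only: the workload labels and the pick_engine function are hypothetical names invented for this example, and the mapping does nothing more than restate Thusoo's picks. It does not call any Hadoop, Impala, Hive or Spark API.

```python
# Illustrative sketch: the workload labels and function name are hypothetical.
# The mapping only restates the recommendations quoted in the article.
RECOMMENDED_ENGINE = {
    "interactive_sql": "Impala",   # straightforward SQL use cases
    "sql_pipeline": "Hive",        # complex SQL such as data pipelines (fault tolerance)
    "machine_learning": "Spark",   # machine learning workloads
}


def pick_engine(workload: str) -> str:
    """Return the engine suggested for a workload label, or raise if unknown."""
    try:
        return RECOMMENDED_ENGINE[workload]
    except KeyError:
        raise ValueError(f"no recommendation for workload: {workload!r}") from None


if __name__ == "__main__":
    for label in ("interactive_sql", "sql_pipeline", "machine_learning"):
        print(f"{label}: {pick_engine(label)}")
```

A table like this is only a starting point. Real workloads often mix categories, which is why Thusoo argues that some kind of integration across engines is usually the better approach.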

Software as a service: Overwhelmed with choices? It's easy to be. Qubole offers a big data software-as-a-service platform that runs on AWS, Google and Azure, and the goal is to make operations and consumption simpler, Thusoo said. The company set out to solve two major big data issues -- a lack of operational expertise and a lack of proper tools to create a self-service environment. "We've created a self-managing platform that leverages the elasticity of the cloud and does not require a large operational footprint," he said.

Altiscale is another option. The Hadoop-as-a-service provider runs its own cloud with hardware designed specifically to support the parallel processing necessary to analyze big data. With Hadoop expertise in short supply worldwide -- some estimates say 88% of possible consumers have no access to Hadoop programmers -- Altiscale COO Maciag said it just made sense to create a cloud service that would do one thing -- Hadoop -- very well. "We want customers to focus on analysis," he said. "On their own, businesses right now are simply not getting the dial tone going."
