Google continues to flesh out its big data cloud services with its Dataflow project and new partnerships around Hadoop.
The latest open source expansion of the ecosystem involves a deal with Cloudera Inc. to run Google Cloud Dataflow on Apache Spark and the addition of Hortonworks HDP 2.2 to Google Cloud Platform. Both are seen as moves that make it easier for customers to carry out high-performance activities in a more controlled environment without having to piece everything together themselves.
"The agreements made by Google here should provide far easier means of installing and setting up full platforms for dealing with specific needs of different types of data systems," said Clive Longbottom, service director at analyst firm Quocirca, based in Newbury, England.
Dataflow, currently in alpha and a potential competitor to Amazon's Elastic MapReduce, is designed to execute batch or streaming data pipelines. It combines open source SDKs for programming large-scale data processing with managed services that tie together various Google cloud products for executing those big data projects.
Dataflow was previously only available via "runners" on local machines or through Google's managed cloud environment. The Cloudera partnership allows customers to use Spark runners either on-premises or in the cloud. The Spark runner is available on GitHub as part of Cloudera Labs.
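To make the "runner" idea concrete, here is a minimal sketch in plain Python of a pipeline that is defined once and then handed to interchangeable execution engines. It does not use the Dataflow SDK or Spark. Every name in it (`Pipeline`, `LocalRunner`, `MicroBatchRunner`) is hypothetical and chosen only for illustration, and the micro-batch runner is a stand-in for a cluster or streaming engine, not a description of how the Cloudera runner works.

```python
from typing import Callable, Iterable, List


class Pipeline:
    """An ordered list of transforms, independent of any execution engine."""

    def __init__(self) -> None:
        self.steps: List[Callable[[Iterable], Iterable]] = []

    def map(self, fn: Callable) -> "Pipeline":
        # Bind fn as a default argument so each step keeps its own function.
        self.steps.append(lambda items, fn=fn: (fn(x) for x in items))
        return self

    def filter(self, pred: Callable) -> "Pipeline":
        self.steps.append(lambda items, pred=pred: (x for x in items if pred(x)))
        return self


class LocalRunner:
    """Hypothetical runner: executes the pipeline over a bounded input in one pass."""

    def run(self, pipeline: Pipeline, source: Iterable) -> list:
        data = source
        for step in pipeline.steps:
            data = step(data)
        return list(data)


class MicroBatchRunner:
    """Hypothetical runner: consumes the input in small chunks, as a streaming
    or cluster engine might, and reuses the same pipeline definition."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size

    def run(self, pipeline: Pipeline, source: Iterable) -> list:
        local = LocalRunner()
        out: list = []
        batch: list = []
        for item in source:
            batch.append(item)
            if len(batch) == self.batch_size:
                out.extend(local.run(pipeline, batch))
                batch = []
        if batch:  # flush the final partial chunk
            out.extend(local.run(pipeline, batch))
        return out


# The pipeline is written once: trim each line, drop empty lines, upper-case.
pipeline = Pipeline().map(str.strip).filter(bool).map(str.upper)
lines = [" spark ", "", "dataflow", "  hadoop"]

batch_result = LocalRunner().run(pipeline, lines)
stream_result = MicroBatchRunner(batch_size=2).run(pipeline, iter(lines))
```

Both runners return the same result for the same pipeline definition, which is the portability argument in miniature: the code that describes the processing does not change when the engine underneath it does.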
Dataflow has drawn the interest of Google cloud customer Workiva, said Dave Tucker, senior director of platform development for the financial reporting software developer based in Ames, Iowa.
"It potentially helps us solve more of the problems we're dealing with around large amounts of data and trying to sync a lot of different processes we have," Tucker said.
Cloudera has already integrated with Amazon Web Services and Microsoft Azure. Google is trying to do the same things as Amazon by adding cloud services that customers don't have to build themselves and by showing a strong commitment to portability, said Josh Wills, senior director of data science at Cloudera.
"I love the idea of a company publishing cloud services with a proprietary engine, but ensuring customers could take code with them and weren't locked in," Wills said.
Dataflow can be good for unstructured data, such as geospatial data with large amounts of map information, and in fields with complex file formats, such as genomics and bioinformatics, Wills said. If Google succeeds with Dataflow and merges batch and real-time data, it will be a "life-changer," he added.
The deal with Cloudera is a smart one, and it ensures Google maintains a balanced relationship with all three major Hadoop distributions, said Carl Olofson, research vice president for IDC, based in Framingham, Mass.
"The plan to run Dataflow on Spark with Cloudera support makes a ton of sense, enabling Google to add value to their Google Cloud Platform by enabling rapid data loading into Hadoop," Olofson said.
Overall, it looks like Google is developing "function as a service" capabilities that fit somewhere between software as a service and platform as a service, Longbottom said. It's a move in the right direction, but more must be done around messaging, he added.
"Use cases, case studies and guidance for less technical people in what this means to them and their businesses would be a welcome move," Longbottom said.
Trevor Jones is the news writer for SearchCloudComputing. You can reach him at firstname.lastname@example.org.