The race is on to pack the cloud with big data. But is your team ready?
Even if enterprises have pursued big data initiatives within their own data centers, it doesn't necessarily mean they're going to succeed in the cloud. And, in most cases, new training and skill sets are a must.
In general, big data in cloud computing can reduce costs compared to an on-premises deployment, said Mike Leone, senior analyst at Enterprise Strategy Group. Not every big data workload or project in the cloud will require you to have a big data expert on staff, but some -- such as those that involve Hadoop -- will.
For example, while it's fairly straightforward to replace a five-node Hadoop cluster on premises with a five-node cluster in the cloud, management challenges still arise, especially around software interoperability, Leone said.
Four skills for big data in cloud computing
IT teams should focus on four main categories of skills to succeed with big data in cloud computing, according to Manisha Sule, director of big data analytics at Linux Academy, an IT training organization.
- Administration: Knowledge of how to administer Hadoop and NoSQL becomes crucial in this role. Admins also need to carefully configure and manage infrastructure components, such as compute, storage and network, to support big data projects. Experience with Hadoop Distributed File System and NoSQL databases, both of which can store large volumes of data, is also helpful, Sule said.
- Development: A big data developer should have programming experience with languages such as Python, Scala and Java, according to Sule. In addition, experience with offerings like Amazon Web Services' (AWS) Kinesis and Lambda is a plus, as they offer alternatives for real-time processing in a microservices-based architecture.
- Analysis: Big data analysis in the cloud requires expertise in statistics, data mining, machine learning, operations research and computer programming. Data scientists and analysts, as well as machine learning and AI engineers, need to learn how to build algorithms and then automate those algorithms to work with massive sets of live data, Sule said.
- Visualization: A visualization developer designs dashboards that tell a story about the big data an organization collects. IT professionals in this role access disparate data sources and integrate them into a unified and interactive platform.
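To make the development skill above more concrete, here is a minimal Python sketch of the kind of function a developer might write for real-time processing with Kinesis and Lambda. It assumes the standard shape of a Kinesis-triggered Lambda event, in which each record's payload arrives base64-encoded. The JSON payload with a `value` field, the running average and the `make_event` helper are hypothetical, chosen only for illustration.

```python
import base64
import json


def handler(event, context):
    """Hypothetical Lambda entry point: average the readings in one batch."""
    total = 0.0
    count = 0
    for record in event.get("Records", []):
        # Kinesis hands Lambda each record's payload as base64 text.
        payload = base64.b64decode(record["kinesis"]["data"])
        reading = json.loads(payload)
        # "value" is an assumed field name in this sketch's payload.
        total += float(reading["value"])
        count += 1
    average = total / count if count else None
    return {"count": count, "average": average}


def make_event(values):
    """Build a fake Kinesis-style event so the handler can be tried locally."""
    records = []
    for v in values:
        raw = json.dumps({"value": v}).encode("utf-8")
        records.append({"kinesis": {"data": base64.b64encode(raw).decode("ascii")}})
    return {"Records": records}


if __name__ == "__main__":
    print(handler(make_event([1, 2, 3]), None))
```

Because the handler is a plain function, it can be exercised on a laptop with a fake event before anything is deployed, which fits the learn-by-doing approach.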
While there are classes to help master these four skills, it's best for IT pros to adopt a learn-by-doing approach, Sule said. And the cloud lends itself to that model.
"As you prepare, you can easily register yourself for a trial account to get a direct feel for the available services," she said. Many formal courses also involve hands-on experience.
IT teams should also be prepared to combine several different cloud services from their provider to support a big data initiative, said Muhammad Nabeel, principal Azure architect at Cloud Technology Partners, a consulting firm now owned by Hewlett Packard Enterprise.
"You need to know them in detail and implement them together," he said.
Across the three leading public cloud providers, the key big data services to focus on, according to Nabeel, include:
- Microsoft Azure: HDInsight for using Hadoop, Spark, R Server, HBase and Storm clusters on Azure.
- Google Cloud Platform: BigQuery for an analytics data warehouse; Cloud Dataflow for batch and stream processing; Cloud Dataproc for managed Hadoop and Spark; and Cloud Datalab for data exploration.
- AWS: Elastic MapReduce for using Hadoop and Spark; Athena for analytics on data in Simple Storage Service; and Amazon Elasticsearch Service for managed Elasticsearch clusters.
In addition to third-party training options, cloud providers have helpful learning features to speed up adoption. For example, in Google Cloud Console, there is a "Try it out" feature with examples.
And Nabeel agreed that hands-on experience goes a long way. "Taking a course can help, but it isn't always clear if the course will actually address the specific knowledge you need," he said.
Look at the big (data) picture
While knowledge of provider-specific big data tools is important, organizations should also strive to diversify the skills of their team across multiple cloud platforms. Leaning too heavily on a single provider is a bit shortsighted, as there is no clear winner in the market right now, said Avi Freedman, co-founder and CEO of Kentik, a provider of network traffic analytics.
"That means you could be developing a skill set that has no long-term demand," Freedman said. A better approach is to learn general concepts related to big data in cloud computing, such as distributed systems and databases.
"Once you have that, learning a given cloud service provider's implementation of the technology should be pretty easy," he said.
In addition, in any cloud environment, be sure to thoroughly understand all the different ways you will access and use your data, from the types of applications involved to the kind of data being stored, said Cassie Dennis, director at Monday Loves You, a digital marketing agency.
"If the person or team who developed the connections didn't understand the [business] needs up and down the process, this will be ugly fast," she said. As with any new IT project, inquisitiveness and good judgment go a long way.