The cloud offers a growing number of options for managing big data and deploying increasingly large data warehouses -- and these options can have a direct impact on the role of cloud admins.
For companies that prefer to use Infrastructure as a Service (IaaS) primarily for compute and storage infrastructure, database administrators can continue to manage the installation, configuration and monitoring of data warehouse components. For companies that prefer to minimize overhead or leverage existing cloud admins, hosted data warehouse services such as Amazon's Redshift may be a better option. But with these hosted data warehouses, cloud admin roles and responsibilities start to bleed into typical database admin tasks.
To illustrate the varying cloud admin responsibilities, consider three options for large-scale data warehouses in the cloud: a hosted data warehouse service with Amazon Redshift, Hive with Amazon Elastic MapReduce, and Apache Spark on a self-managed cluster in Amazon's IaaS cloud. While these are not the only possible combinations of data stores and deployment models, they represent several key issues to consider.
Amazon Redshift: A hosted data warehouse service
Amazon Redshift strives to be a complete data warehouse service, reducing overhead. Its service is based on PostgreSQL and includes monitoring and basic management operations, such as backups. The data warehouse service implements a distributed database that can be scaled either vertically (using larger servers) or horizontally (adding servers).
With Redshift, cloud admins' roles shift to resemble those of database administrators. While they are freed from some design tasks -- such as selecting a physical model -- plenty of other tasks fall to cloud administrators. For example, cloud pros must change the number and types of nodes in the data warehouse according to load on the database. And, even with a managed database service, cloud admins must perform common cloud management tasks, such as maintaining identity management and access control information, as well as meeting compliance reporting requirements.
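As a rough illustration of the node-sizing decision described above, the following Python sketch estimates a node count from average CPU utilization. The 60% target, the node limits and the function names are assumptions made for this example, not Amazon recommendations. The optional second function calls boto3's Redshift `modify_cluster` operation and would need AWS credentials and an existing multi-node cluster.

```python
import math


def recommend_node_count(current_nodes, avg_cpu_percent,
                         target_cpu_percent=60.0, min_nodes=2, max_nodes=32):
    """Suggest a node count that would bring average CPU near the target.

    The target and the node limits are illustrative assumptions.
    """
    if current_nodes < 1:
        raise ValueError("current_nodes must be at least 1")
    # Treat total work as nodes * utilization, then spread it at the target level.
    needed = math.ceil(current_nodes * avg_cpu_percent / target_cpu_percent)
    return max(min_nodes, min(max_nodes, needed))


def resize_if_needed(cluster_id, current_nodes, avg_cpu_percent):
    """Apply the recommendation through boto3 (needs AWS credentials)."""
    target = recommend_node_count(current_nodes, avg_cpu_percent)
    if target != current_nodes:
        import boto3  # imported lazily so the sizing logic runs without it
        boto3.client("redshift").modify_cluster(
            ClusterIdentifier=cluster_id, NumberOfNodes=target)
    return target
```

In practice the utilization figure would come from monitoring data, and a real policy would likely also weigh disk usage and query queue times before resizing.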
Another wrinkle with Redshift is that as the size, scope and number of sources in a data warehouse increase, extract, transform and load (ETL) processing becomes more challenging. Amazon provides support for integrating with Amazon Simple Storage Service (S3), DynamoDB and external hosts.
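For the S3 case, loading usually comes down to a Redshift COPY statement. The sketch below only builds the statement text; the table name, bucket and IAM role passed to it are placeholders supplied by the caller, and it assumes CSV input and role-based authorization.

```python
import re

# Accepts "table" or "schema.table" made of plain identifier characters.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_copy_statement(table, s3_uri, iam_role_arn, skip_header=True):
    """Return a Redshift COPY statement that loads CSV files from S3."""
    if not _IDENTIFIER.match(table):
        raise ValueError("unexpected table name: %r" % table)
    if not s3_uri.startswith("s3://"):
        raise ValueError("source must be an s3:// URI")
    for value in (s3_uri, iam_role_arn):
        if "'" in value:
            # Refuse values that would break out of the quoted literal.
            raise ValueError("single quotes are not allowed in %r" % value)
    parts = [
        "COPY " + table,
        "FROM '" + s3_uri + "'",
        "IAM_ROLE '" + iam_role_arn + "'",
        "FORMAT AS CSV",
    ]
    if skip_header:
        parts.append("IGNOREHEADER 1")
    return " ".join(parts) + ";"
```

The resulting string can be run through any PostgreSQL-compatible client connected to the cluster.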
Developing ETL applications can be a substantial investment. If you need to move data among multiple sources and targets, or if you may change the types of data stores you work with, consider a big data ETL tool, such as Talend. This can help minimize code rewrites if you change or add target databases.
Hive on Amazon EMR: Data warehousing with Hadoop
Running Hive in Amazon Elastic MapReduce (EMR) provides some of the advantages of managed services, but leaves plenty of work for administrators. In addition to designing the Hive logical data model, cloud administrators must manage data loads, design cluster configurations and monitor workloads -- duties typically assigned to database admins.
Hive is a data warehouse system built on Hadoop and supported by EMR. Hadoop was designed to use the Hadoop Distributed File System (HDFS), so admins will need to decide how their data warehouse will use HDFS and S3 to store data. Data that resides in or outside Hadoop can be copied using the DistCp facility, or, if data is in S3, the S3DistCp facility. Since HDFS data is not persisted after a cluster is shut down, administrators will need to plan workflows that copy data to S3 or another persistent data store before terminating a cluster.
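One way to plan that pre-termination copy is to generate an S3DistCp command for each HDFS directory worth keeping. The sketch below assumes an EMR release that ships the `s3-dist-cp` command with its `--src` and `--dest` options; the bucket name, prefix and function names are hypothetical.

```python
def s3distcp_command(hdfs_dir, s3_dest):
    """Build the argument list for one S3DistCp copy from HDFS to S3."""
    if not hdfs_dir.startswith("hdfs://"):
        raise ValueError("source must be an hdfs:// path")
    if not s3_dest.startswith("s3://"):
        raise ValueError("destination must be an s3:// URI")
    return ["s3-dist-cp", "--src", hdfs_dir, "--dest", s3_dest]


def pre_termination_copies(hdfs_dirs, bucket, prefix="warehouse"):
    """Plan one copy per HDFS directory, keyed by the directory's last component."""
    commands = []
    for hdfs_dir in hdfs_dirs:
        name = hdfs_dir.rstrip("/").rsplit("/", 1)[-1]
        dest = "s3://%s/%s/%s/" % (bucket, prefix, name)
        commands.append(s3distcp_command(hdfs_dir, dest))
    return commands
```

Each argument list could be run on the master node (for example with `subprocess.run`) or submitted as a cluster step before the shutdown request.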
Cloud admins who use EMR will also need to determine the configuration of their Hadoop clusters. This includes choosing the instance types to use and the locations of source programs, log files, data sources and SSH keys. More importantly, administrators will need to monitor and debug workflows, such as data loads and long-running batch jobs.
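These configuration choices can be captured in one place. The sketch below assembles a request in the shape expected by boto3's EMR `run_job_flow` operation; the instance type, node count, log prefix and default role names are assumptions to adjust for a given account, and the release label is left to the caller.

```python
def emr_cluster_config(name, release_label, log_bucket, key_name,
                       instance_type="m5.xlarge", instance_count=3):
    """Collect cluster settings as keyword arguments for run_job_flow."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Where EMR writes cluster and step logs.
        "LogUri": "s3://%s/emr-logs/" % log_bucket,
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # EC2 key pair used for SSH access to the master node.
            "Ec2KeyName": key_name,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default role names; an account may use different ones.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

With credentials in place, the dictionary could be passed as `boto3.client("emr").run_job_flow(**config)`. Keeping it as plain data also makes it easy to review or store in version control.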
When multiple workflows run on the same data sets, administrators can optimize the scheduling of workflows to minimize ETL overhead and avoid unnecessary startups and shutdowns of the Hadoop cluster.
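A simple starting point for that kind of scheduling is to group workflows that share input data, so each group can run in one cluster session and load the shared data once. The following sketch uses a small union-find structure for the grouping; the workflow and data set names in the usage note are invented.

```python
from collections import defaultdict


def group_by_shared_datasets(workflows):
    """Group workflow names that are linked through at least one shared data set.

    `workflows` maps a workflow name to the set of data sets it reads.
    Returns a sorted list of sorted name lists.
    """
    parent = {name: name for name in workflows}

    def find(name):
        # Follow parent links to the group's representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_user = {}
    for name in sorted(workflows):
        for dataset in workflows[name]:
            if dataset in first_user:
                # Merge this workflow's group with the earlier user's group.
                parent[find(name)] = find(first_user[dataset])
            else:
                first_user[dataset] = name

    groups = defaultdict(list)
    for name in sorted(workflows):
        groups[find(name)].append(name)
    return sorted(groups.values())
```

For example, if `daily_sales` reads `orders` and `customers`, `churn` reads `customers` and `web_logs`, and `inventory` reads only `stock`, the first two land in one group and `inventory` in its own.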
Apache Spark: The do-it-yourself model
Using the cloud primarily for compute and storage resources requires the most database administration skill of a cloud administrator, but it also provides the most flexibility to use newer platforms, such as Apache Spark.
Data warehouses that need to support interactive querying must look beyond the write-intensive MapReduce processing model of Hadoop. Apache Spark is an in-memory analytics engine for big data analysis with support for machine learning, graph processing, streaming and interactive querying. But because Spark is an incubator project at Apache, it can require more administrative overhead than Redshift or Hive on EMR.
Spark developers provide scripts for running Spark on clusters in Amazon Elastic Compute Cloud, and Amazon has a tutorial on running Spark and Shark (Hive on Spark) in EMR. Cloud administrators are responsible for the full range of administration tasks, from configuring clusters and setting Spark parameters to managing data loads, setting access controls, monitoring jobs and troubleshooting.
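Setting Spark parameters is one of the more mechanical of these tasks. The sketch below renders a properties file in the whitespace-separated format of `conf/spark-defaults.conf`, assuming a Spark release that reads that file; the function name and the values in the usage note are illustrative only.

```python
def render_spark_defaults(settings):
    """Render Spark properties as 'key value' lines, sorted by key."""
    lines = []
    for key in sorted(settings):
        if not key.startswith("spark."):
            raise ValueError("not a Spark property: %r" % key)
        lines.append("%s %s" % (key, settings[key]))
    return "\n".join(lines) + "\n"
```

A call such as `render_spark_defaults({"spark.executor.memory": "4g", "spark.serializer": "org.apache.spark.serializer.KryoSerializer"})` produces text that can be written to the configuration directory on each node. Generating the file from one dictionary helps keep settings consistent across a self-managed cluster.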
About the author:
Dan Sullivan holds a Master of Science degree and is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence, and worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.