Cloud storage options: Object storage systems or Hadoop?

Object storage systems and Hadoop are both viable cloud storage options. But consider the latter when performing big data analysis in the cloud.

With so many options to choose from, selecting the best cloud storage system for your applications isn't always easy. But one option to consider -- especially if your apps require access controls -- is an object storage system. Those using the cloud for big data analysis should also consider Hadoop.

Object storage systems such as AWS S3, Microsoft Azure Blob Storage and Google Cloud Storage let you store arbitrary objects in a persistent, highly durable and highly available system that is independent of virtual machine instances. Applications and users access data in object stores through simple APIs. These are typically based on the representational state transfer (REST) architecture, but programming language-specific interfaces are available as well.
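To make the REST-style access concrete, here is a minimal Python sketch that builds the URL for an object in S3 and prepares an HTTP GET request for it without sending anything. The bucket name, key and region are hypothetical, and the sketch assumes S3's virtual-hosted-style addressing. Private objects would also need signed authentication headers, which are left out.

```python
from urllib.parse import quote
from urllib.request import Request


def s3_object_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    """Build a virtual-hosted-style URL for an S3 object.

    The bucket, key and region passed in are illustrative values only.
    """
    # Percent-encode the key, keeping "/" so the path-like key structure survives.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key, safe='/')}"


def build_get_request(bucket: str, key: str, region: str = "us-east-1") -> Request:
    """Prepare, but do not send, an HTTP GET request for the object."""
    # Assumption: the object is publicly readable. Private objects require
    # signed authentication headers, which this sketch omits.
    return Request(s3_object_url(bucket, key, region), method="GET")


if __name__ == "__main__":
    request = build_get_request("example-bucket", "reports/2015/q1 summary.csv")
    print(request.get_method(), request.full_url)
```

In practice, most applications would use a provider's language-specific SDK rather than raw HTTP, since the SDK handles request signing and retries.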

Object stores provide access controls to limit operations on data. Data administrators can apply access controls at the bucket level (analogous to a directory) or the object level (like a file in a directory). Authentication/authorization for object stores is managed by a cloud provider's identity management system or your directory service. With the latter, you might have an on-premises directory synchronized with a cloud-based directory service to consolidate all access control roles and privileges into a single repository.
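As an illustration of limiting operations on data, the following Python sketch assembles an S3-style bucket policy document that grants one identity read-only access, optionally narrowed to objects under a key prefix. The bucket name, the principal ARN and the helper function itself are hypothetical; only the general JSON policy layout (Version, Statement, Effect, Principal, Action, Resource) follows the AWS policy format.

```python
import json


def read_only_bucket_policy(bucket: str, principal_arn: str, prefix: str = "") -> dict:
    """Return a policy document allowing the principal to read objects.

    With an empty prefix the grant covers every object in the bucket;
    with a prefix it covers only objects whose keys start with it.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadOnly",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                # Only reads are allowed; no put or delete actions are granted.
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and user, using a placeholder account ID.
    policy = read_only_bucket_policy(
        "example-bucket",
        "arn:aws:iam::123456789012:user/analyst",
        prefix="reports/",
    )
    print(json.dumps(policy, indent=2))
```

The resulting JSON would be attached to the bucket through the provider's console, CLI or SDK. Other providers use different policy formats, so treat this as one example rather than a general template.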

Those using the cloud for big data analysis have additional options to consider when it comes to storage. For example, AWS offers Elastic MapReduce (EMR), a Hadoop service.

Hadoop was designed to work with its own file system, the Hadoop Distributed File System (HDFS).

When users create a Hadoop cluster using EMR, they can copy data from AWS S3 or some other data store to HDFS on the cluster, or they can access the data directly from S3. HDFS uses storage local to the cluster and typically performs better than reading from S3, but copying the data from S3 to HDFS takes time before the Hadoop job can start. If the EMR cluster will run for extended periods and use the same data for multiple jobs, the additional startup time to copy data from S3 to HDFS may be worth it.
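The tradeoff between copying to HDFS and reading directly from S3 can be framed as a break-even calculation. The Python sketch below assumes a simple model: a one-time copy cost plus a fixed per-job runtime for each option. All timings are hypothetical placeholders that you would replace with measurements from your own cluster.

```python
import math
from typing import Optional


def total_minutes_s3(jobs: int, minutes_per_job_s3: float) -> float:
    """Total runtime when every job reads its input directly from S3."""
    return jobs * minutes_per_job_s3


def total_minutes_hdfs(jobs: int, copy_minutes: float, minutes_per_job_hdfs: float) -> float:
    """Total runtime when data is copied to HDFS once, then reused by every job."""
    return copy_minutes + jobs * minutes_per_job_hdfs


def break_even_jobs(copy_minutes: float,
                    minutes_per_job_s3: float,
                    minutes_per_job_hdfs: float) -> Optional[int]:
    """Smallest number of jobs for which the HDFS copy is strictly faster overall.

    Returns None if HDFS saves no time per job, in which case the copy never pays off.
    """
    saving_per_job = minutes_per_job_s3 - minutes_per_job_hdfs
    if saving_per_job <= 0:
        return None
    # Need copy_minutes < jobs * saving_per_job, so jobs > copy_minutes / saving_per_job.
    return math.floor(copy_minutes / saving_per_job) + 1


if __name__ == "__main__":
    # Hypothetical figures: 30-minute copy, 20 minutes per job from S3, 12 from HDFS.
    print(break_even_jobs(copy_minutes=30, minutes_per_job_s3=20, minutes_per_job_hdfs=12))
```

With these made-up figures, the copy pays off from the fourth job onward. A short-lived cluster running a single job would be better off reading from S3.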

Cloud storage options exist to fit a wide range of needs, but finding the right type of storage for your requirements means finding a suitable balance of latency, ease of use, data integrity and cost.

Control archiving costs

Another common use case for cloud storage is archiving. This process entails copying data to durable, long-term storage for extended periods of time. Here are three items to consider when it comes to controlling archiving costs:

  • Archive data is written once and usually read infrequently. Thus, a top priority should be to limit archiving costs.
  • Object stores can be used for archiving but, unless you need low-latency retrieval, the expense may be higher than necessary.
  • AWS offers the Glacier storage service for archiving, which costs substantially less than S3. Retrieving data from Glacier can take up to several hours, so it is not appropriate for applications that need data on demand.
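The considerations above can be sketched as a small decision helper in Python. The per-gigabyte prices and the retrieval delay below are hypothetical placeholders, not actual AWS rates. The logic simply picks the cheaper archive tier unless the application cannot tolerate the archive tier's retrieval wait.

```python
def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Monthly storage cost in dollars, rounded to cents."""
    return round(size_gb * price_per_gb_month, 2)


def choose_tier(max_wait_hours: float, archive_retrieval_hours: float) -> str:
    """Pick 'archive' when the acceptable wait covers the archive retrieval delay."""
    return "archive" if max_wait_hours >= archive_retrieval_hours else "object-store"


if __name__ == "__main__":
    # Hypothetical prices in dollars per GB per month; check your provider's pricing.
    OBJECT_STORE_PRICE = 0.03
    ARCHIVE_PRICE = 0.01
    # Hypothetical worst-case retrieval delay for the archive tier, in hours.
    ARCHIVE_RETRIEVAL_HOURS = 5

    size_gb = 10_000
    print("object store:", monthly_storage_cost(size_gb, OBJECT_STORE_PRICE))
    print("archive:", monthly_storage_cost(size_gb, ARCHIVE_PRICE))
    # Compliance data that can wait a day for retrieval fits the archive tier.
    print(choose_tier(max_wait_hours=24, archive_retrieval_hours=ARCHIVE_RETRIEVAL_HOURS))
```

A fuller estimate would also account for retrieval and request charges, which this sketch ignores.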

About the author:
Dan Sullivan holds a master of science degree and is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.
