How the cloud complicates the data processing pipeline

As apps move to a cloud infrastructure, it's difficult to answer key questions about data acquisition and placement. These guidelines will help you choose the best approach.

Cloud services have given application developers and IT architects an unprecedented level of choice and flexibility when planning a system's information architecture and data placement. At the same time, the ascendance of mobile devices as the preferred application target, joining PCs and browsers, means the same data often must be reprocessed, formatted and sent to many different destinations.

The cloud is also home to third-party data sources that are increasingly used as data input. Together, these trends make planning the data processing pipeline a complex and often confusing endeavor. As applications move from enterprise to cloud infrastructure, it's difficult to answer key questions about data acquisition and placement, including:

  • data access methods;
  • data format and appropriate cloud services;
  • where and how to temporarily stage, extract and transform data during application processing; and
  • where and how to store data for application presentation and archiving.

In a sense, the data processing pipeline for today's applications resembles an hourglass, with many different sources funneling into a processing engine that feeds many different outputs. The eventual data source and placement design will be highly dependent on your application architecture. However, some general guidelines can make for better decisions when addressing the data processing pipeline.
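The hourglass shape can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source adapters, the record layout (`user`, `score`) and the two output formats are hypothetical and stand in for whatever sources and clients your application really has.

```python
import json
from typing import Dict, Iterable, List

Record = Dict[str, object]

def from_csv_rows(rows: Iterable[str]) -> List[Record]:
    """Hypothetical source adapter: parse 'user,score' text lines."""
    records = []
    for line in rows:
        user, score = line.split(",")
        records.append({"user": user.strip(), "score": int(score)})
    return records

def from_json_events(payload: str) -> List[Record]:
    """Hypothetical source adapter: parse a JSON array of events."""
    return [{"user": e["user"], "score": int(e["score"])} for e in json.loads(payload)]

def process(records: Iterable[Record]) -> Dict[str, int]:
    """The narrow waist of the hourglass: total score per user, whatever the source."""
    totals: Dict[str, int] = {}
    for record in records:
        totals[record["user"]] = totals.get(record["user"], 0) + record["score"]
    return totals

def to_mobile(totals: Dict[str, int]) -> str:
    """Hypothetical output: compact JSON for a mobile client."""
    return json.dumps(totals, sort_keys=True, separators=(",", ":"))

def to_report(totals: Dict[str, int]) -> str:
    """Hypothetical output: plain-text report, one line per user."""
    return "\n".join(f"{user}: {score}" for user, score in sorted(totals.items()))

# Many sources in, one processing step, many outputs out.
records = from_csv_rows(["ann, 3", "bob, 5"]) + from_json_events('[{"user": "ann", "score": 4}]')
totals = process(records)
outputs = {"mobile": to_mobile(totals), "report": to_report(totals)}
```

Because every adapter normalizes its input to the same record shape, a new source or a new destination can be added without touching the processing step in the middle.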

Types of data and compatible cloud storage services

How and where you store data depends on its type and source. Typically, applications use one or more of the following data types:

  • Structured: Data that is tightly organized into one or more spreadsheet-like tables in which each entry, or row, has a unique key. Multiple tables and the relationships between them make up a relational database. Commercial relational databases are the backbone of most enterprise applications and are available as cloud services such as Azure SQL, AWS RDS or Redshift. Relational databases can also be implemented on top of a raw block storage service such as AWS EBS or Google Persistent Disk.
  • Unstructured: Data such as text or media that has little or no logical relationship between items and thus doesn't fit into a defined data model. Typically, such data is stored in flat files, object stores or NoSQL databases. Appropriate cloud storage services include AWS S3 and DynamoDB, or Google Cloud Storage and Bigtable.
  • Files: File systems built on partitioned block storage and often used for text data such as user documents and event logs, or for VM system images. Cloud services include AWS Elastic File System and Azure Files.
  • Streaming: A form of unstructured data characterized by high-volume flows of real-time information such as event logs or media. Typically, streaming data is stored in system files or raw block devices. Cloud services optimized for streaming data include AWS Kinesis, Azure Stream Analytics and Google Cloud Pub/Sub.
  • Big data: A catch-all for very large datasets containing both structured and unstructured information, typically stored and analyzed using scalable, distributed systems such as Hadoop, BigQuery or Apache Spark. All the major infrastructure as a service (IaaS) providers offer big data services, including AWS Elastic MapReduce, Google BigQuery and Dataproc, and Azure HDInsight.
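During planning, it can help to keep this list in a form your tooling can query. The following is a minimal sketch, not a recommendation engine: the service names are the ones listed above, while the dictionary keys and the `candidate_services` helper are hypothetical names introduced for illustration.

```python
# Services as named in the list above; the grouping keys are illustrative.
SERVICES_BY_DATA_TYPE = {
    "structured": ["Azure SQL", "AWS RDS", "Redshift"],
    "unstructured": ["AWS S3", "DynamoDB", "Google Cloud Storage", "Bigtable"],
    "files": ["AWS Elastic File System", "Azure Files"],
    "streaming": ["AWS Kinesis", "Azure Stream Analytics", "Google Cloud Pub/Sub"],
    "big data": ["AWS Elastic MapReduce", "Google BigQuery", "Dataproc", "Azure HDInsight"],
}

def candidate_services(data_type: str) -> list:
    """Return a copy of the shortlist for a data type (case-insensitive)."""
    key = data_type.strip().lower()
    if key not in SERVICES_BY_DATA_TYPE:
        raise KeyError(f"unknown data type: {data_type!r}")
    return list(SERVICES_BY_DATA_TYPE[key])

shortlist = candidate_services("Streaming")
```

A shortlist like this is only a starting point. The final choice still depends on the application architecture, as noted earlier.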

Incorporating internal and third-party sources in the data processing pipeline

Internal data sources come in all forms, but the most common are relational databases, files and event streams from files. Access methods include ODBC or JDBC for databases; named pipes; file sharing or other network protocols such as syslog and SNMP; and APIs using JSON or XML. In the cloud, most of these go out the window unless you're using a virtual private network (VPN) connection to the cloud, for example into an AWS Virtual Private Cloud (VPC).

Instead, enterprise apps updated for the Web and cloud era typically include RESTful APIs over HTTP for data access, as do almost all third-party sources. Indeed, it's difficult to access third-party data repositories without an API that uses Secure Sockets Layer (SSL), or its successor Transport Layer Security (TLS), over the public Internet and hence doesn't require a direct VPN connection.
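As a rough sketch of this access pattern, the Python below requests JSON from a REST endpoint over HTTPS using only the standard library. The URL, the bearer-token header and the response fields are assumptions made for illustration and do not describe any particular provider's API. The demo call uses a fake opener so that it runs without network access.

```python
import io
import json
import urllib.request
from urllib.parse import urlsplit

def build_request(url: str, token: str) -> urllib.request.Request:
    """Build a JSON GET request; refuse plain HTTP so credentials stay encrypted."""
    if urlsplit(url).scheme != "https":
        raise ValueError("refusing to send credentials over a non-HTTPS URL")
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            # Assumption: the API accepts a bearer token; real schemes vary.
            "Authorization": f"Bearer {token}",
        },
    )

def fetch_json(url: str, token: str, opener=urllib.request.urlopen):
    """Send the request and decode the JSON body."""
    request = build_request(url, token)
    with opener(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))

def fake_opener(request, timeout=None):
    """Offline stand-in for urlopen that returns a canned JSON body."""
    return io.BytesIO(b'{"city": "Oslo", "temp_c": 4}')

# Hypothetical endpoint; swap in a real URL and drop `opener` to go over the network.
reading = fetch_json("https://api.example.com/v1/weather", "demo-token", opener=fake_opener)
```

Nothing here needs a VPN: the only network requirement is outbound HTTPS, which is why this pattern works the same way for internal and third-party sources.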

Putting it all together: Data processing pipelines

Good examples of using several different data types and cloud storage services are N-Tier applications performing some form of extract, transform, load (ETL) operations. For an overview of a simple cloud implementation of an N-Tier app that includes both object and block storage, see the article on calculating the true cost of AWS application development. For a more complex example, consider an online gaming system that might include object file repositories for game content and logs, a NoSQL database for player profiles and state management, and a Hadoop or MapReduce cluster for usage analysis and player stats.

Among the many presentations at AWS re:Invent 2015 was a real-world example of cloud data management from Coursera, an online education service. Coursera described its architecture for large-scale ETL data flows, which uses a variety of AWS services, including RDS, S3, EC2, EMR and Redshift. These are stitched together using AWS Data Pipeline, a service designed to automate complex data access and processing workflows, and Dataduct, an open source framework that lets you build reusable Data Pipeline jobs.

The key point for application developers is that the rich variety of IaaS data services makes it relatively easy to aggregate data from a variety of sources and data types, process it using custom code and big data systems such as Hadoop and MapReduce, and build a data warehouse that can be used to feed a variety of applications and generate custom reports. Note that Azure provides similar ETL capabilities through its Data Factory service.
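To make the extract, transform and load steps concrete, here is a deliberately small sketch in Python. It is not Coursera's pipeline or any cloud service's API: an in-memory SQLite database stands in for the data warehouse, and the CSV layout, table name and cleaning rule are assumptions chosen for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the second data row is missing its level.
RAW_CSV = """player,level,minutes
ann,3,42
bob,,17
ann,4,30
"""

def extract(text: str) -> list:
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows: list) -> list:
    """Transform: drop incomplete rows and convert fields to integers."""
    cleaned = []
    for row in rows:
        if not row["level"]:
            continue  # assumed cleaning rule: skip rows without a level
        cleaned.append((row["player"], int(row["level"]), int(row["minutes"])))
    return cleaned

def load(conn: sqlite3.Connection, rows: list) -> None:
    """Load: write cleaned rows into the stand-in warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions (player TEXT, level INTEGER, minutes INTEGER)"
    )
    conn.executemany("INSERT INTO sessions VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# A report query of the kind a warehouse would serve to applications.
report = conn.execute(
    "SELECT player, SUM(minutes) FROM sessions GROUP BY player ORDER BY player"
).fetchall()
```

In a cloud deployment, each function would typically talk to a managed service instead, for example reading from object storage and loading into a hosted warehouse, but the three-stage shape stays the same.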

Translating existing application data models to the cloud can be a simple process if only files and SQL databases are involved. However, the plethora of cloud data services complicates the process for sophisticated applications aggregating multiple data types from different sources and using an ETL process to generate derived data sets. In this case, developers and system architects should consider using intermediate data stores for big data processing and warehousing and automating the data flow via services like Data Pipeline (AWS), Data Factory (Azure) or Cloud Dataflow (Google).
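The core idea behind automating a data flow is running each step only after the steps it depends on have finished. The sketch below illustrates that idea with Python's standard `graphlib` module (Python 3.9 or later). It does not show how Data Pipeline, Data Factory or Cloud Dataflow work internally. The step names are hypothetical, and the actions are placeholders that do nothing.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it can start.
DEPENDENCIES = {
    "extract_db": set(),
    "extract_logs": set(),
    "stage": {"extract_db", "extract_logs"},   # intermediate data store
    "transform": {"stage"},
    "load_warehouse": {"transform"},
}

def run_pipeline(dependencies: dict, actions: dict) -> list:
    """Run every step's action in an order that respects the dependencies."""
    executed = []
    for step in TopologicalSorter(dependencies).static_order():
        actions[step]()
        executed.append(step)
    return executed

# Placeholder actions; real ones would call storage and compute services.
actions = {name: (lambda: None) for name in DEPENDENCIES}
order = run_pipeline(DEPENDENCIES, actions)
```

Managed workflow services add what this sketch leaves out, such as scheduling, retries and provisioning of compute resources.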
