Cloud platforms, such as AWS, Azure and Google Cloud, have morphed into sophisticated data science environments.
However, the cloud can also complicate data processing pipelines, as it requires analysts to build complex workflows that extract data from multiple sources, funnel it through various filters and then feed it to different data warehouses and analytics services.
To simplify data pipeline development, Google users can deploy Cloud Composer, a managed workflow orchestration service based on the open source Apache Airflow project.
Initially developed by Airbnb, Airflow automates data processing workflows that were previously written as long, intricate batch jobs. Users define Airflow pipelines as directed acyclic graphs (DAGs) written in Python, which lets them instantiate workflows dynamically through code.
Airflow's primary components are:
- the Python job definitions;
- a command-line interface to run, pause, schedule and test jobs, along with various commands to manipulate the DAG, metadata and variables;
- a web UI to visually inspect DAG definitions, dependencies and variables; monitor log files and task duration; and review source code;
- a metadata repository in a relational database, such as MySQL or PostgreSQL, to maintain persistent data and track job status;
- a set of worker processes to execute workflow tasks; and
- a scheduler to instantiate workflows ready to run.
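To make the Python job definitions concrete, a minimal Airflow DAG file might look like the sketch below. The DAG ID, schedule and bash commands are hypothetical placeholders, and the import paths follow Airflow 2 conventions; this is an illustration of the pattern, not a workflow from any particular deployment:

```python
# Illustrative Airflow DAG definition; the dag_id, schedule and
# commands are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipeline",       # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",      # run once per day
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # The >> operator declares dependencies, forming the directed
    # acyclic graph: extract -> transform -> load
    extract >> transform >> load
```

The scheduler reads files like this one, and the worker processes execute each task once its upstream dependencies have succeeded.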
Google Cloud Composer features
Cloud Composer provides operations teams with functionality similar to that of infrastructure-as-code services, such as Google Cloud Deployment Manager or AWS CloudFormation. It includes a library of connectors, an updated UI and a code editor.
Google Cloud Composer also includes:
- full integration with various Google Cloud Platform (GCP) data and analytics services, such as BigQuery, Cloud Storage and Cloud Machine Learning Engine;
- support for Stackdriver logging and monitoring;
- simplified interfaces for DAG workflow management and configuration of the Airflow runtime and development environments;
- client-side developer support via Google Developer Console, Cloud SDK and Python package management;
- access controls to the Airflow web UI using the Google Identity-Aware Proxy;
- support for connections to external environments, both on premises and on other clouds;
- compatibility with open source Airflow; and
- support for other community-developed integrations.
Composer basics and pricing
The Composer Airflow environment runs on a Kubernetes cluster of Google Compute Engine instances. During configuration, users specify the instance type, node count, VM disk size and various network parameters, and they can optionally set up email notifications through the third-party SendGrid service.
As mentioned above, Google Cloud Composer workflows are described as DAGs: a set of tasks to be run, along with their order, relationships and dependencies. To cite an example from Google, if a three-node DAG has tasks A, B and C, the workflow might specify that task A must run before B, while C can run at any time. A DAG can also specify constraints; for example, if task A fails, it can be restarted up to five times. Tasks A, B and C could be any operation, such as running a Spark job on Google Cloud Dataproc.
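The ordering rules in Google's example can be sketched in plain Python, independent of Airflow itself. The task names and retry limit mirror the example above; the scheduling logic is a simplified illustration using the standard library, not Composer's actual implementation:

```python
# Simplified illustration of DAG ordering: task A must run before B,
# while C has no dependencies and can run at any point.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each task to the set of tasks it depends on.
dependencies = {"A": set(), "B": {"A"}, "C": set()}

order = list(TopologicalSorter(dependencies).static_order())
print(order)  # one valid order; A always appears before B

# A DAG can also carry constraints, such as a retry limit:
MAX_RETRIES = 5  # "if task A fails, restart it up to five times"
assert order.index("A") < order.index("B")
```

Any ordering that satisfies the dependency edges is valid, which is why C's position can vary while A must always precede B.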
Like other GCP services, Google Cloud Composer pricing is based on resource consumption, measured by the size of the environment and the duration of workflow operations. Specifically, users are billed per minute based on the number and size of web server nodes, database storage and network traffic. Users pay for Composer itself, for the underlying Kubernetes nodes that run the Airflow worker and scheduler processes, and for the Google Cloud Storage buckets that store DAG workflows and task logs. To estimate costs more accurately, refer to the pricing example in the GCP documentation.
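As a back-of-the-envelope sketch of how per-minute, consumption-based billing adds up, the rates and helper function below are made-up placeholders, not actual GCP prices; real figures live in the GCP pricing documentation:

```python
# Hypothetical cost sketch for per-minute, consumption-based billing.
# The rates are placeholders, NOT actual GCP prices; consult the GCP
# pricing documentation for real figures.

WEB_SERVER_RATE = 0.0002  # assumed $/node-minute (placeholder)
DB_STORAGE_RATE = 0.0001  # assumed $/GB-minute (placeholder)

def estimate_cost(nodes: int, storage_gb: float, minutes: int) -> float:
    """Estimate environment cost over a billing window in minutes."""
    per_minute = nodes * WEB_SERVER_RATE + storage_gb * DB_STORAGE_RATE
    return round(per_minute * minutes, 2)

# e.g. 3 nodes and 10 GB of database storage running for 30 days:
print(estimate_cost(3, 10.0, 60 * 24 * 30))
```

The point of the sketch is the billing model: costs scale with both the size of the environment (nodes, storage) and how long it runs, so idle environments still accrue charges.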