
Would Azure Data Factory benefit my cloud data?

Azure Data Factory helps move large volumes of data between the cloud and on-premises environments. But how else could it impact my big data strategy?

To distinguish itself from competitors such as Google and Amazon Web Services, Microsoft has rolled out a number of cloud services focused on machine learning, big data and analytics. One of them is Azure Data Factory, a recently unveiled workflow system for coordinating data flows between storage and processing systems.

AWS offers Data Pipeline, a service comparable to Data Factory, while Google offers Google Cloud Dataflow. All three services are designed to streamline repeated data movement operations, but Azure Data Factory has its own set of features for enterprises to consider.

Azure Data Factory serves some of the functions of an extract, transform and load (ETL) tool, but it is designed primarily to move large volumes of data between cloud and on-premises resources. Developers can create data pipelines using the Azure Data Factory console or PowerShell scripts.

Data Factory performs a number of "activities," which are processes that take a data set as input and produce a data set as output. The basic activity within Azure Data Factory is the Copy Activity, which supports a range of sources, including Azure Blob Storage, Azure SQL Database, Azure Table Storage, SQL Server databases running on premises or on infrastructure as a service, and Oracle databases. The Copy Activity supports some transformations, as well.

Azure Data Factory is especially well-suited for big data applications and analysis. For example, the HDInsight Activity allows developers to work with Pig, a high-level data flow scripting language in the Hadoop ecosystem, and Hive, a data warehouse layer that provides SQL-like queries over Hadoop data. Users can store data in a data hub for further processing.

Users configure Azure Data Factory jobs with JSON specifications, including inputs, outputs, transformations and policies. Transformations can take advantage of Data Factory date, time and text functions.
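To make the idea of a JSON specification concrete, here is a minimal sketch that builds a pipeline definition in Python and serializes it to JSON. Every name and field in it (the pipeline, activity and data set names, and keys such as "transformation" and "policy") is a hypothetical placeholder chosen to mirror the inputs, outputs, transformations and policies described above. It is not a verified Azure Data Factory schema.

```python
import json

# Hypothetical pipeline definition. All names and field keys below are
# illustrative assumptions, not a verified Azure Data Factory schema.
pipeline_spec = {
    "name": "CopyBlobToSqlPipeline",  # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSql",  # hypothetical activity name
                "type": "Copy",
                # Data sets the activity reads from and writes to.
                "inputs": [{"name": "BlobInputDataset"}],
                "outputs": [{"name": "SqlOutputDataset"}],
                # How data moves from the source to the destination.
                "transformation": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"},
                },
                # Execution rules such as retries and time limits.
                "policy": {"retry": 3, "timeout": "01:00:00"},
            }
        ]
    },
}

# Serialize the definition to a JSON document.
spec_json = json.dumps(pipeline_spec, indent=2)
print(spec_json)
```

Building the definition as a dictionary first makes it easy to generate many similar pipelines from a script before handing the resulting JSON to whatever deployment tool is in use.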

The Azure Management Portal provides access to key information about Data Factory processes and workloads. Administrators can view information on data sets and linked services, along with activity run details.

Azure Data Factory pricing is based on activity frequency. Low-frequency activities in the cloud start at $0.30 per activity, while on-premises activities cost $0.75 per activity. Microsoft does not charge for the first five low-frequency activities performed each month.

High-frequency activities start at $0.50 for cloud and $1.25 for on-premises environments. Microsoft offers volume discounts based on the number of activities performed each month.
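The per-activity prices quoted above can be turned into a rough monthly estimate. The sketch below is only an approximation under stated assumptions: it uses the starting prices as flat rates, ignores volume discounts, and applies the five free low-frequency activities to cloud activities first and then to on-premises ones, an ordering the pricing description does not specify. The function and key names are hypothetical.

```python
# Starting prices in US cents per activity, from the figures quoted above.
PRICE_CENTS = {
    ("low", "cloud"): 30,
    ("low", "on_premises"): 75,
    ("high", "cloud"): 50,
    ("high", "on_premises"): 125,
}
FREE_LOW_FREQUENCY = 5  # first five low-frequency activities each month


def estimate_monthly_cost_cents(counts):
    """Estimate a monthly bill in cents, ignoring volume discounts.

    `counts` maps (frequency, location) pairs to a number of activities.
    Assumption: the free allowance is used up by cloud activities first,
    then by on-premises activities.
    """
    remaining_free = FREE_LOW_FREQUENCY
    total = 0
    for location in ("cloud", "on_premises"):
        n = counts.get(("low", location), 0)
        free = min(n, remaining_free)
        remaining_free -= free
        total += (n - free) * PRICE_CENTS[("low", location)]
    for location in ("cloud", "on_premises"):
        n = counts.get(("high", location), 0)
        total += n * PRICE_CENTS[("high", location)]
    return total


example = {
    ("low", "cloud"): 8,
    ("low", "on_premises"): 2,
    ("high", "cloud"): 4,
}
cents = estimate_monthly_cost_cents(example)
print(f"Estimated monthly cost: ${cents / 100:.2f}")
```

Working in whole cents avoids floating-point rounding when many activities are added together. A real bill would depend on the discount tiers, which are not detailed here.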

About the author:
Dan Sullivan holds a master of science degree and is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.
