
Spark brings value to big data analytics and ETL processes

Spark is garnering a lot of attention for streamlining the development of big data analytics. But, it can also improve ETL processes. Expert George Lawton explains further.

The Spark platform has been getting a lot of buzz for its ability to streamline the development of big data analytics applications in the cloud. But its real strength may lie in its ability to improve ETL processes for large enterprise applications.

This is particularly significant for enterprises that have grown through acquisition and have a need to integrate data from different platforms, said Nick Heudecker, research director at Gartner, at the Spark Summit in San Francisco.

He pointed out that the hype for Spark is high and that the platform is still at the peak of inflated expectations on Gartner's Hype Cycle. Spark is gaining a lot of attention in the media, but only about 9% of enterprises have adopted it for data analytics. At the same time, enterprise data warehousing programs are being implemented by about 57% of all companies.

An enterprise architect from a large medical diagnostics firm said Spark is proving to be much faster than the commercial ETL tools they had tried in the past. The company grew by acquisition and had a need for a unified view of data in dozens of separate databases and cloud services. They were having challenges bringing this data into a centralized repository since each acquired company had different infrastructure. He said that Spark has made it practical to build a virtual data warehouse without having to centralize their data. He added that it was challenging to tune the Spark configurations to get the best performance.

Making sense of Spark

Apache Spark is a distributed computing engine for data processing. Many of its benefits derive from its memory-centric processing model. It was initially developed to make machine learning easier, and it was not originally designed as a multi-tenant system. Because machine learning needs normalized data sets, a lot of ancillary efforts and use cases have followed.

"Machine learning has a need for normalized data sets," said Matei Zaharia, the creator of Spark and CTO at Databricks, a vendor of a commercial version of Spark. This means that when developers build an application that takes data from different databases, the data has to be transformed, or normalized, so that the algorithms can efficiently perform analytics. As it turns out, this is one of the core functions of the ETL systems required for data warehousing.
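As a rough illustration of the kind of normalization the article describes (the systems, field names and records below are invented for the example, not taken from any real deployment), an ETL-style transform might reconcile customer records from two acquired systems into one schema before analytics runs on them:

```python
# Minimal sketch: normalize customer records from two hypothetical
# acquired systems into a single schema, so downstream analytics or
# machine learning can treat all records uniformly.

def normalize_system_a(record):
    # System A stores full names in one field and dates as "YYYY/MM/DD".
    first, _, last = record["full_name"].partition(" ")
    return {
        "first_name": first,
        "last_name": last,
        "signup_date": record["created"].replace("/", "-"),
    }

def normalize_system_b(record):
    # System B already splits names, but uses different keys.
    return {
        "first_name": record["fname"],
        "last_name": record["lname"],
        "signup_date": record["signup"],
    }

records_a = [{"full_name": "Ada Lovelace", "created": "2015/06/01"}]
records_b = [{"fname": "Alan", "lname": "Turing", "signup": "2014-03-12"}]

# The unified list uses one schema regardless of which system a
# record came from -- the core job of the "T" in ETL.
unified = ([normalize_system_a(r) for r in records_a]
           + [normalize_system_b(r) for r in records_b])
```

In Spark, the same idea applies, but the per-record transforms run in parallel across a cluster instead of in a local loop.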

Heudecker said the number one use case for Spark today is data integration and log processing, not machine learning. He cited one example of an enterprise that improved ETL processes where Spark reduced the time to 90 seconds from four hours. This kind of capability is significant because it means an enterprise can ask a question constantly throughout the day rather than once or twice.

Uber, for example, ran into this problem: its original system was built to ingest trip data, normalize it and put it into a data warehouse, but it did not scale across multiple cities. Spark was one of the key elements that allowed the company to bypass the legacy systems for putting data into the warehouse. This made it possible to format the data to drive new types of analytics and machine learning applications.

Consider Spark for different types of computations

One of the challenges enterprise architects face lies in optimizing data for different types of processing. As a result, many companies implement one architecture for transactions, another for operational analytics and a third for business intelligence. The wide interest in Spark has led to the development of engines for these different types of computation on top of a unified underlying data architecture.

The Spark platform addresses batch, interactive and real-time use cases. Before Spark, an enterprise would need to implement three separate platforms for these use cases. In addition, Spark includes machine learning libraries and is available in a wide variety of deployments: on standalone clusters, in conjunction with Hadoop infrastructure or as a cloud service.

The Spark stream processing infrastructure allows enterprises to perform analytics across streams of data in small batches, known as micro-batches, which lets applications apply batch-oriented analytics methods to data in motion. Spark SQL allows developers to do interactive analytics on SQL data sources. Heudecker said this enables more than what organizations are used to with siloed, SQL-only data access.
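The micro-batch idea can be sketched without Spark at all. In the illustrative snippet below (the event stream, batch size and "average fare" analytic are all made up for the example), an unbounded stream is chopped into fixed-size batches so that an ordinary batch function can be reused on data in motion:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    # Collect an (in principle unbounded) iterator of events into
    # fixed-size batches -- the essence of micro-batch streaming.
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        yield batch

def average(batch):
    # An ordinary batch-style analytic, reused unchanged on the stream.
    return sum(batch) / len(batch)

# Illustrative "stream" of ride fares arriving over time.
fares = [7.5, 12.0, 9.0, 14.5, 8.0, 11.0]

per_batch_avg = [average(b) for b in micro_batches(fares, 3)]
```

Spark Streaming applies the same pattern at cluster scale, with the batch interval (for example, one second) standing in for the fixed batch size here.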

Spark also supports graph computation, which is good at identifying the links between entities described in large data sets. It's commonly used in social network analysis, recommendation engines, and fraud detection. Heudecker said he does not hear much about graph computation being adopted by enterprises today, but it is rising in importance.
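The fraud detection use case gives a feel for what graph computation buys you. In the toy sketch below (the accounts, phone numbers and threshold are purely invented), accounts and the identifiers they share form a bipartite graph, and an identifier linked to unusually many accounts flags a candidate fraud ring:

```python
from collections import defaultdict

# Minimal sketch of link analysis for fraud detection: edges connect
# accounts to the contact phone numbers they registered with.
edges = [
    ("acct_1", "555-0100"),
    ("acct_2", "555-0100"),
    ("acct_3", "555-0100"),
    ("acct_4", "555-0199"),
]

# Invert the edges: for each phone number, which accounts link to it?
accounts_by_phone = defaultdict(set)
for account, phone in edges:
    accounts_by_phone[phone].add(account)

# A phone number shared by 3 or more accounts is suspicious here;
# the threshold is arbitrary for the example.
suspicious = {phone: accts for phone, accts in accounts_by_phone.items()
              if len(accts) >= 3}
```

Spark's GraphX component expresses the same kind of neighborhood and connectivity queries, but distributes the graph across a cluster so it scales to billions of edges.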

Consider an in-memory integration tier

A number of tools are emerging to enable new architectures to leverage Spark integration. For example, Alluxio has developed an open source memory-centric cache that works in conjunction with Spark. This allows the enterprise to create an index of metadata describing data stored throughout the enterprise, which can be queried more quickly.

Traditionally, enterprises would put data into a big cluster to derive value, but would lose the historical context of the data because it had been moved. Spark uses metadata to tag these different sources of data to provide the concept of a just-in-time data warehouse. Heudecker said, "This is more than a data warehouse, this is a data warehouse with analytics." Many companies are built from acquisition and will not get rid of separate data warehouses. They can use Alluxio as a repeater station.

The Chinese search giant Baidu has leveraged this kind of approach to speed its query processing for data stored across servers throughout China. Previously, certain complex queries would take four to eight hours to process. By implementing a virtual integration tier on top of Alluxio, Baidu reduced that query time to 10 seconds.

Start small to learn the basics          

Heudecker said there are some significant challenges for Spark. Enterprise architects will have to plan for a rapid pace of updates, which may break existing applications. There is a new release of Spark every 45 days, compared with the yearly release cadence of traditional ETL tools.

"This is impossible to cope with unless you are already in the cloud and this makes it difficult to keep up from an enterprise support perspective. If you are thinking about going down this path, you will have to embrace this incredibly volatile project until it matures and stabilizes," he said.

Heudecker recommends that enterprises start by talking with existing vendors to find out about their plans for Spark and how that can help with individual ETL processes. Most database vendors are incorporating Spark integration into their platforms. He also cautions against rolling Spark out to full production.

"A limited approach is more viable particularly where it can offer the most impactful advantages," Heudecker said.

Next Steps

Q&A: Apache Spark is entering a new phase of adoption

How well do you know Spark? Take this quiz to find out

Is Spark technology enterprise-ready?
