Everyone who's looked at cloud computing or big data has probably heard of Hadoop, an open source implementation of the Web-search MapReduce algorithm. Some have proposed that Hadoop become the standard for the future, but for most business applications, Hadoop has serious limits. Now, there's a new game: Apache Spark. Cloud and big data planners need to understand what Spark could bring to data-intensive applications and how it could impact development planning. This means understanding the differences between Hadoop and Spark, recognizing how to ensure Spark implementations retain their benefits, and planning for a Spark-driven future.
Even proponents of Hadoop agree it has three problems. First, programming data access is complicated. Second, performance limitations make it hard to use Hadoop for real-time applications. And third, unstructured data isn't what businesses are about. It's this last problem that probably poses the greatest challenge to Hadoop adoption.
Hive helps Hadoop, but not enough
All businesses have unstructured data, but most business analytics are based on transaction analysis -- and transactions are almost always structured. Further, business applications use relational database (RDBMS) access most often, and make most database requests through the Structured Query Language (SQL). Hadoop doesn't support RDBMS/SQL in its basic form; this is provided through an add-on, called Hive.
Hive's SQL support can reduce the problems of Hadoop programming, but Hive only complicates the other classic Hadoop problem -- performance. Applications using Hive and Hadoop can run tens or hundreds of times longer than true RDBMS/SQL applications, and this can make Hadoop impractical as a single cloud and big data database service.
Spark is, in many ways, a successor to Hadoop, although it will run on top of Hadoop's file system and can share cluster data with it. Spark has native APIs for the popular Web programming languages -- including Java and Python -- and it also has native SQL capability. But the big difference between Spark and Hadoop is performance. Although the speed improvement Spark can bring will vary depending on the specific application, there are reports of improvements of 10 to almost 300 times.
Spark's magic lies in how it processes queries
A big part of Spark's magic lies in how it processes queries. Typically, Hadoop operations are disk to disk. This means that each stage of a Hadoop application stores its results on disk, and the next stage must then read them back. Spark is designed for in-memory operation, which makes it much faster -- particularly for complex Hadoop applications, including Hive queries.
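The difference between the two styles can be sketched in a few lines of plain Python. This is a toy model, not Hadoop or Spark code: the stage functions, the sample orders and the stage1.json file name are all made up for illustration. It only shows where the intermediate result lives between stages.

```python
import json
import os
import tempfile


def clean(records):
    """Stage 1: drop records that have no amount."""
    return [r for r in records if r.get("amount") is not None]


def total_by_customer(records):
    """Stage 2: sum the amounts for each customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals


def run_disk_to_disk(records, workdir):
    """Disk-to-disk style: stage 1 writes a file, stage 2 reads it back."""
    stage1_path = os.path.join(workdir, "stage1.json")  # hypothetical file name
    with open(stage1_path, "w") as f:
        json.dump(clean(records), f)
    with open(stage1_path) as f:
        intermediate = json.load(f)
    return total_by_customer(intermediate)


def run_in_memory(records):
    """In-memory style: stage 1 hands its result straight to stage 2."""
    return total_by_customer(clean(records))


# Made-up sample data
orders = [
    {"customer": "acme", "amount": 120},
    {"customer": "acme", "amount": 30},
    {"customer": "globex", "amount": None},
    {"customer": "globex", "amount": 75},
]

with tempfile.TemporaryDirectory() as workdir:
    disk_result = run_disk_to_disk(orders, workdir)
memory_result = run_in_memory(orders)
```

Both paths produce the same totals. The disk-to-disk path pays for a write and a read between every pair of stages, and that cost grows with the number of stages in a job.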
Spark also restructures information. It's built around the notion of a Resilient Distributed Dataset (RDD). RDDs are immutable data structures that can be built or transformed only under specific rules, and they allow Spark to know what and when to cache, as well as how to rebuild data that is lost. The RDD structure makes Spark more efficient, but there's still more.
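A minimal sketch can make the RDD idea more concrete. The ToyDataset class below is hypothetical and is not Spark's API. It assumes only what is described above: a dataset is read-only, new datasets are derived from existing ones through a fixed set of operations, and a derived dataset remembers how it was built, so a cached copy that is lost can be recomputed.

```python
class ToyDataset:
    """Toy stand-in for an RDD: read-only, built from a source or a parent."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = tuple(source) if source is not None else None
        self._parent = parent          # the dataset this one was derived from
        self._transform = transform    # how to derive it from the parent
        self._cache = None
        self._cache_enabled = False
        self.compute_count = 0         # how many times rows were recomputed

    def map(self, fn):
        return ToyDataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyDataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        """Ask for the computed rows to be kept in memory."""
        self._cache_enabled = True
        return self

    def drop_cache(self):
        """Simulate losing the in-memory copy, for example on a node failure."""
        self._cache = None

    def collect(self):
        if self._cache is not None:
            return list(self._cache)
        if self._source is not None:
            rows = list(self._source)
        else:
            rows = self._transform(self._parent.collect())
        self.compute_count += 1
        if self._cache_enabled:
            self._cache = tuple(rows)
        return rows


base = ToyDataset(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

first = evens_squared.collect()    # computed from its parents, then cached
second = evens_squared.collect()   # served from the cache
evens_squared.drop_cache()
third = evens_squared.collect()    # rebuilt by replaying the same steps
```

The original data is never modified, and the derived dataset gives the same answer whether it comes from the cache or is rebuilt from its parents.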
Spark has a schema to describe a relational structure, called a DataFrame. By thinking of data organization in DataFrame terms, it's possible to facilitate the use of SQL queries in Spark, as well as to use DataFrame APIs -- in Python or in Scala, the language Spark itself is written in -- to develop applications. These are actually faster than ones that exploit Spark's basic RDD model.
SQL queries have a special optimization in Spark, using the Catalyst component. Catalyst operates to transform SQL queries into physical plans for access. This process resolves the queries, matches column names and so on. Catalyst is itself built on the Scala programming language, and queries written against the DataFrame APIs pass through the same optimizer as SQL queries. Professional database architects should consider reviewing the way that Catalyst works to help them optimize their query structures and improve performance.
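The general idea of resolving a query and turning it into something executable can be illustrated with a deliberately tiny sketch. This is not Catalyst's algorithm or API. The query format, the resolve and plan functions and the sample rows are all invented for the example, which only shows column names being matched against a schema before a runnable plan is produced.

```python
import operator

# Comparison operators this toy planner understands
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def resolve(column, schema):
    """Match a column name against the schema, ignoring case; return its position."""
    lowered = [name.lower() for name in schema]
    if column.lower() not in lowered:
        raise KeyError(f"unknown column: {column}")
    return lowered.index(column.lower())


def plan(query, schema):
    """Turn a declarative query into a function that runs over row tuples."""
    select_idx = [resolve(c, schema) for c in query["select"]]
    where_col, op, value = query["where"]
    where_idx = resolve(where_col, schema)
    compare = OPS[op]

    def execute(rows):
        # The "physical" step: filter by position, then project by position
        return [
            tuple(row[i] for i in select_idx)
            for row in rows
            if compare(row[where_idx], value)
        ]

    return execute


# Made-up schema, data and query
schema = ["name", "amount", "region"]
rows = [("acme", 150, "east"), ("globex", 75, "west"), ("initech", 300, "east")]
query = {"select": ["Name", "AMOUNT"], "where": ("amount", ">", 100)}

result = plan(query, schema)(rows)
```

Because names are resolved once, up front, a misspelled column fails at planning time rather than partway through a long-running job.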
You need to protect the benefit of Spark
The Spark benefit has to be protected, though. It is crucial that Spark be supplied with ample memory, or the majority of its performance benefits will be lost. Users also report that getting the best Spark performance takes more CPU cores and higher-speed network connections than Hadoop would require. Generally, the benefits of Spark will outweigh the cost of these performance enhancements, but moving from Hadoop to Spark may require server upgrades to be effective.
Spark cluster design also can be important to Spark success. Performance improvements are maximized if the results for a given cluster will fit into memory, so dividing information storage by type or activity can help ensure that will happen. As with all cluster-database technologies, you'll want to consider the benefits of parallel cluster query processing when you assign data to clusters.
The performance enhancements Spark brings can have a profound impact on its utility in a business, which is what makes Spark such a potential game changer. Complex analytics tasks -- particularly those based on structured data storage -- can be so much faster with Spark that they become practical, which could allow some structured data to migrate from an RDBMS into Spark. Access and update functions can be fast enough to allow Spark to be used in real-time applications and not just analytic batch tasks, where Hadoop is typically used.
Spark's SQL efficiency could make it a popular option
For cloud and big data planners and architects, Spark's SQL efficiency could make Spark a popular option for data access and analytics -- and even a tool in application development. Hadoop is difficult to use to the point where a knowledge worker is unlikely to learn it quickly, and Hive performance is often a major barrier to SQL use. With Spark, you can expect knowledge workers and developers to build SQL queries for big data access, eliminating the learning curve -- and errors -- associated with direct access to more complex Hadoop APIs. Even API access to Spark's RDD structure is easier to learn -- and teach -- than Hadoop programming. It's hard not to see Spark as the future of big data.
Open source activity bears that out. Spark is one of the most active projects in the open source community, among the largest in contributor base and total contributions. It's gaining corporate sponsorship at an astounding rate. If big data, cloud data and business data are to meet somewhere, Spark is very likely that place.