We’ve pointed out that Apache Spark and Hadoop MapReduce are two different Big Data beasts: the former is a high-performance in-memory data-processing framework, and the latter is a mature batch-processing platform for the petabyte scale. We also know that Apache Hive and HBase are two very different tools with similar functions. Hive is an SQL-like engine that runs MapReduce jobs, while HBase is a NoSQL key/value database on Hadoop.
But what about the question of Spark vs. Tez?
On paper, Spark and Tez have a lot in common: both possess in-memory capabilities, both can run on top of Hadoop YARN, and both are agnostic about data types and data sources. So what’s the difference between Spark and Tez?
What is Apache Spark?
Apache Spark is an open-source analytics engine and cluster computing framework for processing big data. Spark began as a research project at UC Berkeley's AMPLab and is now maintained under the non-profit Apache Software Foundation, a decentralized organization that works on a variety of open-source software projects.
Open-sourced in 2010 and released as version 1.0 in 2014, Spark builds on the ideas behind the Hadoop MapReduce distributed computing framework. Spark preserves many of the benefits of MapReduce, like scalability and fault tolerance, while also improving speed and ease of use.
In addition to its core data processing engine, Spark includes libraries for SQL, machine learning, and stream processing. The Spark framework is compatible with the Java, Scala, Python, and R programming languages, winning it broad appeal among developers. Spark also supports third-party technologies like Amazon S3, Hadoop's HDFS, MapR XD, and NoSQL databases such as Cassandra and MongoDB.
Spark's appeal comes from its capacity to unite different processes, technologies, and techniques into a single big data pipeline, enhancing productivity and efficiency. Thanks to its flexibility, Spark has become a highly popular and effective "Swiss army knife" for the world of big data processing.
What is Apache Tez?
Like Spark, Apache Tez is an open-source framework for big data processing based on the MapReduce technology. Both Spark and Tez offer an execution engine that is capable of using directed acyclic graphs (DAGs) to process extremely large quantities of data.
Tez generalizes the MapReduce paradigm by treating computations as DAGs. Work that would otherwise require a chain of separate MapReduce jobs is combined into a single DAG whose nodes represent the processing steps, which lets Tez work out which steps can run concurrently and which must run in sequence.
Meanwhile, the edges of the DAG represent the movement of data between those steps. Tez is data type-agnostic, which means that it's concerned only with the movement of data (and not the format it takes).
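To make the DAG idea concrete, here is a minimal sketch in plain Python. It is not the Tez API: the `run_dag` helper, the vertex names, and the sample data are all made up for illustration. Each node is a processing step, each edge says whose output feeds whom, and the runner never looks at what the data actually is.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(vertices, upstream):
    """Run every vertex once all of its upstream vertices have produced output.

    vertices: name -> function taking a list of upstream outputs (hypothetical).
    upstream: name -> list of vertex names whose output it consumes (the edges).
    """
    outputs = {}
    # static_order() yields each vertex only after all of its predecessors.
    for name in TopologicalSorter(upstream).static_order():
        inputs = [outputs[u] for u in upstream.get(name, [])]
        outputs[name] = vertices[name](inputs)
    return outputs


# Toy pipeline: one source, two independent branches, one join.
vertices = {
    "read": lambda _: ["spark tez", "tez yarn"],
    "tokenize": lambda ins: [w for line in ins[0] for w in line.split()],
    "count": lambda ins: {w: ins[0].count(w) for w in set(ins[0])},
    "distinct": lambda ins: len(set(ins[0])),
    "report": lambda ins: (max(ins[0], key=ins[0].get), ins[1]),
}
upstream = {
    "read": [],
    "tokenize": ["read"],
    "count": ["tokenize"],
    "distinct": ["tokenize"],
    "report": ["count", "distinct"],
}

result = run_dag(vertices, upstream)
print(result["report"])  # most frequent word and number of distinct words
```

Note that `count` and `distinct` depend only on `tokenize`, so a real engine would be free to run them at the same time. This toy runner simply executes them one after the other.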
By addressing some of MapReduce's limitations, Tez seeks to improve the performance of data processing jobs. This added efficiency empowers programmers to make the design and development choices that they believe are best for their project.
The Differences Between Spark and Tez
Apache Spark brands itself as "a unified analytics engine for large-scale data processing." Meanwhile, Apache Tez calls itself "an application framework which allows for a complex directed acyclic graph of tasks for processing data."
Considering the fact that Spark also uses directed acyclic graphs, don’t the two tools sound a bit similar? Maybe. But there are also important points of distinction to consider. Here are the main differences between Apache Spark and Apache Tez:
- Difference #1: Hive and Pig
- Difference #2: Hadoop YARN
- Difference #3: Performance tests
We'll go into more detail about each difference between Spark and Tez in the sections below.
Hive and Pig are two open-source Apache software applications for big data: Hive is a data warehouse, while Pig is a platform for creating data processing jobs that run on Hadoop (including on Spark or Tez).
Shaun Connolly, Hortonworks product strategy vice president, differentiates between Spark and Tez by saying that Spark is a general-purpose engine with APIs for mainstream developers, while Tez is a framework for purpose-built tools such as Hive and Pig.
While both Spark and Tez claim to support Pig and Hive, the reality isn't so clear. We tried running Pig on Spark using the Spork project, but we had some issues. This may mean that full Pig support for Spark is still under construction—we’ll try again in the near future.
The YARN Spin
YARN is Hadoop's resource manager and job scheduler. In theory, Spark can execute either as a standalone application or on top of YARN. Tez, however, has been purpose-built to execute on top of YARN. In practice, though, Spark doesn't always share a cluster gracefully with other YARN applications, because of the way it holds on to its containers.
Gopal V, one of the developers for the Tez project, wrote an extensive post about why he likes Tez. He concludes that:
“Between the frameworks I've played with, that is the real differentiating feature of Tez - Tez does not require containers to be kept running to do anything, just the Application Manager running in the idle periods between different queries. You can hold onto containers, but it is an optimization, not a requirement during idle periods for the session.”
By “frameworks” he also means Spark—its containers need to keep running and hog resources even when they aren’t processing any data. Tez containers, however, can shut down as soon as they are finished and release the resources.
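A rough back-of-the-envelope sketch shows why this matters. The model below is a deliberate simplification with hypothetical numbers: it counts only container-seconds reserved over a session, and it ignores the application master and any container start-up cost. The `container_seconds` function and its parameters are made up for illustration and don't correspond to any Spark, Tez, or YARN API.

```python
def container_seconds(busy, idle, containers, hold_when_idle):
    """Container-seconds reserved over one session (toy model).

    busy: seconds each query spends running.
    idle: seconds of each gap between queries.
    hold_when_idle: True if containers stay allocated during the gaps.
    """
    total = containers * sum(busy)
    if hold_when_idle:
        total += containers * sum(idle)
    return total


# Hypothetical interactive session: three short queries, long pauses between.
busy = [30, 45, 20]
idle = [300, 600]

held = container_seconds(busy, idle, containers=10, hold_when_idle=True)
released = container_seconds(busy, idle, containers=10, hold_when_idle=False)
print(held, released)  # 9950 950
```

In this made-up session, most of the reserved capacity in the "hold" case is spent waiting for the next query, which is the cost Gopal V's quote is pointing at.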
Chances are that you already use Hadoop-based applications such as Hive, HBase, or even classic MapReduce. You can install Spark on any Hadoop cluster, but you may run into resource management issues. Tez, on the other hand, could fit quite nicely into your YARN architecture, resource management included.
Performance According to Whom?
Perhaps the biggest question of them all—which is faster, Spark or Tez? According to various benchmarks, both options dramatically improve upon MapReduce performance; however, the winner may depend on who's doing the measuring. The jury's still out in terms of an independent third-party assessment.
Spark claims to run up to 100 times faster than MapReduce. Benchmarks performed at UC Berkeley’s AMPLab show that Spark runs much faster than Tez (Spark is noted in the tests as Shark, which is the predecessor to Spark SQL).
Given the fact that Berkeley invented Spark, however, these tests might not be completely unbiased. Also, these benchmarks were made several years ago with Hive 0.12, which runs over MapReduce. Beginning with version 0.13, Hive can use Tez as its execution engine, which results in significant performance improvements.
Meanwhile, Hortonworks did their own benchmarks on the question of Spark and Tez performance. They found that Hive 0.13 running over Tez works up to 100 times faster than Hive 0.12 (though quite a few test queries mysteriously disappeared). 100 times faster... hmm, sound familiar?
So Spark and Tez both have up to 100 times better performance than Hadoop MapReduce. That’s nice, but is Spark faster than Tez or vice versa? Who knows. If you ask someone who works for IBM they’ll say that the answer is neither: IBM Big SQL is the fastest. We really need a third party to run independent performance tests and settle the score, once and for all.
The Bottom Line
The question of Spark versus Tez may ultimately come down to politics and popularity: a clash of the Big Data titans, with Cloudera rooting for Spark and Hortonworks for Tez. Spark is more widespread since it’s available in various distributions, while Tez is only available in Hortonworks’ distro.
In the end, the user bases may decide the frameworks’ fate. At the moment, Spark is winning the race by far, at least according to Google Trends.
Maybe after the hype has faded, after people have gained more experience working with both Spark and Tez, we’ll finally be able to tell who will become the heir to the MapReduce crown.
Originally published: March 4th, 2019