We’ve pointed out that Apache Spark and Hadoop MapReduce are two different Big Data beasts: the former is a high-performance in-memory data-processing framework, and the latter is a mature batch-processing platform for the petabyte scale. We also know that Apache Hive and HBase are two very different tools with similar functions. Hive is an SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database on Hadoop.
But what about Spark vs. Tez?
On paper, Spark and Tez have a lot in common: both have in-memory capabilities, can run on top of Hadoop YARN and support all data types from any data source. So, what’s the difference?
Apples vs. Oranges
This is how each framework brands itself:
“Apache Spark is a fast and general engine for large-scale data processing.” (source)
“The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.” (source)
Considering that Spark also uses directed acyclic graphs, don’t they sound a bit similar? Maybe. Nonetheless, in an interview, Shaun Connolly, Hortonworks’ vice president of product strategy, differentiates between the two by saying that Spark is a general-purpose engine with APIs for mainstream developers, while Tez is a framework for purpose-built tools such as Hive and Pig.
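To make the shared idea concrete, here is a minimal sketch of a directed acyclic graph of tasks resolved into a valid execution order. It uses only Python’s standard library (graphlib, available from Python 3.9), not Spark or Tez, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each task maps to the set of tasks it depends on.
dag = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
}

def execution_order(graph):
    """Return one valid order in which the tasks could run."""
    return list(TopologicalSorter(graph).static_order())

order = execution_order(dag)
print(order)
```

Both scans have no dependencies, so a scheduler is free to run them in parallel; the join has to wait for both, and the aggregation for the join. Roughly speaking, that scheduling freedom is what a DAG engine exploits over a fixed map-then-reduce pipeline.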
Although Spark boasts ease of use, with friendly APIs available for Python, Scala and Java, there are some caveats when writing Spark jobs. For instance, if you work with flat files, you’ll have to write your own functions to do aggregations. That’s because Spark reads a flat file as rows of text and does not separate them into columns for you.
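The kind of hand-written work this implies can be sketched in plain Python. This is not Spark code: it only mirrors the shape of a job that first splits each raw row into columns and then totals a value per key, which in Spark would typically be expressed with map and reduceByKey. The sample rows and column layout are made up for illustration.

```python
from collections import defaultdict

# Hypothetical flat file: each row arrives as one undivided line of text.
lines = [
    "2014-06-01,books,12.50",
    "2014-06-01,music,7.00",
    "2014-06-02,books,3.25",
]

def parse(line):
    """Split a raw row into columns and keep (category, amount)."""
    _date, category, amount = line.split(",")
    return category, float(amount)

def sum_by_key(pairs):
    """Hand-written aggregation: total the amounts per category."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

totals = sum_by_key(parse(line) for line in lines)
print(totals)
```

Nothing here is difficult, but the parsing and the aggregation are both the job author’s responsibility, which is the caveat in question.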
Also, both frameworks claim to support Pig and Hive. We tried running Pig on Spark using the Spork project, but we had some issues. This may mean that full Pig support for Spark is still under construction—we’ll try again in the near future.
The YARN Spin
One major difference is that Spark can run standalone or on top of Hadoop YARN, while Tez can only run on top of YARN, which is what it was designed for. Spark is YARN-compatible, but in practice it can’t run alongside other YARN applications, at least at the moment.
Gopal V, one of the Tez developers, wrote an extensive post about why he likes Tez. He concludes that:
“Between the frameworks I've played with, that is the real differentiating feature of Tez - Tez does not require containers to be kept running to do anything, just the Application Manager running in the idle periods between different queries. You can hold onto containers, but it is an optimization, not a requirement during idle periods for the session.”
By “frameworks” he also means Spark—its containers need to keep running and hog resources even when they aren’t processing any data. Tez containers, however, can shut down as soon as they are finished and release the resources.
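A back-of-the-envelope sketch shows why this matters on a shared cluster. The function below is a toy model, not a measurement of either framework: it counts the container-seconds a session reserves when containers are held through idle gaps versus released between queries. All durations and container counts are invented, and the model ignores the application master and the cost of restarting containers.

```python
def container_seconds(query_durations, idle_gaps, containers, hold_when_idle):
    """Container-seconds reserved by one session (toy model).

    query_durations: seconds each query keeps its containers busy
    idle_gaps: seconds of idle time between consecutive queries
    containers: number of containers the session uses
    hold_when_idle: True if containers stay allocated during idle gaps
    """
    busy = sum(query_durations) * containers
    idle = sum(idle_gaps) * containers if hold_when_idle else 0
    return busy + idle

# Hypothetical interactive session: three short queries, long pauses between.
queries = [30, 45, 20]
gaps = [600, 300]

held = container_seconds(queries, gaps, containers=10, hold_when_idle=True)
released = container_seconds(queries, gaps, containers=10, hold_when_idle=False)
print(held, released)
```

With these made-up numbers, most of the reserved capacity in the "hold" case is idle time that other YARN applications could have used.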
Chances are that you already use Hadoop-based applications such as Hive, HBase or even classic MapReduce. You can install Spark on any Hadoop cluster, but you may run into resource management issues. Tez, on the other hand, could fit quite nicely into your YARN architecture, resource management included.
Performance According to Whom
Spark claims to run 100× faster than MapReduce. Benchmarks performed at UC Berkeley’s AMPLab show that Spark runs much faster than Tez (Spark is noted in the tests as Shark, which is the predecessor to Spark SQL). However, Spark was invented at Berkeley, so the source is hardly neutral. Also, these benchmarks were made over a year ago with Hive 0.12, which runs over MapReduce, while Hive 0.13, which runs over Tez, has significant performance improvements.
Hortonworks ran its own benchmarks. They found that Hive 0.13 running over Tez is up to 100× faster than the previous Hive version (though quite a few test queries mysteriously disappeared). 100× faster? Sounds familiar?
So Spark and Tez both have up to 100× better performance than Hadoop MapReduce. That’s nice, but is Spark faster than Tez or vice versa? Who knows. If you ask someone who works for IBM they’ll say that the answer is neither: IBM Big SQL is the fastest. We really need a third party to run independent performance tests and settle the score, once and for all.
Perhaps it comes down to politics and popularity, a clash of the Big Data titans with Hortonworks rooting for Tez and Cloudera rooting for Spark. The latter is more widespread since it’s available in various distributions, while the former is only available in Hortonworks’ distro. Users may ultimately decide the frameworks’ fate. At the moment, Spark is winning the race by far, at least according to Google Trends.
Maybe after the hype has faded, after people have gained more experience working with both Spark and Tez, we’ll finally be able to tell who will become the heir to the MapReduce crown.