Spark vs. Tez Key Differences
- Tez fits nicely into YARN architecture. Spark may run into resource management issues.
- Spark is more for mainstream developers, while Tez is a framework for purpose-built tools.
- Spark can't run concurrently with YARN applications (yet). Tez is purposefully built to execute on top of YARN.
- Tez's containers can shut down when finished to save resources. Spark's containers hog resources even when not processing data.
Let's talk about the great Spark vs. Tez debate. First, a step back; we’ve pointed out that Apache Spark and Hadoop MapReduce are two different Big Data beasts. The former is a high-performance in-memory data-processing framework, and the latter is a mature batch-processing platform for the petabyte scale. We also know that Apache Hive and HBase are two very different tools with similar functions. Hive is a SQL-like engine that runs MapReduce jobs, while HBase is a NoSQL key/value database on Hadoop.
But what about the question of Spark vs. Tez?
On paper, Spark and Tez have a lot in common. Both possess in-memory capabilities, both can run on top of Hadoop YARN, and both are agnostic about the data types and data sources they work with. So what’s the difference between Spark and Tez?
Table of Contents
- What is Apache Spark?
- What is Apache Tez?
- The Differences Between Spark and Tez
- Do Spark and Tez Support Pig and Hive?
- Using YARN with Spark and Tez
- Spark vs. Tez: What's Faster?
- The Bottom Line
- How Xplenty Can Help
What is Apache Spark?
Apache Spark is an open-source analytics engine and cluster computing framework for processing big data. Spark originated at UC Berkeley's AMPLab and is now maintained by the non-profit Apache Software Foundation, a decentralized organization that works on a variety of open-source software projects.
Open-sourced in 2010, with version 1.0 released in 2014, Spark builds on the ideas behind the Hadoop MapReduce distributed computing framework. Spark preserves many of the benefits of MapReduce, such as scalability and fault tolerance, while also improving speed and ease of use.
In addition to its core data processing engine, Spark includes libraries for SQL, machine learning, and stream processing. The Spark framework is compatible with the Java, Scala, Python, and R programming languages, winning it broad appeal among developers. Spark also supports third-party technologies like Amazon S3, Hadoop's HDFS, MapR XD, and NoSQL databases such as Cassandra and MongoDB.
Spark's appeal comes from its capacity to unite different processes, technologies, and techniques into a single big data pipeline, enhancing productivity and efficiency. Thanks to its flexibility, Spark has become a highly popular and effective "Swiss army knife" for the world of big data processing.
What is Apache Tez?
Like Spark, Apache Tez is an open-source framework for big data processing based on the MapReduce technology. Both Spark and Tez offer an execution engine that is capable of using directed acyclic graphs (DAGs) to process extremely large quantities of data.
Tez generalizes the MapReduce paradigm by treating computations as DAGs. MapReduce tasks combine into a single job that is treated as a node in the DAG, enforcing concurrency and serialization.
Meanwhile, the edges of the DAG represent the movement of data between jobs. Tez is data type-agnostic, which means that it's concerned only with the movement of data (and not the format it takes).
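To make the vertex-and-edge idea concrete, here is a minimal sketch in plain Python. It does not use the Tez API; the vertex names, the `run_dag` helper, and the sample input are all hypothetical. It only illustrates that the runner follows the edges and never inspects the data passing along them.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical vertices: plain functions standing in for processing logic.
def read(_inputs):
    return ["a b", "b c", "a"]

def tokenize(inputs):
    (lines,) = inputs
    return [word for line in lines for word in line.split()]

def count_words(inputs):
    (words,) = inputs
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

def count_lines(inputs):
    (lines,) = inputs
    return len(lines)

def report(inputs):
    word_counts, n_lines = inputs
    return {"lines": n_lines, "distinct_words": len(word_counts)}

vertices = {
    "read": read,
    "tokenize": tokenize,
    "count_words": count_words,
    "count_lines": count_lines,
    "report": report,
}

# Edges: each vertex lists the upstream vertices whose output it consumes.
edges = {
    "read": [],
    "tokenize": ["read"],
    "count_words": ["tokenize"],
    "count_lines": ["read"],
    "report": ["count_words", "count_lines"],
}

def run_dag(vertices, edges):
    """Run every vertex after its upstream vertices, passing outputs along edges."""
    results = {}
    for name in TopologicalSorter(edges).static_order():
        inputs = [results[upstream] for upstream in edges[name]]
        results[name] = vertices[name](inputs)
    return results

results = run_dag(vertices, edges)
print(results["report"])
```

Note that `run_dag` treats every value as opaque: lists, dictionaries, and integers all travel along the edges in the same way, which is the sense in which a DAG engine can be data type-agnostic.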
By addressing some of MapReduce's limitations, Tez seeks to improve the performance of data processing jobs. This added efficiency gives programmers more freedom to make the design and development choices that they believe are best for their project.
The Differences Between Spark and Tez
Apache Spark brands itself as "a unified analytics engine for large-scale data processing." Meanwhile, Apache Tez calls itself "an application framework which allows for a complex directed acyclic graph of tasks for processing data."
Considering the fact that Spark also uses directed acyclic graphs, don’t the two tools sound a bit similar? Maybe. But there are also important points of distinction to consider. Here are the main differences between Apache Spark and Apache Tez:
- Difference #1: Hive and Pig
- Difference #2: Hadoop YARN
- Difference #3: Performance tests
We'll go into more detail about each difference between Spark and Tez in the sections below.
Do Spark and Tez Support Pig and Hive?
Hive and Pig are two open-source Apache software applications for big data. Hive is a data warehouse, while Pig is a platform for creating data processing jobs that run on Hadoop (including on Spark or Tez).
Shaun Connolly, Hortonworks product strategy vice president, differentiates between Spark and Tez by saying that Spark is a general-purpose engine with APIs for mainstream developers, while Tez is a framework for purpose-built tools such as Hive and Pig.
While both Spark and Tez claim to support Pig and Hive, the reality isn't so clear. We tried running Pig on Spark using the Spork project, but we had some issues; it appears that the use of Pig on Spark, at least, is still iffy at best.
Using YARN with Spark and Tez
YARN is Hadoop's resource manager and job scheduler. In theory, Spark can execute either as a standalone application or on top of YARN. Tez, however, has been purpose-built to execute on top of YARN. In practice, though, Spark can't run concurrently with other YARN applications (at least not yet).
Gopal V, one of the developers for the Tez project, wrote an extensive post about why he likes Tez. He concludes that:
“Between the frameworks I've played with, that is the real differentiating feature of Tez - Tez does not require containers to be kept running to do anything, just the Application Manager running in the idle periods between different queries. You can hold onto containers, but it is an optimization, not a requirement during idle periods for the session.”
By “frameworks” he also means Spark—its containers need to keep running and hog resources even when they aren’t processing any data. Tez containers, however, can shut down as soon as they are finished and release the resources.
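The resource argument can be illustrated with a toy calculation. This is a sketch under stated assumptions, not a measurement of either framework: the query timings, the container count, and the `container_seconds_held` helper are all hypothetical, queries are assumed not to overlap, and the cost of the application master and of restarting containers is ignored.

```python
# Hypothetical session: (start_second, duration_seconds) for each query.
queries = [(0, 30), (300, 20), (900, 10)]
containers = 10  # assumed number of containers each query uses

def container_seconds_held(queries, containers, release_when_idle):
    """Total container-seconds reserved over the session (toy model)."""
    if release_when_idle:
        # Containers are held only while a query is actually running.
        busy_seconds = sum(duration for _start, duration in queries)
        return busy_seconds * containers
    # Containers stay reserved from the first query's start to the last query's end.
    session_start = min(start for start, _duration in queries)
    session_end = max(start + duration for start, duration in queries)
    return (session_end - session_start) * containers

released = container_seconds_held(queries, containers, release_when_idle=True)
held = container_seconds_held(queries, containers, release_when_idle=False)
print(released, held)
```

With these made-up numbers, the session that releases idle containers reserves 600 container-seconds, while the session that holds them reserves 9,100. The gap grows with the idle time between queries, which is the situation the quote above describes.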
Chances are that you already use Hadoop-based applications such as Hive, HBase, or even classical MapReduce. You can install Spark on any Hadoop cluster, but you may run into resource management issues. Tez, on the other hand, could fit quite nicely into your YARN architecture, resource management included.
Spark vs. Tez: What's Faster?
Perhaps the biggest question of them all—which is faster when it comes to Spark vs. Tez? According to various benchmarks, both options dramatically improve upon MapReduce performance; however, the winner may depend on who's doing the measuring. The jury's still out in terms of an independent third-party assessment.
Spark claims to run up to 100 times faster than MapReduce. Benchmarks performed at UC Berkeley’s AMPLab show that Spark runs much faster than Tez (the tests refer to Spark as Shark, which is the predecessor to Spark SQL).
Given that Spark originated at Berkeley, however, these tests might not be completely unbiased. Also, these benchmarks were made several years ago with Hive 0.12, which runs over MapReduce. Beginning with version 0.13, Hive can use Tez as its execution engine, which results in significant performance improvements.
Meanwhile, Hortonworks did their own benchmarks on the question of Spark and Tez performance. They found that Hive 0.13 running over Tez works up to 100 times faster than Hive 0.12 (though quite a few test queries mysteriously disappeared). 100 times faster... hmm, sound familiar?
So Spark and Tez both claim up to 100 times better performance than Hadoop MapReduce. But when it comes to Spark vs. Tez, which is faster?
No one can say, or rather, no one will admit it. If you ask someone who works for IBM, they’ll tell you that the answer is neither, and that IBM Big SQL is faster than both. We really need a third party to run independent performance tests and settle the score, once and for all.
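One reason the two "100 times faster" claims cannot settle the question is that each is a ratio against its own baseline. The short sketch below makes that arithmetic explicit. Every timing in it is a made-up placeholder, not a benchmark result, and `speedup` is a hypothetical helper.

```python
def speedup(baseline_seconds, new_seconds):
    """How many times faster the new run is than its own baseline."""
    return baseline_seconds / new_seconds

# Placeholder timings only: two engines, each measured against a different
# baseline on a different workload.
engine_a = speedup(baseline_seconds=1000.0, new_seconds=10.0)
engine_b = speedup(baseline_seconds=5000.0, new_seconds=50.0)

print(engine_a, engine_b)
```

Both ratios come out to 100, yet the underlying runs took 10 and 50 seconds on workloads that were never the same. Only a test that runs both engines on identical queries, data, and hardware can say which one is faster.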
The Bottom Line
The question of Spark vs. Tez may ultimately come down to politics and popularity. It is a clash of the Big Data titans, with Cloudera rooting for Spark and Hortonworks for Tez. Spark is more widespread since it’s available in various distributions, while Tez is only available in Hortonworks’ distro.
In the end, the user bases may decide the frameworks’ fate. At the moment, Spark is winning the race by far, at least according to Google Trends.
Maybe after the hype has faded, after people have gained more experience working with both Spark and Tez, we’ll finally be able to tell who will become the heir to the MapReduce crown.
How Xplenty Can Help
The world of data, including the Spark vs. Tez debate, can be complicated. Xplenty makes things easier. Xplenty utilizes insights gained from both Spark and Tez to provide simple, visualized data pipelines for automated flows across sources and destinations, allowing customers to transform, normalize, and clean data while adhering to compliance best practices.
Looking for some guidance into this world of Spark vs. Tez and everything in between? Get in touch with the Xplenty team today for a chat about your business needs and objectives, or sign up for a free trial of the Xplenty platform.