For years, Hadoop was the undisputed champion of big data—until Spark came along.
Since its 1.0 release in 2014, Apache Spark has been setting the world of big data on fire. With Spark's convenient APIs and promised speeds up to 100 times faster than Hadoop MapReduce, some analysts believe that Spark signals the arrival of a new era in big data.
How can Spark, an open-source data processing framework, crunch all this information so fast? The secret is that Spark runs in-memory on the cluster, and it isn’t tied to Hadoop’s MapReduce two-stage paradigm. This makes repeated access to the same data much faster.
Spark can run as a standalone application or on top of Hadoop YARN, where it can read data directly from HDFS. Dozens of major tech companies such as Yahoo, Intel, Baidu, Yelp, and Zillow are already using Spark as part of their technology stacks.
While Spark seems like it's bound to replace Hadoop MapReduce, you shouldn't count out MapReduce just yet. In this post we’ll compare the two platforms and see if Spark truly comes out on top.
Table of Contents
- What is Apache Spark?
- What is Hadoop MapReduce?
- The Differences Between Spark and MapReduce
- Performance
- Ease of Use
- Cost
- Compatibility
- Data Processing
- Failure Tolerance
- Security
What is Apache Spark?
In its own words, Apache Spark is "a unified analytics engine for large-scale data processing." Spark is maintained by the non-profit Apache Software Foundation, which has released hundreds of open-source software projects. More than 1200 developers have contributed to Spark since the project's inception.
Originally developed at UC Berkeley's AMPLab, Spark was first released as an open-source project in 2010. Spark builds on ideas from Hadoop MapReduce's distributed computing model: it was intended to improve on several aspects of MapReduce, such as performance and ease of use, while preserving many of MapReduce's benefits.
Spark includes a core data processing engine, as well as libraries for SQL, machine learning, and stream processing. With APIs for Java, Scala, Python, and R, Spark enjoys a wide appeal among developers—earning it the reputation of the "Swiss army knife" of big data processing.
What is Hadoop MapReduce?
Hadoop MapReduce describes itself as "a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner."
The MapReduce paradigm consists of two sequential tasks: Map and Reduce (hence the name). Map filters and sorts data while converting it into key-value pairs. Reduce then takes this input and reduces its size by performing some kind of summary operation over the dataset.
MapReduce can drastically speed up big data tasks by breaking down large datasets and processing them in parallel. The MapReduce paradigm was first proposed in 2004 by Google employees Jeff Dean and Sanjay Ghemawat; it was later incorporated into Apache's Hadoop framework for distributed processing.
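The Map and Reduce phases can be illustrated with a toy, single-machine word count in plain Python. This is a sketch of the paradigm only, not Hadoop itself; in a real cluster the shuffle/sort between the phases is done by the framework across many machines.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Map: filter/transform the input, emitting one (word, 1) key-value pair per word
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Shuffle/sort: group the key-value pairs by key, as the framework
    # would between the Map and Reduce phases
    pairs = sorted(pairs, key=itemgetter(0))
    # Reduce: perform a summary operation (here, a sum) per key
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

docs = ["big data is big", "data is data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(pairs)
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

In Hadoop, each document (or file split) would be mapped on a different node in parallel, and each key's group would be reduced on whichever node the shuffle assigns it to.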
The Differences Between Spark and MapReduce
The main differences between Apache Spark and Hadoop MapReduce are:
- Performance
- Ease of use
- Cost
- Data processing
- Security
However, there are also a few similarities between Spark and MapReduce—not surprising, since Spark builds on many of MapReduce's ideas. The points of similarity between Spark and MapReduce include:
- Compatibility
- Failure tolerance
Below, we'll go into more detail about the differences between Spark and MapReduce (and the similarities) in each section.
Spark vs. MapReduce: Performance
Apache Spark processes data in random access memory (RAM), while Hadoop MapReduce persists data back to the disk after a map or reduce action. In theory, then, Spark should outperform Hadoop MapReduce.
Nonetheless, Spark needs a lot of memory. Much like a standard database, Spark loads data into memory and keeps it cached there until told otherwise. If you run Spark on Hadoop YARN alongside other resource-demanding services, or if the data is too big to fit entirely into memory, then Spark could suffer major performance degradation.
MapReduce, on the other hand, kills its processes as soon as a job is done, so it can easily run alongside other services with minor performance differences.
Spark has the upper hand for iterative computations that need to pass over the same data many times. But when it comes to one-pass ETL-like jobs—for example, data transformation or data integration—then that's exactly what MapReduce was designed for.
Bottom line: Spark performs better when all the data fits in memory, especially on dedicated clusters. Hadoop MapReduce is designed for data that doesn’t fit in memory, and can run well alongside other services.
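The iterative advantage can be sketched with a toy cache in plain Python, standing in for Spark's in-memory caching of a dataset versus MapReduce re-reading it from disk on every pass. The iteration count and dataset are illustrative.

```python
disk_reads = 0

def load_dataset():
    # Stand-in for an expensive read from distributed storage
    global disk_reads
    disk_reads += 1
    return list(range(10))

# MapReduce-style: each pass over the data re-reads it from disk
for _ in range(5):
    data = load_dataset()
    total = sum(data)

mapreduce_reads = disk_reads

# Spark-style: load once, cache in memory, and iterate over the cached copy
disk_reads = 0
cached = load_dataset()
for _ in range(5):
    total = sum(cached)

print(mapreduce_reads, disk_reads)  # 5 1
```

For a one-pass ETL job the two approaches read the data once either way, which is why MapReduce holds its own there.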
Spark vs. MapReduce: Ease of Use
Spark has pre-built APIs for Java, Scala, Python, and R, and also includes Spark SQL (formerly known as Shark) for the SQL-savvy. Thanks to Spark's simple building blocks, it's easy to write user-defined functions. Spark even includes an interactive mode for running commands with immediate feedback.
MapReduce is written in Java and is infamously very difficult to program. Apache Pig makes it easier (although it requires some time to learn the syntax), while Apache Hive adds SQL compatibility to the plate. Some Hadoop tools can also run MapReduce jobs without any programming. For example, Xplenty is a data integration service that is built on top of Hadoop and also does not require any programming or deployment.
In addition, MapReduce doesn’t have an interactive mode, although Hive includes a command-line interface. Projects like Apache Impala and Apache Tez aim to bring full interactive querying to Hadoop.
When it comes to installation and maintenance, Spark isn’t bound to Hadoop. Both Spark and Hadoop MapReduce are included in distributions by Hortonworks (HDP 3.1) and Cloudera (CDH 5.13).
Bottom line: Spark is easier to program and includes an interactive mode. Hadoop MapReduce is more difficult to program, but several tools are available to make it easier.
Spark vs. MapReduce: Cost
Spark and MapReduce are open-source solutions, but you still need to spend money on machines and staff.
Both Spark and MapReduce can use commodity servers and run on the cloud. In addition, both tools have similar hardware requirements:
| | Apache Spark | Apache Hadoop balanced workload slaves |
|---|---|---|
| Memory | 8 GB to hundreds of gigabytes | 24 GB |
| Disks | 4–8 | 4–6 one-TB disks |
| Network | 10 Gb or more | 1 Gb Ethernet all-to-all |
The memory in the Spark cluster should be at least as large as the amount of data you need to process, because the data has to fit in-memory for optimal performance. If you need to process extremely large quantities of data, Hadoop will definitely be the cheaper option, since hard disk space is much less expensive than memory space.
On the other hand, considering the performance of Spark and MapReduce, Spark should be more cost-effective. Spark requires less hardware to perform the same tasks much faster, especially on the cloud where compute power is paid per use.
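A back-of-the-envelope sizing calculation makes the trade-off concrete. All of the numbers below are assumptions for illustration (a 10 TB dataset, node specs in the spirit of the table above), not benchmarks.

```python
import math

# Illustrative sizing only; a real cluster needs headroom for the OS,
# shuffle space, and HDFS replication.
dataset_gb = 10_000               # 10 TB of input data (assumed)

spark_ram_per_node_gb = 200       # a "hundreds of gigabytes" Spark node (assumed)
hadoop_disk_per_node_gb = 5_000   # five one-TB disks per Hadoop node (assumed)

# Spark: for optimal performance, the working set should fit in cluster RAM
spark_nodes = math.ceil(dataset_gb / spark_ram_per_node_gb)

# Hadoop: the data only has to fit on disk (replication ignored for simplicity)
hadoop_nodes = math.ceil(dataset_gb / hadoop_disk_per_node_gb)

print(spark_nodes, hadoop_nodes)  # 50 2
```

Whether the much smaller Hadoop cluster is actually cheaper overall depends on how much longer its jobs run, which is why pay-per-use cloud pricing can tip the balance back toward Spark.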
What about the question of staffing? Even though Hadoop has been around since 2005, there is still a shortage of MapReduce experts out there on the market. According to a research report by Gartner, 57 percent of organizations using Hadoop say that "obtaining the necessary skills and capabilities" is their greatest Hadoop challenge.
So what does this mean for Spark, which has only been around since 2010? While Spark may be quicker to learn, it also suffers from a shortage of qualified experts. A market survey by the Taneja Group found that 6 in 10 Spark users cite the "big data skills/training gap" as the greatest challenge to Spark adoption.
The good news is that there is a wide array of Hadoop-as-a-service offerings and Hadoop-based services (like Xplenty's own data integration service), which help alleviate these hardware and staffing requirements. Meanwhile, Spark-as-a-service options are available through providers such as Amazon Web Services.
Bottom line: Spark is more cost-effective according to the benchmarks, though staffing could be more costly. Hadoop MapReduce could be cheaper because more personnel are available, and it's likely less expensive for massive data volumes.
Spark vs. MapReduce: Compatibility
Apache Spark can run as a standalone application, on top of Hadoop YARN or Apache Mesos on-premise, or in the cloud. Spark supports data sources that implement Hadoop InputFormat, so it can integrate with all of the same data sources and file formats that Hadoop supports. Spark also works with business intelligence tools via JDBC and ODBC.
Bottom line: Spark’s compatibility with various data types and data sources is the same as Hadoop MapReduce.
Spark vs. MapReduce: Data Processing
Spark can do more than plain data processing: it can also process graphs, and it includes the MLlib machine learning library. Thanks to its high performance, Spark can do real-time processing as well as batch processing. Spark offers a "one size fits all" platform that you can use rather than splitting tasks across different platforms, which would add to your IT complexity.
Hadoop MapReduce is great for batch processing. If you want a real-time option you’ll need to use another platform like Impala or Apache Storm, and for graph processing you can use Apache Giraph. MapReduce used to have Apache Mahout for machine learning, but it's since been ditched in favor of Spark and H2O.
Bottom line: Spark is the Swiss army knife of data processing, while Hadoop MapReduce is the commando knife of batch processing.
Spark vs. MapReduce: Failure Tolerance
Spark has retries per task and speculative execution, just like MapReduce. Nonetheless, MapReduce has a slight advantage here because it relies on hard drives, rather than RAM. If a MapReduce process crashes in the middle of execution, it can continue where it left off, whereas Spark will have to start processing from the beginning.
Bottom line: Spark and Hadoop MapReduce both have good failure tolerance, but Hadoop MapReduce is slightly more tolerant.
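The recovery difference can be sketched with a toy model in plain Python. It deliberately simplifies both systems (real MapReduce reschedules individual failed tasks, and real Spark can recompute only the lost partitions from RDD lineage); the point is just that on-disk intermediate results let a restarted job skip completed work.

```python
def run_job(checkpointing, crash_after=2, stages=4):
    """Count how many stage executions a job needs if it crashes once
    after `crash_after` stages complete. With checkpointing (MapReduce-style,
    intermediate results persisted to disk) the job resumes where it left
    off; without it (in-memory state lost) it restarts from stage 0."""
    executed = crash_after                     # work done before the crash
    resume_at = crash_after if checkpointing else 0
    executed += stages - resume_at             # work done on the retry
    return executed

print(run_job(checkpointing=True))   # 4: 2 stages before the crash + 2 remaining
print(run_job(checkpointing=False))  # 6: 2 stages before the crash + all 4 again
```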
Spark vs. MapReduce: Security
In terms of security, Spark is less advanced when compared with MapReduce. In fact, security in Spark is set to off by default, which can leave you vulnerable to attack.
Authentication in Spark is supported for RPC channels via a shared secret. Spark includes event logging as a feature, and Web UIs can be secured via javax servlet filters. In addition, because Spark can run on YARN and use HDFS, it can also enjoy Kerberos authentication, HDFS file permissions, and encryption between nodes.
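These features map to configuration properties in `spark-defaults.conf`. A minimal sketch (the property names come from Spark's configuration documentation; the secret value is a placeholder, and on YARN Spark can generate the secret automatically):

```properties
# Enable shared-secret authentication for Spark's internal RPC channels
spark.authenticate          true
# Shared secret used by the RPC channels (placeholder value)
spark.authenticate.secret   change-me
# Record application events so finished jobs can be reviewed later
spark.eventLog.enabled      true
```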
Hadoop MapReduce can enjoy all the Hadoop security benefits and integrate with Hadoop security projects, like Knox Gateway and Apache Sentry. Project Rhino, which aims to improve Hadoop’s security, only mentions Spark in regards to adding Sentry support. Otherwise, Spark developers will have to improve Spark security themselves.
Bottom line: Spark security is still less developed versus MapReduce, which has more security features and projects.
Apache Spark is the shiny new toy on the big data playground, but there are still use cases for using Hadoop MapReduce.
Spark has excellent performance and is highly cost-effective, thanks to its in-memory data processing. It’s compatible with all of Hadoop’s data sources and file formats, and it's also quicker to learn, with friendly APIs available for multiple programming languages. Spark even includes graph processing and machine learning capabilities.
Hadoop MapReduce is a more mature platform, and it was purpose-built for batch processing. MapReduce can be more cost-effective than Spark for extremely large data that doesn’t fit in memory, and it might be easier to find employees with experience in MapReduce. Furthermore, the MapReduce ecosystem is currently bigger thanks to many supporting projects, tools and cloud services.
But even if you think Spark looks like the winner here, chances are you won’t use it on its own. You still need HDFS to store the data, and you may want to use HBase, Hive, Pig, Impala or other Hadoop projects. This means you’ll still need to run Hadoop and MapReduce alongside Spark for the full big data package.
Originally published: March 11th, 2019