Apache Spark is setting the world of Big Data on fire. With a promise of speeds up to 100 times faster than Hadoop MapReduce and comfortable APIs, some think this could be the end of Hadoop MapReduce.
How can Spark, an open-source data-processing framework, process data so fast? The secret is that it runs in-memory on the cluster, and that it isn’t tied to Hadoop’s MapReduce two-stage paradigm. This makes repeated access to the same data much faster.
Spark can run as a standalone or on top of Hadoop YARN, where it can read data directly from HDFS. Companies like Yahoo, Intel, Baidu, Trend Micro and Groupon are already using it.
Sounds like Spark is bound to replace Hadoop MapReduce. Or is it? In this post we’ll compare the two platforms and see if Spark truly comes out on top of the elephant.
Performance

Apache Spark processes data in-memory, while Hadoop MapReduce persists back to the disk after a map or reduce action, so Spark should outperform Hadoop MapReduce.
Nonetheless, Spark needs a lot of memory. Much like a standard database, it loads data into memory and keeps it there until further notice, for the sake of caching. If Spark runs on Hadoop YARN alongside other resource-demanding services, or if the data is too big to fit entirely into memory, then Spark's performance can degrade significantly.
MapReduce, however, kills its processes as soon as a job is done, so it can easily run alongside other services with minor performance differences.
Spark has the upper hand as long as we’re talking about iterative computations that need to pass over the same data many times. But when it comes to one-pass, ETL-like jobs, for example data transformation or data integration, MapReduce is the better fit: this is exactly what it was designed for.
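A minimal, pure-Python sketch of that distinction (this is not Spark or MapReduce code, just the two access patterns): the one-pass job touches the data once and is done, while the iterative job pays off when the data is loaded a single time and kept in memory.

```python
def load_points():
    """Stand-in for reading a dataset; each call represents a disk read."""
    return [1.0, 2.0, 4.0, 8.0, 16.0]

# One-pass, ETL-like job: a single transformation over the data, then
# done -- MapReduce territory.
transformed = [p * 2 for p in load_points()]

# Iterative job: many passes over the SAME data -- Spark territory.
# Load once and keep in memory (what Spark's caching buys you); a
# MapReduce-style runner would re-read from disk on every pass.
points = load_points()
center = 0.0
for _ in range(20):
    grad = sum(center - p for p in points) / len(points)
    center -= 0.5 * grad  # converges toward the mean of the data, 6.2
```

Each of the 20 passes above reuses the in-memory `points` list; replace that single `load_points()` call with one call per pass and you have the disk-bound shape of the same computation.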
Bottom line: Spark performs better when all the data fits in the memory, especially on dedicated clusters; Hadoop MapReduce is designed for data that doesn’t fit in the memory and it can run well alongside other services.
Ease of Use
Spark has comfortable APIs for Java, Scala and Python, and also includes Spark SQL (formerly known as Shark) for the SQL savvy. Thanks to Spark’s simple building blocks, it’s easy to write user-defined functions. It even includes an interactive mode for running commands with immediate feedback.
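As a rough illustration of what those building blocks feel like, here is a toy, pure-Python collection that mimics the shape of Spark's RDD API (`MiniRDD` is a made-up class for illustration only, not Spark code): plain functions plug straight into the pipeline, which is why user-defined functions are so easy to write.

```python
class MiniRDD:
    """Toy chainable collection mimicking the SHAPE of Spark's RDD API."""

    def __init__(self, items):
        self.items = list(items)

    def flatMap(self, f):
        # Apply f to each item and flatten the results into one collection.
        return MiniRDD(x for item in self.items for x in f(item))

    def map(self, f):
        return MiniRDD(f(item) for item in self.items)

    def reduceByKey(self, f):
        # Combine values that share a key, two at a time, using f.
        out = {}
        for k, v in self.items:
            out[k] = f(out[k], v) if k in out else v
        return MiniRDD(out.items())

    def collect(self):
        return sorted(self.items)

lines = MiniRDD(["spark and hadoop", "hadoop and mapreduce"])
# The whole word count is one pipeline of ordinary Python functions:
counts = (lines.flatMap(str.split)
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
```

The real Spark word count in Python reads almost exactly like this pipeline, which is the point: the API's building blocks compose directly.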
Hadoop MapReduce is written in Java and is infamous for being very difficult to program. Pig makes it easier, though it takes some time to learn the syntax, and Hive adds SQL to the mix. Some Hadoop tools can also run MapReduce jobs without any programming. Xplenty, for example, is a data integration service built on top of Hadoop that requires no programming or deployment at all.
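For contrast, here is a sketch of what even a simple word count involves in MapReduce's two-stage model, written Hadoop Streaming-style in Python. On a real cluster the mapper and reducer would be separate scripts reading stdin and the framework would handle the shuffle; this simplified stand-in makes the ceremony visible end to end.

```python
def mapper(lines):
    """Map stage: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce stage: pairs arrive grouped by key; sum counts per word."""
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

lines = ["spark and hadoop", "hadoop and mapreduce"]
# Between the two stages, Hadoop's shuffle sorts map output by key.
shuffled = sorted(mapper(lines))
counts = reducer(shuffled)
```

Two separate stages plus a framework-managed shuffle is the minimum for any MapReduce job; more complex logic means chaining several such jobs together.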
MapReduce doesn’t have an interactive mode, although Hive includes a command-line interface. Projects like Impala, Presto and Tez aim to bring full interactive querying to Hadoop.
When it comes to installation and maintenance, Spark isn’t bound to Hadoop, yet both Spark and Hadoop MapReduce are included in distributions by Hortonworks (HDP 2.2) and Cloudera (CDH 5).
Bottom line: Spark is easier to program and includes an interactive mode; Hadoop MapReduce is more difficult to program but many tools are available to make it easier.
Costs

Both Spark and Hadoop MapReduce are open source, but money still needs to be spent on machines and staff.
They can both use commodity servers and run on the cloud. They seem to have similar hardware requirements:
| | Apache Spark | Apache Hadoop balanced workload slaves |
|---|---|---|
| Memory | 8 GB to hundreds of gigabytes | 24 GB |
| Disks | 4–8 | 4–6 one-TB disks |
| Network | 10 Gb or more | 1 Gb Ethernet all-to-all |
The memory in the Spark cluster should be at least as large as the amount of data you need to process, because the data has to fit into the memory for optimal performance. So, if you need to process really Big Data, Hadoop will definitely be the cheaper option since hard disk space comes at a much lower rate than memory space.
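A back-of-the-envelope sketch of that trade-off. The per-GB prices below are placeholder assumptions, only there to show the shape of the arithmetic; the 3x factor is HDFS's default replication.

```python
DATA_GB = 10_000              # dataset size to process (assumed)
PRICE_PER_GB_RAM = 5.00       # $/GB of RAM  (placeholder price)
PRICE_PER_GB_DISK = 0.05      # $/GB of disk (placeholder price)

# Spark-style sizing: cluster memory at least as large as the data.
spark_memory_cost = DATA_GB * PRICE_PER_GB_RAM

# MapReduce-style sizing: data lives on disk, stored 3x for HDFS
# replication.
mapreduce_disk_cost = DATA_GB * 3 * PRICE_PER_GB_DISK

# Even with triple replication, disk comes out far cheaper per GB.
ratio = spark_memory_cost / mapreduce_disk_cost
```

The exact ratio depends entirely on the prices you plug in; the point is only that it grows with the dataset, since RAM is priced an order of magnitude or two above disk.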
On the other hand, considering Spark’s benchmarks, it should be more cost-effective since less hardware can perform the same tasks much faster, especially on the cloud where compute power is paid per use.
As for staffing, even though Hadoop has been around since 2005, there is still a shortage of MapReduce experts on the market. What does this mean for Spark, which has only been around since 2010? Maybe it has a faster learning curve, but skilled Spark ninjas are still far scarcer than their Hadoop MapReduce counterparts.
Furthermore, there is a wide array of Hadoop-as-a-service offerings and Hadoop-based services (like our own Xplenty’s data integration service), which help to skip the hardware and staffing requirements. In comparison, there are few Spark-as-a-service options and they are all very new.
Bottom line: Spark is more cost-effective according to the benchmarks, though staffing could be more costly; Hadoop MapReduce could be cheaper because more personnel are available and because of Hadoop-as-a-service offerings.
Compatibility

Apache Spark can run as standalone or on top of Hadoop YARN or Mesos, on-premise or in the cloud. It supports data sources that implement Hadoop InputFormat, so it can integrate with all the data sources and file formats that are supported by Hadoop. According to the Spark website, it also works with BI tools via JDBC and ODBC. Hive and Pig integration are on the way.
Bottom line: Spark’s compatibility with data types and data sources is the same as Hadoop MapReduce’s.
Data Processing

Apache Spark can do more than plain data processing: it can process graphs and use the existing machine-learning libraries. Thanks to its high performance, Spark can do real-time processing as well as batch processing. This presents an interesting opportunity to use one platform for everything instead of having to split tasks across different platforms, all of which require learning and maintenance.
Hadoop MapReduce is great for batch processing. If you want a real-time option you’ll need to use another platform like Storm or Impala, and for graph processing you can use Giraph. MapReduce used to have Apache Mahout for machine learning, but the elephant riders have ditched it in favor of Spark and h2o.
Bottom line: Spark is the Swiss army knife of data processing; Hadoop MapReduce is the commando knife of batch processing.
Failure Tolerance

Spark has per-task retries and speculative execution, just like MapReduce. Nonetheless, because MapReduce persists intermediate results to hard drives, a job that crashes in the middle of execution can continue where it left off, whereas Spark has to start processing from the beginning. This can save MapReduce time on recovery.
Bottom line: Spark and Hadoop MapReduce both have good failure tolerance, but Hadoop MapReduce is slightly more tolerant.
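The recovery difference comes down to simple arithmetic; the numbers below are invented for illustration.

```python
# Toy sketch of the recovery cost after a mid-job crash.
items = list(range(10))  # 10 units of work in the partition
CRASH_AT = 7             # the worker dies after finishing 7 of them

# MapReduce-style: each finished unit was flushed to disk, so the retry
# resumes at unit 7 and only redoes the remaining 3.
mapreduce_total_work = CRASH_AT + (len(items) - CRASH_AT)

# Spark-style (in-memory, no explicit checkpoint): the partial results
# die with the process, so the retry recomputes the whole partition.
spark_total_work = CRASH_AT + len(items)
```

Here the disk-backed run does 10 units of work in total versus 17 for the restart-from-scratch run; the gap widens the later the crash happens.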
Security

Spark is a bit bare at the moment when it comes to security. Authentication is supported via a shared secret, the web UI can be secured via javax servlet filters, and event logging is included. Spark can run on YARN and use HDFS, which means that it can also enjoy Kerberos authentication, HDFS file permissions and encryption between nodes.
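As a rough sketch, those knobs live in `spark-defaults.conf`; the property names come from Spark’s configuration documentation, while the values here are placeholders.

```properties
# spark-defaults.conf -- minimal security-related settings
# (all values below are placeholders, not recommendations)
spark.authenticate          true
spark.authenticate.secret   <shared-secret>           # shared secret between daemons
spark.ui.filters            com.example.MyAuthFilter  # javax servlet filter (hypothetical class)
spark.eventLog.enabled      true
spark.eventLog.dir          hdfs:///spark-logs
```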
Hadoop MapReduce can enjoy all the Hadoop security benefits and integrate with Hadoop security projects, like Knox Gateway and Sentry. Project Rhino, which aims to improve Hadoop’s security, only mentions Spark in regards to adding Sentry support. Otherwise, Spark developers will have to improve Spark security themselves.
Bottom line: Spark security is still in its infancy; Hadoop MapReduce has more security features and projects.
Conclusion

Apache Spark is the shiny new toy on the Big Data playground, but there are still use cases for using Hadoop MapReduce.
Spark has excellent performance and is highly cost-effective thanks to in-memory data processing. It’s compatible with all of Hadoop’s data sources and file formats, and thanks to friendly APIs that are available in several languages, it also has a faster learning curve. Spark even includes graph processing and machine-learning capabilities.
Hadoop MapReduce is a more mature platform and it was built for batch processing. It can be more cost-effective than Spark for truly Big Data that doesn’t fit in memory and also due to the greater availability of experienced staff. Furthermore, the Hadoop MapReduce ecosystem is currently bigger thanks to many supporting projects, tools and cloud services.
But even if Spark looks like the big winner, the chances are that you won’t use it on its own—you still need HDFS to store the data and you may want to use HBase, Hive, Pig, Impala or other Hadoop projects. This means you’ll still need to run Hadoop and MapReduce alongside Spark for a full Big Data package.