(Image by Gabriel Pollard, Some rights reserved)
This comes as no surprise. Apache Spark, an open-source data analytics cluster computing framework, promises to deliver up to 100 times better performance than MapReduce. Spark can run standalone or on top of YARN where it can read data from HDFS.
How does Spark perform so well? The answer is that it isn't tied to the two-stage MapReduce paradigm and keeps intermediate data in memory across the cluster. Yahoo and Intel are already using it, and soon so will you: thanks to YARN, standard Hadoop frameworks such as Oozie will also be able to integrate directly with Spark. Such promises of performance and compatibility are the stuff that data science dreams are made of.
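To make the difference concrete, here is a toy Python sketch (not real Spark or MapReduce code, just an illustration of the execution models): a MapReduce-style pipeline writes intermediate results to disk between stages and re-reads them, while a Spark-style pipeline keeps the working set cached in memory across iterations.

```python
import json
import os
import tempfile

data = list(range(1, 6))

# MapReduce style: each stage reads its input from "disk" and writes
# its output back to "disk" before the next stage can start.
def mapreduce_style(records, iterations):
    path = os.path.join(tempfile.mkdtemp(), "stage.json")
    with open(path, "w") as f:
        json.dump(records, f)
    for _ in range(iterations):
        with open(path) as f:              # re-read from disk every pass
            current = json.load(f)
        current = [x * 2 for x in current]  # the "map" stage
        with open(path, "w") as f:         # write back for the next stage
            json.dump(current, f)
    with open(path) as f:
        return json.load(f)

# Spark style: the working set stays cached in memory (analogous to
# calling cache() on an RDD), so each iteration transforms the
# previous in-memory result directly, with no disk round-trips.
def spark_style(records, iterations):
    cached = records
    for _ in range(iterations):
        cached = [x * 2 for x in cached]
    return cached

print(mapreduce_style(data, 3))  # [8, 16, 24, 32, 40]
print(spark_style(data, 3))      # same result, no disk I/O between passes
```

Both functions compute the same answer; the point is that for iterative workloads the disk round-trip between stages is pure overhead, which is exactly what Spark's in-memory caching eliminates.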
Before you throw away your mappers and reducers, keep in mind that Apache Spark may not work well under stress. Sure, Spark can query in-memory data at the speed of light. However, when many users run many queries, Spark's system resources may get exhausted just as fast. MapReduce doesn't suffer from the same problem: it releases memory as soon as it finishes processing the data. The radio star might not be dead just yet.
YARN Week Summary
YARN had a mixed first year. We suspect that not many organizations upgraded to YARN, given the fears and difficulties involved, not to mention the need to learn how to manage resources the YARN way. But YARN was definitely good news for Hadoop, opening it up to many applications beyond MapReduce. One of these applications, Apache Spark, could be the new poster boy for Big Data, but MapReduce may stick around for a while. Who knows what the next major Hadoop version will bring.
(Part of YARN week: a three post series about YARN's past, present and future)