These days, there is a renaissance of Hadoop-based Big Data projects: Impala, Spark, Storm, Flink and HBase, as well as several SQL-on-Hadoop tools. Most of these projects are still in their infancy, though (some are still in the Apache Incubator), so they're used mostly by early adopters and haven't become an industry standard. Yet.
But the question is this: do Hadoop-based projects need YARN? HBase was released before YARN existed, and Spark and Storm can run without it. YARN is supposed to govern resources for these applications in one centralized place, something that wasn't available before. So YARN can do a whole lot more than MapReduce, but how many applications actually support it?
In fact, we're not sure just how many organizations out there use YARN these days. Many companies have probably stuck with their existing Hadoop setup, trusting the old engineering adage: "If it ain't broke, don't fix it." Chances are that if you're starting a new Hadoop project today you'll use YARN; otherwise you'll stay with your current elephant until further notice.
This plethora of projects seems appealing, but it has created a major issue: figuring out what each engine is good for. Impala, Spark, Tez, Presto and company have a lot in common, and all of them require plenty of memory. So when should each be used, and how well do they play together? Questions such as "When would someone use Apache Tez instead of Apache Spark?" or "How does Impala compare to Shark?" seem to be popping up everywhere. We're curious to see what will happen to all these engines after the Big Data dust settles.
(Part of YARN week: a three post series about YARN's past, present and future)