Hadoop YARN Turns One: Upgrading to YARN


(Image by Street knitting Avilés [Rebes], Some rights reserved)

Happy birthday, YARN! Hadoop’s second major version, released one year ago, was big news for the Big Data world. It brought several major enhancements, the biggest of which was the new YARN (Yet Another Resource Negotiator) layer on top of HDFS. By separating resource management from MapReduce, YARN let data in Hadoop’s Distributed File System be crunched in new and exciting ways beyond good old MapReduce.

YARN’s promise was big and exciting, but did it deliver? Now that we’ve spent a year with the baby elephant, we’re happy to announce YARN Week: a three-post series about our YARN thoughts and experiences. First post: upgrading to YARN.

Upgrading to YARN

Upgrading to YARN wasn’t easy. So much configuration, testing and fine-tuning were needed that we felt like we were setting up Hadoop from scratch. It took plenty of time to find out whether YARN suited our needs, and although it brought in a lot of improvements, some of them weren’t welcome.

The new version of MapReduce, MRv2, required us to reframe the way we thought about running jobs. In MRv1 we could set the number of slots available for map tasks versus slots available for reduce tasks. MRv2 threw that out the window: instead, it allocates a general pool of resources shared by jobs as a whole. This means that in YARN, mappers could take up all the resources, block reducers from processing the data and slow down overall job execution. Or vice versa: certain reducer operations, such as sorting, require all the data before they can work their magic, so such reducers can hog cluster resources until they receive the entire data set and keep mappers from processing the data.
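To make the shift concrete, here is a minimal sketch of how an MRv2 job asks for resources per task instead of relying on fixed slots. The property names are the standard MRv2/YARN settings; the values are illustrative assumptions, not our production numbers:

    <!-- mapred-site.xml: per-task container requests in MRv2 (illustrative values) -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>1536</value>    <!-- container memory requested for each map task -->
    </property>
    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>3072</value>    <!-- container memory requested for each reduce task -->
    </property>
    <property>
      <name>mapreduce.map.java.opts</name>
      <value>-Xmx1228m</value>    <!-- JVM heap kept below the container size -->
    </property>
    <property>
      <name>mapreduce.job.reduce.slowstart.completedmaps</name>
      <value>0.80</value>    <!-- hold reducers back until 80% of maps finish,
                                  so early reducers can't starve the mappers -->
    </property>

Tuning the slowstart fraction is one way to soften the mapper-versus-reducer tug-of-war described above, though the right value depends on the workload.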

All in all, we felt that YARN didn’t have as much mileage as MRv1, so not every nook and cranny had been explored as thoroughly. It could also be that small clusters simply don’t gain a lot from YARN. MRv1 allowed us to overdrive the system by configuring more mapper/reducer slots than the hardware could actually run at once; since the system never reached full capacity anyway, slots were always available. This is harder to do in YARN, which is quite strict about resource management, as the sketch below shows.
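Here is a hedged side-by-side sketch of the two approaches. The property names are the standard MRv1 and YARN ones; the numbers are purely illustrative assumptions about a small worker node:

    <!-- MRv1 (mapred-site.xml): slots are just counters per TaskTracker,
         so they can be overcommitted beyond what the hardware strictly supports -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>12</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>6</value>
    </property>

    <!-- YARN (yarn-site.xml): the NodeManager advertises a fixed resource budget,
         and the scheduler will not hand out containers beyond it -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>8192</value>
    </property>
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>1024</value>
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>8192</value>
    </property>

With MRv1, nothing stops you from declaring more slots than you have cores or memory; with YARN, once the advertised memory is spoken for, further containers simply wait.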

(Part of YARN Week: a three-post series about YARN's past, present and future)
