Installing, configuring, and maintaining Apache Hadoop requires a high level of expertise. Unfortunately, the job market is short on professionals who have these skills. Amazon launched Elastic MapReduce (EMR) to make things easier, but there were still too many Hadoop hoops to jump through before processing Big Data. That’s why we founded Xplenty, a Data Integration-as-a-Service platform on the cloud that lets you use Hadoop without coding or deployment. Since we both claim to make working with Big Data easier, we decided to run a quick comparison of Xplenty vs. EMR.
Have a look at Xplenty for yourself: Create an account and take the platform for a free 14-day spin.
EMR and Xplenty are based on an infrastructure of Amazon EC2 instances, allowing them to take advantage of Hadoop’s ability to scale horizontally on the cloud. Clusters can grow as needed or even be discarded as soon as the data has finished processing. In this sense both are highly scalable.
There is one difference between the two. Depending on the type of account, Xplenty limits how many nodes can be added per cluster. EMR, however, lets you use as many instances as your account limits allow. The default limit is 20 instances per region, and it can be raised by submitting a request to Amazon.
Ease of Use
Even though EMR saves some installation and configuration hassle, it does not remove it all. EMR enables easy deployment of Pig, Hive, HBase, and Ganglia, but other packages have to be installed manually. Pretty soon you will need workflow management (e.g. Oozie), or relational database integration (Sqoop). It’s up to you to set them up. Also, debugging with EMR can be difficult since full logs need to be inspected in order to understand what went wrong. Basically, techie skills are still required to fully utilize Hadoop on EMR.
With Xplenty, no installation is required; you just sign up for an account. Workflow management and integration with relational databases are already included. Xplenty provides a full user interface to provision clusters and design ETL flows. This means that no techie skills are required, and even data integration can be done without writing any code. Debugging with Xplenty is much simpler as well, since it provides concise error logs and validates packages before a cluster is created.
As mentioned above, both services are based on EC2 instances. The only difference is that EMR allows selecting instance types while Xplenty comes with a default instance type (m1.xlarge). Therefore, comparing Xplenty with EMR using the same type and number of instances should show similar performance.
Xplenty’s prices start at $499 a month for 250 node hours using m1.xlarge instances. On Amazon, these instances cost $0.350 per hour for EC2 plus $0.088 per hour for EMR, for a total of $0.438 per hour.
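To put the two price points on the same footing, here is a small sketch that works out the effective hourly rates from the figures quoted above. The function names are ours, made up for illustration, and the calculation assumes the full 250 node hours are used and ignores storage, data transfer, and any other charges.

```python
from decimal import Decimal

# Figures quoted in this post; everything else here is illustrative.
XPLENTY_MONTHLY_USD = Decimal("499")
XPLENTY_NODE_HOURS = 250
EC2_HOURLY_USD = Decimal("0.350")   # m1.xlarge EC2 charge per hour
EMR_HOURLY_USD = Decimal("0.088")   # EMR surcharge per hour


def emr_hourly_rate() -> Decimal:
    """Combined EC2 + EMR price for one m1.xlarge instance hour."""
    return EC2_HOURLY_USD + EMR_HOURLY_USD


def emr_cost(node_hours: int) -> Decimal:
    """Infrastructure-only cost of running the given node hours on EMR."""
    return emr_hourly_rate() * node_hours


def xplenty_hourly_rate() -> Decimal:
    """Effective price per node hour if the whole plan is consumed."""
    return XPLENTY_MONTHLY_USD / XPLENTY_NODE_HOURS


if __name__ == "__main__":
    print("EMR per node hour:     $", emr_hourly_rate())
    print("EMR for 250 hours:     $", emr_cost(XPLENTY_NODE_HOURS))
    print("Xplenty per node hour: $", xplenty_hourly_rate())
```

On these numbers, 250 node hours on EMR come to $109.50 of infrastructure, while the same hours on Xplenty work out to just under $2.00 per node hour.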
EMR is cheaper than Xplenty, but it only provides the infrastructure. Xplenty also provides the ability to easily write applications without having to learn Hadoop, thus saving the salaries of a few Hadoop engineers and training courses.
Data is transferred securely in EMR between S3 and the EC2 clusters via SSH. EMR provides user access control via Amazon’s IAM, an engine for managing user identity and permissions. It also has security groups that control which instances can exchange traffic with each other.
Xplenty, which runs Hadoop on EC2, also transfers data securely via SSH. User permissions are less granular, though: once a user has been added to an account, they have the same access level as every other user. Xplenty requires credentials with minimal access, read or read/write where applicable.
EMR works with Amazon S3 by default. If the data is stored in a relational database or a NoSQL database, it is the user’s responsibility to install the right packages and set up the integration. If the data sits on another cloud service, some custom code may have to be written, and it won’t be pretty. The data pipelining feature, which is used for workflow management, may be of help.
When it comes to cloud storage, Xplenty integrates with Amazon S3 and is preconfigured to work with relational databases and with MongoDB. In fact, Xplenty can run Hadoop not just on Amazon Web Services, but also on IBM SoftLayer and Rackspace, and can read and write data from their cloud storage.
EMR users can choose EC2 instance types and Hadoop distributions and, if they have the right know-how, install additional packages and tools. A command line interface and an API are both available.
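To give a feel for what that API-level customization looks like, here is a rough sketch that uses the boto3 library's `run_job_flow` call to request a cluster. Treat it as an illustration under assumptions rather than a recipe: the helper names, the release label, the instance type, the region, and the default role names are all placeholders we picked, and the launch function needs boto3 and valid AWS credentials to do anything.

```python
def build_cluster_request(name, instance_type, instance_count, applications):
    """Assemble keyword arguments for boto3's EMR run_job_flow call.

    The release label and role names below are assumptions; adjust them
    to whatever your own AWS account actually uses.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release, not from this post
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Shut the cluster down once its steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Applications": [{"Name": app} for app in applications],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Send the request to EMR and return the new cluster's ID."""
    import boto3  # imported here so the builder works without boto3 installed

    client = boto3.client("emr", region_name=region)
    return client.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    # Only builds and prints the request; nothing is launched or billed.
    print(build_cluster_request("demo-cluster", "m5.xlarge", 3, ["Hive", "Pig"]))
```

Keeping the request builder separate from the launch call means the configuration can be reviewed before anything is actually started.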
Xplenty comes with a user interface that doesn’t offer any backend access. It comes ready with a Hadoop distribution, Oozie for workflow management, and Sqoop for integrating with relational databases. Additional low-level customization is currently not in the cards, but an API is available to manage jobs and clusters.
Both EMR and Xplenty make data integration on the cloud easier. EMR is suited for companies with high budgets and Hadoop engineers who want high customization and do not mind a bit of maintenance and coding MapReduce jobs. Xplenty is great for those who want to start processing Big Data without having to deal with any backend mumbo jumbo. At the end of the day, both rely on Hadoop technology that runs on Amazon’s EC2 servers on the cloud, so they are both secure and scalable solutions.