Is Apache Spark as hot as you think it is? Although it shines so brightly across the Big Data galaxy that some folks think it may have killed the MapReduce star, Spark is still in its teenage years and has yet to fully mature as a platform. To find out the current state of Spark, we talked to Big Data specialist Uzy Hadad, founder of Inroid, which provides consultancy and training in the field.

Why, in your opinion, is Apache Spark so popular these days?

Uzy: Spark is the second generation of distributed operating systems, following Hadoop. Spark’s developers took most of the good parts from Hadoop, and improved or changed the bad.

Some of Spark’s advantages: running in-memory or on disk or both; more control over sorting operations (Apache Hadoop has a default sorting phase by key); a rich and simple API; Python support; aggregation control per partition, which means less shuffle; and full integration between the SQL engine and other engines with the Spark core engine.

All in all, Spark provides a unified stack on a single platform with batch processing, near real-time querying, SQL support, machine learning capabilities, as well as streaming and graph support.

How many system resources does Spark require?

Uzy: To know how many system resources Spark requires, one should understand how it works.

Spark keeps the working dataset in-memory as a collection of items. This is called a Resilient Distributed Dataset (RDD). RDDs are composed of partitions and the partitions are distributed across the cluster nodes.

A Spark job is composed of stages. Each stage is composed of tasks—Spark’s basic processing unit. Each task processes one RDD partition, with one task running per CPU core. Shuffle tasks move data between nodes and write intermediate output to the local disk for fault tolerance.
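The partition-per-task idea above can be sketched in plain Python, with no Spark required. The hash partitioner, the partition count and the summing "task" are all illustrative assumptions, not Spark internals:

```python
# Minimal sketch: hash-partition a dataset and run one "task" per partition,
# mimicking how Spark assigns each RDD partition to exactly one task.

NUM_PARTITIONS = 4  # illustrative; Spark derives this from the RDD


def partition(records, num_partitions):
    """Split (key, value) records into partitions by hashing the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def run_task(part):
    """A 'task' processes exactly one partition, here by summing its values."""
    return sum(v for _, v in part)


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
partitions = partition(records, NUM_PARTITIONS)
results = [run_task(p) for p in partitions]  # one task per partition
total = sum(results)  # 10
```

In real Spark the partitions live on different cluster nodes and the tasks run in parallel; here they run sequentially just to show the data flow.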

Hence, Spark is primarily memory-bound. Once the data fits in memory, Spark becomes disk-, network- and CPU-bound.

So, the recommended memory is 8GB to 200GB per worker. As for network resources: since shuffle tasks make heavy use of the cluster network, for best performance it’s recommended to use 10 Gb/s network connections per node, or faster.

Recommended CPUs: 8-16 cores per node. Disks: 4-8 disks per node—disk size depends on the amount of data, the number of nodes and whether you plan to clean the disks from time to time. It’s recommended to configure Spark to use separate directories on your local disks (via the spark.local.dir setting). This way, the more disks you have, the better performance you will get in the shuffle phase.
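As a back-of-the-envelope check, the per-node recommendations above can be turned into executor sizing with simple arithmetic. The concrete node figures and the cores-per-executor choice below are illustrative assumptions, not recommendations from the interview:

```python
# Illustrative sizing arithmetic using the ranges quoted above.
node_cores = 16         # within the recommended 8-16 cores per node
node_memory_gb = 64     # within the 8GB-200GB-per-worker range
cores_per_executor = 4  # an assumed mid-size choice

executors_per_node = node_cores // cores_per_executor          # 4
memory_per_executor_gb = node_memory_gb // executors_per_node  # 16
```

In practice you would also reserve some cores and memory for the OS and any Hadoop daemons running on the same node.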

What are the current challenges for running Spark?

Uzy: There are several challenges, the first of which is sharing resources dynamically in multi-tenant environments. In version 1.1.1, when running Spark on a cluster, resource allocations are fixed during the application runtime. In version 1.2, there is a new ability to use dynamic resource allocation when running on YARN (this will be extended to Mesos and standalone in a future release). Since version 1.2 is new, there have been reports that the dynamic allocation has a blocker bug at the moment, so it will take a while to evaluate this new and important option.

Another issue is that shuffle operations generate many intermediate results which are saved on disk—a very costly operation. This might affect disks that are also being used by Hadoop.

Finally, the Spark driver does not have high availability. The driver coordinates a Spark application and communicates with the workers. In the current version, the driver is a single point of failure, and this can be a bit of a problem.

How can these challenges be solved?

Uzy: Be careful with the shuffling process when writing Spark applications, and manage your partitions optimally: for example, use reduceByKey or combineByKey instead of groupByKey.
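The difference can be simulated in plain Python: combining values per key inside each partition before the shuffle (what reduceByKey/combineByKey do) sends far fewer records across the network than shuffling every raw record (what groupByKey does). This is a conceptual sketch, not Spark code:

```python
from operator import add

# Two "partitions" holding raw (key, value) records on different nodes.
partitions = [
    [("a", 1), ("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]


def combine_locally(part, func):
    """Map-side combine: merge values per key within a single partition."""
    acc = {}
    for k, v in part:
        acc[k] = func(acc[k], v) if k in acc else v
    return list(acc.items())


# groupByKey-style: every raw record crosses the network.
shuffled_raw = sum(len(p) for p in partitions)  # 7 records

# reduceByKey-style: at most one record per key per partition is shuffled.
combined = [combine_locally(p, add) for p in partitions]
shuffled_combined = sum(len(p) for p in combined)  # 4 records
```

With many values per key, the map-side combine shrinks shuffle traffic dramatically, which is exactly why reduceByKey is preferred over groupByKey for aggregations.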

Currently there is an extensive effort in the Spark community to improve the shuffling issues. Some of the suggestions for improvements have already been implemented in versions 1.1.0 and 1.2; others are in progress.

You can also enhance cluster usage by tuning the number of executors and the executor memory size. Too many executors or too much memory per executor can leave CPUs idle, while too few executors or too little memory can result in poor performance.

As for the Spark driver: currently, an application holds all of the resources allocated to it from the cluster until it’s stopped. Therefore, you should manually take care of allocating resources for the Spark shell, Spark SQL, JDBC connections and so on. And since the driver has no built-in high availability, you should take this into account when designing your architecture.

How well does Spark currently support SQL?

Uzy: Although Spark SQL is in alpha, it works pretty well and provides a simple and easy way to write and run SQL queries. Spark SQL uses a query planner and execution engine that parses SQL, builds a logical query plan, and translates it into operations on Spark RDDs. You can also easily extend Spark SQL by writing your own user-defined functions (UDFs), and you can use it together with the Spark Core API.
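As a toy illustration of that translation step, a logical plan like "filter, then project" can be lowered to RDD-style map/filter operations. This is plain Python, not Spark SQL's actual planner; the plan representation and names are invented for the sketch:

```python
# Toy sketch: lowering the logical plan of
#   SELECT name FROM people WHERE age > 30
# into RDD-style transformations. The plan format is illustrative,
# not Spark SQL's real planner API.

rows = [
    {"name": "alice", "age": 34},
    {"name": "bob", "age": 28},
    {"name": "carol", "age": 41},
]

# Logical plan: Project(name) <- Filter(age > 30) <- Scan(rows)
logical_plan = [
    ("filter", lambda row: row["age"] > 30),
    ("project", lambda row: row["name"]),
]


def execute(plan, dataset):
    """Translate each logical operator into a map/filter over the data."""
    for op, fn in plan:
        if op == "filter":
            dataset = filter(fn, dataset)
        elif op == "project":
            dataset = map(fn, dataset)
    return list(dataset)


result = execute(logical_plan, rows)  # ['alice', 'carol']
```

In Spark SQL the same lowering happens cluster-wide, so the resulting operations run in parallel over RDD partitions.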

And Pig?

Uzy: Pig on Spark is developed by Sigmoid Analytics. It is not part of the Spark core components although it is a part of the Spark packages. Pig on Spark uses Pig’s execution plan over MapReduce, hence it doesn’t use Spark’s full functionality. There’s a JIRA ticket about developing new Pig execution plans over Spark.

Is it better to run Spark as a standalone, or on top of Hadoop YARN?

Uzy: If you run Spark on top of YARN, you can utilize services in Hadoop’s ecosystem: HDFS, YARN’s scheduler, YARN’s containers, as well as Kerberos for security purposes.

Another option is running Spark on Mesos. Mesos is a cluster manager which provides an API for resource management and scheduling. The Mesos master can replace the Spark master, so it can take into account other frameworks that run on the cluster and support the dynamic sharing of resources.

Spark can also run standalone: it’s simple and easy, it doesn’t require complex cluster management, it’s great for research purposes and it gives you full control of the cluster. Running tasks on a standalone cluster uses the entire cluster’s resources: there is no overhead from cluster-manager daemons and the like.

In your opinion, what further developments are needed for Spark at the moment?

Uzy: Bug fixes, solving all of the above challenges, finalizing Spark SQL and, again, bug fixes.

Is there anything else that you would like to say about Spark?

Uzy: Spark is a great platform with a rich interface. You can use Spark with Hadoop, and you can also run it in standalone mode, on EC2 or on Mesos.

In addition, Spark’s components provide a unified stack; in other words, several technologies are built on the Spark core and share the same concepts. Therefore, the learning curve is gentler and there is no need to learn and maintain separate solutions. You can even use Spark for streaming.

Moreover, Spark provides a shell interface in Python and Scala which makes the research and development process very fast, easy and enjoyable.

Uzy Hadad is a recognized Big Data specialist with a wide range of experience in research and hands-on consulting services. He is a lecturer at the Academic College of Tel Aviv-Yafo and the founder of Inroid. Inroid supports and guides start-ups and organizations through the new challenges and risks of Big Data. Uzy also holds a PhD in mathematics and computer science from the Hebrew University of Jerusalem.