Hadoop ETL with Apache Pig



What does it mean to be a pig? Well, according to the philosophers behind the Apache Pig project, pigs eat anything, live anywhere, and are domestic animals. They even claim that pigs can fly!

In all seriousness, Apache Pig is a high-level language for processing data in parallel. It is intended to be a general framework, but for the moment Pig only works with Apache Hadoop. Much like Hive with its query language, HiveQL (HQL), Pig translates its language, Pig Latin, into MapReduce jobs. Both Pig and Hive can be used to run jobs on Hadoop instead of writing MapReduce code in Java, although Pig is specifically designed for ETL and batch processing. Before you start using Pig for ETL, here are several things you should know.


Pig Tips

Load/Store Functions

Pig has built-in load and store functions for binary data, JSON, text files, and more. If they cannot handle your data, you will need to write your own functions. So even though Pig is supposed to save you from coding, some code may have to be written or at least copy-pasted from the web.
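
As a rough sketch of what the built-in functions look like in a script, the snippet below loads comma-delimited text and JSON and writes JSON back out. The file names and schemas are assumptions made up for illustration; PigStorage, JsonLoader and JsonStorage are Pig built-ins.

```pig
-- File names and field names below are illustrative assumptions.
-- PigStorage reads delimited text; the argument is the field delimiter.
shows = LOAD 'shows.csv' USING PigStorage(',')
        AS (band:chararray, city:chararray, show_date:chararray);

-- JsonLoader takes a schema string describing the expected fields.
events = LOAD 'events.json'
         USING JsonLoader('id:int, name:chararray');

-- JsonStorage writes each tuple out as a JSON record.
STORE shows INTO 'shows_json' USING JsonStorage();
```

If none of the built-ins fit, the same USING clause is where a custom load or store function would be plugged in.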

User Defined Functions (UDFs)

If Pig’s built-in functions are not enough, it also supports writing user defined functions (UDFs). Built-in functions such as COUNT and SUM cover the common operations on collections, or bags as they are called in Pig Latin, but custom processing of bags generally requires a UDF. UDFs can:

  1. Receive data as input and return data as output (load/store functions as discussed previously)

  2. Receive parameters and return a single scalar value

  3. Receive a collection as input and return a collection or another value

UDFs have to be coded manually. The good news is that this can be done in several programming languages: Java, Jython, JavaScript, JRuby or Groovy. Pig 0.12 also added streaming UDFs, which run in an external interpreter, so scripting languages without a JVM implementation can be used as well. Pick your favorite and code away.
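
Here is a minimal sketch of how a scripted UDF is wired into a Pig script. The file string_udfs.py and its function normalize_band are hypothetical; only the REGISTER ... USING jython syntax comes from Pig itself.

```pig
-- 'string_udfs.py' and normalize_band are hypothetical names.
-- The file would hold a Python function annotated with @outputSchema.
REGISTER 'string_udfs.py' USING jython AS string_udfs;

-- Input file and schema are assumptions for illustration.
shows = LOAD 'shows.csv' USING PigStorage(',')
        AS (band:chararray, city:chararray, show_date:chararray);

-- Call the UDF through the namespace given in REGISTER.
cleaned = FOREACH shows GENERATE
              string_udfs.normalize_band(band) AS band,
              city,
              show_date;
```

For a streaming UDF, the REGISTER line would use streaming_python in place of jython; the rest of the script stays the same.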

Storing Intermediate Data

Pig can store the data at any point in the data flow, a feature which is really useful for development and debugging. Doing the same thing with Hive is difficult because it’s hard to dismantle an HQL query, not to mention the data would need to be saved in a temporary table.
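
A short sketch of the idea, assuming a tab-delimited shows file and made-up output paths: a STORE statement can sit in the middle of the script, and the flow simply carries on after it.

```pig
-- Input file, schema and output paths are illustrative assumptions.
shows = LOAD 'shows.tsv'
        AS (band:chararray, city:chararray, show_date:chararray);

recent = FILTER shows BY show_date >= '2014-05-01';

-- Snapshot the intermediate relation so it can be inspected later.
STORE recent INTO 'debug/recent_shows';

-- The data flow continues from the same relation.
by_band = GROUP recent BY band;
counts  = FOREACH by_band GENERATE group AS band, COUNT(recent) AS num_shows;

STORE counts INTO 'output/show_counts';
```

During development, DUMP can be used in place of STORE to print a relation to the console.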

Pipelining Data

Let's say the data needs to be processed externally as part of the ETL by a Bash script or a Python program. Pig can pipeline the data to an external application, wait until it is finished, receive the processed data and continue from there. It is also possible to process the data with SQL, although that requires loading it into the relevant database first.
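
In Pig Latin this is done with the STREAM operator. The sketch below assumes a hypothetical script, clean_shows.py, that reads tab-separated rows on stdin and writes tab-separated rows to stdout; the file names and schema are also assumptions.

```pig
-- 'clean_shows.py' is a hypothetical external script.
-- SHIP copies it to the cluster nodes that run the job.
DEFINE clean_cmd `python clean_shows.py` SHIP('clean_shows.py');

shows = LOAD 'shows.tsv'
        AS (band:chararray, city:chararray, show_date:chararray);

-- Send each row through the external program and read its output back.
cleaned = STREAM shows THROUGH clean_cmd
          AS (band:chararray, city:chararray, show_date:chararray);

STORE cleaned INTO 'output/cleaned_shows';
```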



Null Issues

Some of Pig’s functions have problems handling null values in the data and may throw one of those ugly NullPointerExceptions, the exception that Java throws when code expects a reference to an object but receives null instead. The behavior is inconsistent, though: some of Pig’s functions can handle nulls while others cannot. Filtering nulls out before doing any aggregation can help to avoid this issue, although you may not remember to do so every time, or may not even want to filter them.
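
A minimal sketch of the filtering approach, with an assumed sales file and schema. The last statement shows one alternative for when you would rather keep the rows: swap nulls for a default value with a conditional expression.

```pig
-- File name and schema are illustrative assumptions.
sales = LOAD 'sales.tsv' AS (product:chararray, amount:double);

-- Option 1: drop rows with nulls before aggregating.
valid      = FILTER sales BY product IS NOT NULL AND amount IS NOT NULL;
by_product = GROUP valid BY product;
totals     = FOREACH by_product GENERATE group AS product,
                                         SUM(valid.amount) AS total;

-- Option 2: keep the rows and replace a null amount with a default.
defaulted = FOREACH sales GENERATE product,
                (amount IS NULL ? 0.0 : amount) AS amount;
```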


Workflows

Only data flows are supported by Pig, not workflows. In other words, conditionals, scheduling, and job coordination are not available as part of the framework. If you need workflow management, check out additional projects that integrate with Pig such as Oozie, Azkaban, and Luigi.

GROUP Statements

GROUP doesn’t work like SQL’s GROUP BY clause. Instead of aggregating the data as SQL does, Pig returns one row per distinct value of the grouping field, holding a bag of all the rows that share that value. To make things more complicated, if you group by several fields, the group field becomes a tuple of those fields.

For example, running band_shows = GROUP shows BY band; returns two fields: group, populated by the band names, and shows, containing bags of the tuples that share each band value:

group            shows
rolling stones   {(rolling stones, oslo, 2014-05-14), (rolling stones, lisbon, 2014-05-29)}
sky ferreira     {(sky ferreira, iowa city, 2014-04-13)}
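
To get something closer to SQL's GROUP BY, the grouped relation is usually followed by a FOREACH that aggregates each bag. The sketch below assumes the shows relation has the fields band, city and show_date (the field names other than band are made up), and also shows how FLATTEN unpacks the group tuple when grouping by several fields.

```pig
-- File name and the city/show_date field names are assumptions.
shows = LOAD 'shows.tsv'
        AS (band:chararray, city:chararray, show_date:chararray);

-- One row per band, with a bag of that band's shows.
band_shows = GROUP shows BY band;

-- Aggregate each bag to get an SQL-style result.
show_counts = FOREACH band_shows GENERATE group AS band,
                                          COUNT(shows) AS num_shows;

-- Grouping by several fields makes 'group' a tuple; FLATTEN unpacks it.
band_city        = GROUP shows BY (band, city);
band_city_counts = FOREACH band_city GENERATE FLATTEN(group) AS (band, city),
                                              COUNT(shows) AS num_shows;
```

With the three sample rows above, show_counts would hold a count of 2 for rolling stones and 1 for sky ferreira.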

JOIN Statements

Unlike SQL, Pig supports only equi-joins, meaning that data can be joined only on equality between fields and not with any other operator. This can be a hindrance in certain cases, for example when joining data on date fields and keeping only the newer or older dates. The solution is to write UDFs or to use CROSS followed by FILTER, for example:

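-- Load the same file under two aliases for the self-join (schema types are assumed)
Events = LOAD 'sales.txt' AS (EventID:int, EventDate:chararray);
Events2 = LOAD 'sales.txt' AS (EventID:int, EventDate:chararray);
-- CROSS pairs every row with every other row; FILTER keeps the pairs we want
Product = CROSS Events, Events2;
Result = FILTER Product BY Events::EventID == Events2::EventID AND Events::EventDate > Events2::EventDate;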
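
CROSS produces every possible pairing of rows, which gets expensive quickly on large inputs. When the condition includes at least one equality, as the EventID match in the example above does, a cheaper variation is to do an ordinary equi-join on that field and filter on the remaining condition afterwards. This is a sketch under the same assumed schema, not the only way to do it:

```pig
-- Schema types are assumptions; dates are compared as ISO-formatted strings.
Events  = LOAD 'sales.txt' AS (EventID:int, EventDate:chararray);
Events2 = LOAD 'sales.txt' AS (EventID:int, EventDate:chararray);

-- The equality part of the condition is handled by the join itself.
Joined = JOIN Events BY EventID, Events2 BY EventID;

-- The non-equality part is applied to the joined rows.
Result = FILTER Joined BY Events::EventDate > Events2::EventDate;
```

If the condition has no equality at all, CROSS followed by FILTER remains the fallback.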


Pig can be used to run ETL jobs on Hadoop. It saves you from writing MapReduce code in Java while its syntax may look familiar to SQL users. Nonetheless, it is important to take some time to learn Pig and to understand its advantages and limitations. Who knows, maybe pigs can fly after all.
