But, it's Python that continues to dominate the ETL space. There are well over a hundred Python tools that act as frameworks, libraries, or software for ETL. We're going to look at a few of them. To be specific, we're comparing some of the more popular tools to see where they shine.
Before we start, let's address why you would want to set up an ETL pipeline using Python as opposed to an ETL tool. After all, ETL tools are developed and maintained by professionals who live-and-breathe ETL.
For most of you, ETL tools become the go-to once you start dealing with complex schemas and massive amounts of data. You certainly can use SQLAlchemy and pandas to execute ETL in Python. But, it is time-consuming, labor-intensive, and often overwhelming once your schema gets complex.
There are three primary situations where Python makes sense.
- You personally feel comfortable with Python and are dead set on building your own ETL tool.
- You have extremely simple ETL needs.
- You have a unique, hyper-specific need that can only be met via custom coding an ETL solution through Python.
If you fit into one of those three categories, you have a wide variety of options on the market.
The tools below are some of the many Python tools available for ETL needs. We selected these tools based on usability, popularity, and diversity. So, each of these tools may be the solution that you're looking for personally.
1. Apache Airflow
Apache Airflow is an open-source Python-based workflow automation tool used for setting up and maintaining data pipelines. An important thing to remember here is that Airflow isn't an ETL tool. Instead, it helps you manage, structure, and organize your ETL pipelines using Directed Acyclic Graphs (DAGs). The easiest way to think about DAGs is that they form relationships and dependencies without actually defining tasks. The nature of DAGs gives you the room to run a single branch multiple times or skip branches of your sequence when necessary. So, you can say that A should run after B and that C should run ever 2 minutes. Or, you can say that A and B run every 2 minutes and C runs after B.
The typical Airflow setup looks like this:
Metadata database > Scheduler > Executor > Workers
So, the metadata database will store your workflows/tasks (i.e., DAGs), the scheduler (typically run as-a-service) uses your DAG definitions to select tasks, and the executor will determine which worker executes your task. Workers are simply the processes that execute the logic of your workflow/task.
Apache Airflow makes a great addition to your existing ETL toolbox since it's incredibly useful for management and organization. Also, there are already Google Cloud and AWS hooks and operators available for Airflow, so it has the main integrations that make it useful for cloud warehousing environments.
When does Apache Airflow make sense? If you're performing long ETL jobs or your ETL has multiple steps, Airflow will let you restart from any point during the ETL process. That being said, Apache Airflows IS NOT a library, so it has to be deployed and may make less sense on small ETL jobs.
Some things to note about Apache Airflow
- The Apache Airflow project was started by Maxime Beauchemin at Airbnb.
- The Apache Project announced that Airflow is a Top-Level Project in 2019.
- The majority of Airflow users leverage Celery as their executor, which makes managing execution simple.
- You can manage all of your DAG workflows via the Airflow WebUI. This means that you can "set it and forget it" by scheduling automated workflows.
- Airflow lets you execute through a command-line interface, which can be extremely useful for executing tasks in isolation outside of your scheduler workflows.
- If you're looking for a workflow orchestrator like Airflow, Prefect is a solid alternative with some advanced features. You can migrate your DAGs straight over.
Luigi (aptly named after a certain plumber) is an open-source Python-based tool that lets you build complex pipelines. Originally developed by Spotify to automate their insane workloads (think terabytes of data daily,) and it's currently used by a wide variety of big companies like Stripe and Red Hat.
There are three main benefits to Luigi.
- It gives you dependency management with stellar visualization.
- It provides failure recovery via checkpoints.
- It has command-line interface integration.
The primary difference between Luigi and Airflow is the way that dependencies are specified and tasks are executed.
With Luigi, you have Tasks and Targets. Tasks consume targets, which are generated by a finished task. So, a task will plop out a target, another task will eat that target and plop out another target. This makes the entire process extremely straightforward, and it keeps workflows simple. This target-based approach is perfect for simple ETL tasks. But, it can get cumbersome (if not impossible) with hyper-complex tasks.
When does Luigi make sense? If you need to automate simple ETL processes (like logs) Luigi can handle them rapidly and without much setup. When it comes to complex tasks, Luigi is limited by its strict pipeline-like structure.
Some things to note about Luigi:
- You cannot interact with processes once they are running with Luigi.
- Luigi definitely isn't "set it and forget it." Luigi doesn't sync tasks to workers for you, schedule, alert, or monitor like Airflow would.
- Luigi's UI (or lack thereof) can be a pain point.
If you've been working with Python for a while, you probably know about pandas. This library provides data structures/analysis tools for Python. In particular, panda adds R-style data frames — which makes ETL processes (like data cleansing) easier.
When it comes to ETL, you can do almost anything with Pandas if you're willing to put in the time. Plus, pandas is extraordinarily easy to run. You can set up a simple script to load data from a Postgre table, transform and clean that data, and then write that data to another Postgre table.
It's not all fun-and-games. When it comes to in-memory and scalability, pandas performs relatively poorly. You can scale pandas using parallel chunks, but, again, it's not as simple as something like Airflow would be for scaling ETL operations.
When does pandas make sense? When you need to rapidly extract data, clean and tranform it, and write it to an SQL database/Excel/csv. Once you start working with large data sets, it usually makes more sense to use a more scalable approach.
Some things to note about pandas:
- pandas is sponsored by NumFocus.
- A large chunk of Python users looking to ETL a batch start with pandas.
When it comes to ETL, petl is the most straightforward solution. The aptly named Python ETL solution does, well, ETL work. You can build tables in Python, extract data from multiple sources, etc. It's similar to pandas in functionality, though it doesn't feature the same level of analysis. petl can handle hyper-complex datasets, makes good use of system memory, and has incredible scale. But, it definitely lacks in the speed department.
That being said, it's much easier to leverage petl than it is to build your own ETL using SQLAlchemy or other custom-coded solutions.
When does petl make sense? When you want the basics of ETL without the analytics and you're not concerned with speed.
Some things to note about petl:
- petl isn't known for its speed or ability to handle large datasets.
- petl stands for "Python ETL."
Bonobo is a lightweight Python ETL framework that's incredibly easy-to-use and lets you rapidly deploy pipelines and execute them in parallel. With Bonobo you can extract from a variety of sources (e.g., CSV, JSON, XML, XLS, SQL, etc.) and the entire transformation follows atomic UNIX principles. The best thing about Bonobos is that new users don't have to learn a new API. If you know Python, working in Bonobo is a breeze. Graphs can be built with business needs in mind, and you can build your own library or do simple ETL batch automation.
Really, Bonobo is the "everyone" tool for Python users. It's simple. It can handle semi-complex schemas. Plus, it's open-source and scalable.
When does Bonobo make sense? When you need simple, lightweight ETL jobs done, and you don't have the time to learn a new API.
Some things to note about Bonobo:
- There is an official Bonobos Docker extension that lets you run jobs within Docker containers.
- Bonobo has CLI execution.
- Bonobo has Graphviz for ETL job visualization.
- There is an official SQLAlchemy extension (in alpha).
Bubbles is another Python framework that you can use to run ETL. Unlike some other ETL frameworks, Bubbles uses metadata to describe pipelines as opposed to script-based. While Bubbles is written for Python, the author claims that it's not meant to be Python-exclusive in nature. A driving theme of Bubbles is that it's technologically agnostic, so you don't have to worry about working with or accessing data — just the transformation.
When does Bubbles make sense? If you need a rapid ETL setup that's technologically agnostic — giving you the freedom to focus solely on the ETL process.
Some things to note about Bubbles:
- Bubbles is still in the prototype phase.
- The primary draw to Bubbles is that it's abstract, making working directly on ETL a focus as opposed to learning about the query language.
Other ETL Tools, Libraries & Frameworks
Don't worry! We're not going to ignore all of the other tools. There are simply too many Python (and other) ETL tools that handle ETL to really count. We tried to keep our list simple by including multiple popular ETL options that all have different use-cases. But, there are plenty of other tools that we could have mentioned.
In this section, we're going to list other popular ETL tools by language.
- BeautifulSoup: This Python tool pulls data out of webpages (XML, HTML). Has integrations with tons of ETL tools like petl.
- PyQuery: Also extracts data from webpages but with a jquery-like syntax.
- Blaze: This is an interface that queries data. Also, this is part of the "Blaze Ecosystem" which is a framework for an ETL process using Blaze, Dask, Datashape, DyND, and Odo.
- Dask: You can use Dask for Parallel computing via task scheduling. It can also process continuous data streams. Again, this is part of the "Blaze Ecosystem."
- Datashape: This is a simple data-description language that's similar to NumPly. It describes in-situ structured data without the need for canonical transformation.
- DyND: This is the Python exposure for DyND — which is a C++ library for dynamic, multidimensional arrays.
- Odo: This lets you move data between multiple containers. Odo lets you use the native CSV loading capabilities of SQL databases, which is faster than trying to load with Python.
- Joblib: This is yet another set of tools that use Python functions for pipelines, and it has a few unique perks that make it suitable for certain jobs. In particular, Joblib has optimizations for numpy arrays.
- lxml: This library processes HTML and XML in Python rapidly.
- Retrying: This lets you add retry behavior to executions.
- riko: This is a replacement for Yahoo! Pipes, and it's particularly useful for stream data. It's not a full-blown ETL solution, but it's pure Python and it makes extracting data streams much easier.
- Xplenty: Want to set up automated pipelines across a massive variety of sources using best-of-breed visualization and rapid integrations? That's Xplenty!
- AWS Data Pipeline: This is Amazon's data pipeline solution. You can set up pipelines between AWS instances and legacy servers.
- AWS Glue: This is Amazon's fully managed ETL solution. It's point-and-click, and you manage it directly through your AWS Management Console.
- AWS Batch: This is used for batch computing jobs on AWS resources. It has insane scalability and is well-suited for engineers look to do large compute jobs.
- Google Dataflow: This is Google's ETL solution for batch jobs and streams.
- Azure Data Factory: This is Microsoft's ETL solution.
- Toil: This USCS project which handles ETL almost identically to Luigi. In fact, you can use Luigi to wrap Toil pipelines for additional checkpointing.
- Pachyderm: This is another great alternative to tools like Airflow. Here's a great GitHub writeup about some of the simple differences between Airflow and Pachyderm. Note: Paychyderm has an open-source edition on their website.
- Mara: This is another ETL framework for Python. It's a middle-ground between pure Python and Apache Airflow, so it's fast and simple to set up.
- Pinball: This is Pinterest's workflow manager. It has auto-retries, priorities, overrun policies, and tons of horizontal scalability.
- Azkaban: Created at LinkedIn, Azkaban is Java-based and used for Hadoop batches. If you're trying to run hyper-complex Hadoop batches, think about Oozie instead.
- Dray.it: This is a Docker workflow engine that helps tremendously with resource management.
- Spark: This is a full-blown toolkit that has tons of helpful tools. With Spark Streaming you can set up your entire batch streaming ETL.
- Pentaho Kettle: This is an ETL visualization and graphical tool.
We won't dive into these particular tools. But, we'll list them off for those of you looking for ETL tools outside of the Python space.
Read This If You Need A Better ETL Solution
When it comes to Python ETL frameworks, libraries, and tools, you have plenty of options. Whether you want to leverage multiple tool layers to develop your ETL solution with Python or you want an out-of-the-box experience with a cloud-based ETL tool like Xplenty, you can definitely find something that works for you.
Are you looking for an ETL solution that's simple enough to make everyday use pain-free but complex enough to handle your demanding workloads? With Xplenty you can set up robust pipelines in minutes. We'll help you find the value hidden in your tech stack. Contact us to learn more!