ETL (extract, transform, load) is the leading method of data integration for software developers the world over. By providing an efficient way of extracting information from different sources and collecting it in a centralized data warehouse, ETL is the engine that has powered the business intelligence and analytics revolution of the 21st century.
While ETL is a high-level concept, there are many ways of implementing ETL under the hood, including both pre-built ETL tools and coding your own ETL workflow. Thanks to its ease of use and popularity for data science applications, Python is one of the most widely used programming languages for building ETL pipelines.
Creating an ETL pipeline from scratch is no easy task, even if you’re working with a user-friendly programming language like Python. The good news is that there’s no shortage of ETL Python frameworks at hand to simplify and streamline the ETL development process.
But what is an ETL Python framework exactly, and what are the best ETL Python frameworks to use? In this article, we’ll go over everything you need to know about choosing the right Python framework for building ETL pipelines.
What is an ETL Python Framework?
An ETL Python framework is a foundation for developing ETL software written in the Python programming language.
In general, Python frameworks are reusable collections of packages and modules that are intended to standardize the application development process by providing common functionality and a common development approach. For example, some of the most popular Python frameworks are Django for web application development and Caffe for deep learning.
ETL Python frameworks, naturally, have been created to help developers perform batch processing on massive quantities of data. These frameworks make it easier to define, schedule, and execute data pipelines using Python.
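Before looking at the frameworks themselves, it helps to see the pattern they formalize. Here's a minimal extract-transform-load pipeline written with plain Python functions (the CSV source and the cleanup rules are purely illustrative):

```python
import csv
import io

def extract(source):
    """Extract: parse raw CSV text into rows (dicts)."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows):
    """Transform: normalize names and cast amounts to float."""
    return [
        {"name": row["name"].strip().title(), "amount": float(row["amount"])}
        for row in rows
    ]

def load(rows, warehouse):
    """Load: append the cleaned rows to an in-memory 'warehouse'."""
    warehouse.extend(rows)
    return warehouse

raw = "name,amount\n alice ,10.5\n BOB ,3\n"
warehouse = []
load(transform(extract(raw)), warehouse)
print(warehouse)
# [{'name': 'Alice', 'amount': 10.5}, {'name': 'Bob', 'amount': 3.0}]
```

An ETL framework takes this same three-stage idea and adds the scheduling, error handling, and scale that a few hand-rolled functions can't provide.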
ETL Python Frameworks vs. Libraries
The terms “framework” and “library” are often used interchangeably, even by experienced developers. To a certain degree, conflating these two concepts is understandable. Both frameworks and libraries are collections of code written by a third party with the goal of simplifying the software development process.
However, there are important differences between frameworks and libraries that you should know about, especially when it comes to ETL Python code:
- A software library is a collection of helper functions and objects to assist with the software development process. Libraries allow developers to write “plug and play” code, inserting library functions in their code base as needed to save the time and effort of writing these functions themselves.
- A software framework may consist of one or more libraries, all oriented toward a common purpose. Unlike libraries, frameworks usually dictate the overarching structure and architecture of your application, defining a design philosophy that developers must obey. Frameworks are suited to a “fill in the blanks” style of application development, in which developers insert the necessary code in order to make the framework function.
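The practical difference is inversion of control: with a library, your code calls the helper; with a framework, you register your code and the framework decides when to call it. A toy sketch (the `MiniFramework` class is invented for illustration and isn't from any real framework):

```python
import json

# Library style: your code drives, the library helps.
record = {"id": 1}
payload = json.dumps(record)  # you call the library function directly

# Framework style: the framework drives, your code fills in the blanks.
class MiniFramework:
    def __init__(self):
        self.steps = []

    def step(self, func):
        """Developers register their code with the framework..."""
        self.steps.append(func)
        return func

    def run(self, value):
        """...and the framework decides when and how to call it."""
        for func in self.steps:
            value = func(value)
        return value

app = MiniFramework()

@app.step
def double(x):
    return x * 2

@app.step
def describe(x):
    return f"result={x}"

print(app.run(21))  # the framework invokes double, then describe
# result=42
```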
4 Top ETL Python Frameworks
With all that said, what are the best ETL Python frameworks to use for your next data integration project? Below, we’ll go over 4 of the top Python ETL frameworks that you should consider.
1) Bonobo

Bonobo bills itself as “a lightweight Extract-Transform-Load (ETL) framework for Python 3.5+,” including “tools for building data transformation pipelines, using plain Python primitives, and executing them in parallel.”
Using Bonobo, developers can easily extract information from a variety of sources, including XML/HTML, CSV, JSON, Excel files, and SQL databases. Then, you can use pre-built or custom transformations to apply the appropriate changes before loading the data into your target data warehouse.
More specifically, data in Bonobo is streamed through nodes in a directed acyclic graph (DAG) of Python callables that is defined by the developer (i.e. you). Each node runs in parallel whenever possible on an independent thread, slashing runtime and helping you avoid troublesome bottlenecks.
Bonobo developers prioritized simplicity and ease of use when building the framework, from the quick installation process to the user-friendly documentation. The building blocks of ETL pipelines in Bonobo are plain Python objects, and the Bonobo API is as close as possible to the base Python programming language. Bonobo also includes integrations with many popular and familiar programming tools, such as Django, Docker, and Jupyter notebooks, to make it easier to get up and running.
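The chain-of-callables idea can be approximated with the standard library alone. The sketch below runs each "node" on its own thread and streams items between stages through queues; it only illustrates the style of a linear Bonobo graph and is not Bonobo's actual API:

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream

def run_chain(source, *nodes):
    """Run each node on its own thread, streaming items through queues,
    loosely echoing how an ETL graph executes its stages in parallel."""
    queues = [queue.Queue() for _ in nodes]

    def feed():
        for item in source:
            queues[0].put(item)
        queues[0].put(SENTINEL)

    def stage(node, q_in, q_out):
        while True:
            item = q_in.get()
            if item is SENTINEL:
                if q_out is not None:
                    q_out.put(SENTINEL)
                return
            result = node(item)
            if q_out is not None:
                q_out.put(result)

    threads = [threading.Thread(target=feed)]
    for i, node in enumerate(nodes):
        q_out = queues[i + 1] if i + 1 < len(nodes) else None
        threads.append(threading.Thread(target=stage, args=(node, queues[i], q_out)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

results = []
run_chain(range(3), lambda x: x + 1, results.append)
print(sorted(results))
# [1, 2, 3]
```

In real Bonobo code you would define your extract/transform/load callables, wire them into a `bonobo.Graph`, and let the framework handle the threading for you.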
Bottom line: Bonobo is an ETL Python framework that’s appealing for many different situations, thanks to its ease of use and many integrations.
2) Bubbles

The amusingly named Bubbles is “a Python framework for data processing and data quality measurement.”

One important thing to note about Bubbles is that, while the framework is written in Python, its author Stefan Urbanek says Bubbles is “not necessarily meant to be used from Python only.” Instead of implementing the ETL pipeline with Python scripts, Bubbles describes ETL pipelines using metadata and directed acyclic graphs. Each operation in the ETL pipeline (e.g. data aggregation, data filtering, data cleansing) is represented by a node in the graph.
The core concept of the Bubbles framework is the data object, which is an abstract representation of a data set. Bubbles can extract information from sources including CSV files, SQL databases, and APIs from websites such as Twitter.
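Describing a pipeline as metadata rather than as imperative code might look like the following. This is a plain-Python sketch in the spirit of that approach, not the Bubbles API; the operation names and parameters are invented for illustration:

```python
# The pipeline is data: a list of (operation, parameters) pairs.
PIPELINE = [
    ("filter", {"field": "status", "value": "active"}),
    ("aggregate", {"field": "amount"}),
]

# A small registry maps operation names to implementations.
OPERATIONS = {
    "filter": lambda rows, field, value: [r for r in rows if r[field] == value],
    "aggregate": lambda rows, field: sum(r[field] for r in rows),
}

def run(pipeline, rows):
    """Interpret the metadata, applying each operation in turn."""
    data = rows
    for name, params in pipeline:
        data = OPERATIONS[name](data, **params)
    return data

rows = [
    {"status": "active", "amount": 10},
    {"status": "closed", "amount": 5},
    {"status": "active", "amount": 7},
]
print(run(PIPELINE, rows))
# 17
```

Because the pipeline is just a description, a framework built this way can swap in different backends to execute the same graph, which is what makes the approach technology-agnostic.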
Bottom line: Bubbles is best-suited for developers who aren’t necessarily wedded to Python, and who want a technology-agnostic ETL framework.
3) pygrametl

pygrametl describes itself as “a Python framework which offers commonly used functionality for development of Extract-Transform-Load (ETL) processes.” First made publicly available in 2009, pygrametl is now on version 2.6, released in December 2018. According to pygrametl developer Christian Thomsen, the framework is used in production across a wide variety of industries, including healthcare, finance, and transport.
Most notably, pygrametl is compatible with both CPython (the original Python implementation written in the C programming language) and Jython (the Java implementation of Python that runs on the Java Virtual Machine). This makes it a good choice for ETL pipelines that may have code in multiple programming languages.
In general, pygrametl operates on rows of data, which are represented under the hood as Python dictionaries. pygrametl also includes support for basic parallelism when running ETL processes on multi-core systems.
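The row-as-dict style is easy to picture with a classic warehousing task: replacing a natural key in each fact row with a surrogate key from a dimension table. The sketch below uses plain Python rather than pygrametl's actual classes, so the `DimensionTable` here is illustrative only:

```python
class DimensionTable:
    """Toy dimension table that hands out integer surrogate keys."""

    def __init__(self):
        self._keys = {}

    def ensure(self, value):
        """Return the surrogate key for value, inserting it if new."""
        if value not in self._keys:
            self._keys[value] = len(self._keys) + 1
        return self._keys[value]

products = DimensionTable()
facts = []
# Each row is a plain dict; the natural key is swapped for a surrogate key.
for row in [{"product": "apple", "qty": 3}, {"product": "pear", "qty": 1},
            {"product": "apple", "qty": 2}]:
    row["product_id"] = products.ensure(row.pop("product"))
    facts.append(row)

print(facts)
# [{'qty': 3, 'product_id': 1}, {'qty': 1, 'product_id': 2}, {'qty': 2, 'product_id': 1}]
```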
Bottom line: pygrametl’s flexibility in terms of programming language makes it an intriguing choice for building ETL workflows in Python.
4) Mara

Mara is “a lightweight ETL framework with a focus on transparency and complexity reduction.” In the words of its developers, Mara sits “halfway between plain scripts and Apache Airflow,” a popular Python workflow automation tool for scheduling execution of data pipelines.
As an “opinionated” Python ETL framework, Mara has certain principles and expectations for its users, including:
- The use of PostgreSQL as a data processing engine.
- A web-based UI for inspecting, running, and debugging ETL pipelines. The UI includes helpful visualizations such as a graph of all nodes and a chart breaking down the pipeline by each node’s runtime.
- A priority queue that ranks nodes on the cost (i.e. time) of executing them, with costlier nodes running first.
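Cost-based scheduling of the kind described in the last bullet can be sketched with the standard library's `heapq` module. This is only an illustration of the idea, not Mara's internals, and the node names and costs are made up:

```python
import heapq

# Hypothetical nodes with estimated execution costs (e.g. seconds).
nodes = [("load_sales", 120), ("load_customers", 45), ("build_report", 300)]

# heapq is a min-heap, so negate the cost to pop the costliest node first.
heap = [(-cost, name) for name, cost in nodes]
heapq.heapify(heap)

order = []
while heap:
    neg_cost, name = heapq.heappop(heap)
    order.append(name)

print(order)
# ['build_report', 'load_sales', 'load_customers']
```

Running the costliest nodes first tends to shorten the total wall-clock time of the pipeline, since the long tasks aren't left until the end.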
To date, Mara is still lacking documentation, which could dissuade anyone looking for a Python ETL framework with an easier learning curve. However, Mara does provide an example project that can help users get started. Also, Mara currently does not run on the Windows operating system.
Bottom line: Mara is an opinionated Python ETL framework that works best for developers who are willing to abide by its guiding principles.
Python ETL with Xplenty
If you’re looking to perform ETL in Python, there’s no shortage of ETL Python frameworks at your disposal. But as your ETL workflows grow more complex, hand-writing your own Python ETL code can quickly become intractable—even with an established ETL Python framework to help you out.
Although Python ETL frameworks are a great help for many developers, they're not the right fit for every situation. None of the frameworks listed above covers everything you need to build a robust ETL pipeline: input/output, database connections, parallelism, job scheduling, configuration, logging, monitoring, and more. Even if you use one of these Python ETL frameworks, you'll still need expert-level knowledge of Python and ETL to successfully implement, test, deploy, and manage an ETL pipeline all by yourself.
For these reasons, many developers are turning to Xplenty and other low-code ETL platforms. Xplenty comes with more than 100 pre-built integrations between databases and data sources, dramatically simplifying the ETL development process. Even better, for those who still want to use Python in their ETL workflow, Xplenty includes the Xplenty Python wrapper, which allows you to access the Xplenty REST API from within a Python program. You can rely on Xplenty to do the ETL heavy lifting for you, and then build your own Python scripts to customize your pipeline as necessary.
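Calling a REST API from a Python script is straightforward with the standard library. The endpoint and token below are placeholders, not real Xplenty values, and the request is only constructed here, not sent:

```python
import urllib.request

# Hypothetical endpoint and token -- placeholders, not real Xplenty values.
API_URL = "https://api.example.com/v1/jobs"
API_TOKEN = "YOUR_API_TOKEN"

request = urllib.request.Request(
    API_URL,
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Accept": "application/json",
    },
)

# In real code you would call urllib.request.urlopen(request)
# and parse the JSON response body.
print(request.get_header("Authorization"))
# Bearer YOUR_API_TOKEN
```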
The Python ETL frameworks above are all intriguing options—but so is Xplenty. For organizations that don't have the skill, time, or desire to build their own Python ETL workflow from scratch, Xplenty is the ideal solution.
Ready to get started building ETL pipelines with Xplenty? Get in touch with our team today for a 7-day free trial of the Xplenty platform.