What do Xplenty, Snowflake, and dbt have in common? When used together, they merge the best of ETL and ELT into a powerful, flexible, and cost-effective ETLT strategy.
In this guide, we’ll show you how to create an ETLT strategy with Xplenty, Snowflake, and dbt. But first, we’ll explain the need for such a strategy, what dbt is, and why you might want to incorporate Xplenty Snowflake, and dbt into an ETLT data transformation stack.
This piece will cover a lot of ground, so feel free to use the following links to navigate:
- Why ETLT With Xplenty, Snowflake, and DBT?
- How to Create an ETLT Strategy with Xplenty, Snowflake, and dbt
- Final Thoughts on ETLT with Xplenty, Snowflake, and DBT
Why ETLT With Xplenty, Snowflake, and dbt?
Let’s start with a birds-eye view of ETLT, Xplenty, Snowflake, dbt, and the benefits of using them:
To understand the benefits of ETLT, we need to start with some background information on the advantages and limitations of ELT (extract, load, transform) data integrations.
In recent years, ELT (extract, load, transform) has grown more popular than ever – and for some use-cases, it’s outpacing traditional ETL (extract, transform, load). This is primarily due to the emergence of cloud data warehouse platforms like Snowflake. These data warehouses are so powerful and efficient that they can quickly and cost-effectively transform data within the warehouse after loading.
Before cloud data warehousing solutions like Snowflake, carrying out post-load transformations on a legacy onsite data warehouse was slow, impractical, and expensive – but now, post-load ELT transformations are easier and more affordable to achieve. The primary advantage of ELT is that it skips the initial staging area of ETL, bypasses the need to design and carry out preload transformations, and loads raw data directly into the data warehouse. This lets you ingest and save massive amounts of unstructured data fast. It also offers the flexibility to structure (and restructure) data later according to your unique data analytics requirements.
However, there are serious limitations to an ELT-only strategy. These limitations are primarily compliance and data security-related. Industry-specific compliance rules often require the masking, encryption, or removal of sensitive data before loading it into a data warehouse. Even if you are not beholden to such restrictions, you might want to protect and secure your data before sending it to the warehouse to prevent data breaches, boost security, protect your organizational secrets, and build client and customer trust.
In this respect, even if you want the benefits of ELT, you might be forced to add a pre-load transformation layer (ETL) to the workflow. In such situations, most organizations simply stick with ETL and decide that the benefits of ELT are not available to them. However, there exists a third approach to data integration that merges the best of both worlds: ETLT.
With an ETLT workflow, you perform the minimum-required preload transformations to adhere to any applicable data compliance standards. Then you load the data into the data warehouse and use the processing power of the data warehouse to manage the rest of the transformations.
Let’s review the most significant benefits of an ETLT data integration strategy:
- Compliance: Allows you to meet compliance and data security requirements that require preload data transformations while still benefiting from ELT’s fast data ingestion and the flexibility to change your business logic as required.
- Data masking Permits the pseudonymization of data via preload data masking so that you can adhere to compliance standards that prohibit sensitive PII (personally identifiable information), PHI (protected health information), and IIHI (individually identifiable health information) from being tied to any specific person. These compliance concerns may apply to you under the European Union’s GDPR standards, HIPAA, or another standard.
- Customer trust: By elevating your levels of data security, ETLT helps build customer trust while allowing you to benefit from ELT flexibility.
- Better security: Through greater security, ETLT reduces the threat of damage from privacy breaches, hacks, and other security-related failures.
- Faster ingestion: Speeds up data ingestion while still maintaining the strictest security standards.
For more information on ETLT, make sure to read our recent article on this best-of-both-worlds approach to data integration.
Xplenty is an ETL-as-a-Service tool that offers a wide range of immediate connections for virtually any data source, cloud-based service, database, or onsite system. Once connected to the source, Xplenty extracts your data into the Xplenty staging area. From there, Xplenty performs any number of data transformations that you set up via an easy-to-use, visual interface, and loads the data into the data warehouse.
When using Xplenty in an ETLT process, you will perform only a minimum number of lightweight transformations within Xplenty – such as masking, removing, or encrypting sensitive data (PHI and PII) for compliance purposes. By keeping preload transformations simple, you can load your data into the data warehouse as quickly as possible. More complex transformations (like JOINing tables, data enrichment, etc.) will happen in the data warehouse itself.
The three primary benefits of using Xplenty for your preload transformations are:
- Speed: Xplenty’s wide range of source and destination connections and it’s ready-made transformations allow you to develop an ETL workflow in minutes. Plus, as a cloud-native and fully-hosted solution, your Xplenty instance is ready to start building ETL pipelines from the minute you first log into the platform.
- Ease-of-Use: Xplenty was designed with ease-of-use in mind. The platform is so easy to operate that anyone can use it to develop ETL workflows. If you’re a complete ETL beginner, your dedicated Xplenty integration specialist is always available to help you design the integration pipelines your use-case requires.
- Affordability: When you use Xplenty, you pay an affordable, flat-rate, monthly fee based on how many connectors you’re using. You don’t pay extra for sending more data through the pipeline, and you don’t pay extra for processing in-pipeline transformations. This makes for an affordable, predictable pricing structure.
TRUSTED BY COMPANIES WORLDWIDE
Enjoying This Article?
Receive great content weekly with the Xplenty Newsletter!
Among the wide array of cloud data warehousing options – like Redshift, BigQuery, and Azure SQL Data Warehouse – why would you want to choose Snowflake to pair with dbt? Mani Gandham offers this concise explanation of what makes Snowflake special:
“[Snowflake] offers a unique blend of features that sit between all of these existing options. It stores data in cloud storage for easy cloning, replication, and time-travel/snapshot queries unlike the others. It separates compute from storage like BigQuery for faster scaling and unlimited concurrency. It still has a concept of a cluster of servers like Redshift/Azure SQL DW but simplifies management to just size. It supports low-latency queries similar to an always running cluster like Redshift while scaling up to unlimited datasets like BigQuery. Another major advantage is much better JSON and unstructured data handling along with a very fast and powerful UI console.”
dbt is a free, open-source product that performs post-load transformation within the data warehouse itself. By allowing you to program any number of data transformation processes in SQL, dbt is a powerful solution for building in-warehouse data pipelines.
Here’s how the dbt team describes the solution:
“dbt is the T in ELT. It doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse. This “transform after load” architecture is becoming known as ELT (extract, load, transform).”
“dbt is a tool to help you write and execute the data transformation jobs that run inside your warehouse. dbt’s only function is to take code, compile it to SQL, and then run against your database.”
The dbt team also adds the following in their documentation:
“dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views.”
“dbt also enables analysts to work more like software engineers, in line with the dbt viewpoint.”
After using Xplenty to perform necessary transformations for data compliance, and loading the data into Snowflake. The final state of your ETLT process uses dbt to perform the rest of the transformations that PIVOT, aggregate, JOIN, SELECT, and GROUP BY – as required by the specific type of data analysis decision-makers require.
Ultimately, this ETLT method preserves your ability to rapidly load large volumes of data into Snowflake by keeping your Xplenty transformations as simple as possible. Then it uses dbt to perform transformations within Snowflake according to the needs of your use-case.
How to Create an ETLT Strategy with Xplenty, Snowflake, and dbt
As we referenced above, the first step in your ETLT strategy involves Xplenty, which extracts data from the source and performs basic transformations that remove/mask/encrypt PHI and PII data. After that, Xplenty loads the data into Snowflake. From there, dbt manages the final data transformations within Snowflake according to your needs.
You’ll find detailed instructions for setting up the Xplenty side of your ETLT equation here – and if you ever get stuck with Xplenty, your dedicated Xplenty specialist is only just a phone call away.
After using Xplenty to transform and load the data into Snowflake, it’s time to develop the rest of your transformation workflow with dbt. For this stage, you’ll need to have some general SQL skills.
However, before we dive into an example of how to create a dbt pipeline, it will help to understand a few things about dbt’s features and components.
dbt: Features and Components
Here are some general characteristics of dbt:
- Allows ELT developers to create transformation workflows coded as SQL. The SQL queries are “models” within a transformation “package.”
- dbt models created by a SQL statement. They are simply a table, i.e., a view of the data found in the data warehouse (Snowflake in this case).
- dbt lets you version your SQL query models for testing purposes.
- Each model features reference links to create a DAG (Directed Acyclic Graph), which saves the query order and data lineage as it is created via the models.
As for its components, dbt is made up of three different file types: SQL (.sql), YAML (.yml), and MARKDOWN (.md). The .sql files save the models and tests, like the SQL statements that create the table/view of the data you are transforming in the warehouse. Some .sql files could be ephemeral and only persist when executing the pipeline. The .yml files store configurations for the dbt project in a project configuration file, and they store configuration and testing information in a schema file. The .md files are optional. They store the documentation for your models.
Installing dbt and Understanding Available Modules
dbt recommends installing their solution via pip, and they prefer using pipenv. After installation, you’ll use ‘dbt init’ to set up a new project. The dbt project you create can include different types of modules such as models, analysis, tests, docs, macros, etc.
Here’s a description of the most common dbt modules:
- Models: A directory that includes .sql files that form your ELT pipelines. Executed in sequence, these form the transformation pipeline.
- Tests: A directory that includes .sql files that execute tests on the data. You can create a test model that checks to ensure specific values meet your exact specifications and any number of custom tests.
- Macros: A directory that includes .sql files created with Jinja. According to dbt documentation: “In dbt, you can combine SQL with Jinja, a templating language. Using Jinja turns your dbt project into a programming environment for SQL, giving you the ability to do things that aren't normally possible in SQL.” As macros, these are intended to be used again and again, and you can combine them to create more complex processes.
- Docs: A directory that includes optional documentation formatted into .md files.
- Logs: A directory that includes run logs. dbt creates and saves run logs when you execute the “dbt run” command.
- Target directories: dbt creates target directories when you compile, build, or run documentation. These hold metadata and the SQL code that has been compiled.
- Analysis: A directory includes ready-to-go SQL queries that you can use at any time to provide different views of your data for data analytics purposes. You may have a variety of these analyses saved for use any time on your data, but they are not part of the project's data transformation workflow.
You can learn more about these and other modules in the dbt documentation.
TRUSTED BY COMPANIES WORLDWIDE
Enjoying This Article?
Receive great content weekly with the Xplenty Newsletter!
Setting up dbt on Snowflake
Setting up dbt on Snowflake requires putting the profiles.yml file in the ~/.dbt directory of the machine that is executing the process. You’ll also need to include the following ‘sf’ profile information in the dbt_project.yml profile field after filling it out with your unique information:
Develop Your Data Transformation Pipeline
Now you’re ready to create a data pipeline. Snowflake offers an excellent example of how to do this with dbt by using a string of transformations from a publically available dbt/Snowflake demo, which offers the TPC-H benchmark datasets. In the following example, we’ll see how John L. Aven and Prem Dubey achieved their desired analytics view of these datasets by aggregating:
- Average Account Balance for all Customers in a Nation
- Average Available Quantity for each Product, regardless of Supplier
- Average Supply Cost for each Product, regardless of Supplier
Aven and Dubey aggregated this information into a table with the following columns:
- Customer Name
- Customer Account Balance
- Market Segment
- Part Available Quantity
- Part Name
- Part Manufacturer
- Part Brand
- Part Type
- Part Size
- Supplier Account Balance
- Supplier Name
- Supply Cost for each Part and Supplier Pair
Next, they chose the following series of SQL transformations to achieve their desired analytics view. By choosing to filter the data through sequential transformations like this (as a package), it’s easier to keep your bearings and make sure that you haven’t made any mistakes. Here’s the pipeline:
1) supplier_parts: JOIN three tables (PART, SUPPLIER, PARTSUPP) on the PARTKEY field and the SUPPKEY field. This step creates an ephemeral table.
2) Average_account_balance_by_part_and_supplier: Compute the average aggregation for the supplier account balance by part and supplier. This step creates an ephemeral table.
3) average_supply_cost_and_available_quantity_per_part: Calculate the average aggregation of the supply cost and available quantity, calculating by part for the supplier. This step creates an ephemeral table.
4) supplier_parts_aggregates: JOIN the aggregations to the supplier_parts table. This step creates a permanent new table.
5) customer_nation_region: JOIN the three tables (CUSTOMER, NATION, REGION) by the NATIONKEY. This step creates an ephemeral table.
6) average_acctbal_by_nation: Calculate the average_customer_account_balance over the nation field. This step creates an ephemeral table.
7) customer_nation_region_aggregates: JOIN the average_customer_account_balance to the customer_nation_region table. This step creates a permanent new table.
8) analytics_view: JOIN customer_nation_region_aggregates to supplier_parts_aggregates. This is for a BI Dashboard.
That’s the end of your transformation pipeline! Keep in mind that each model includes reference links to generate a DAG (Directed Acyclic Graph), so dbt makes it easy to see a visual graph that shows the sequence and logic behind your SQL transformations.
Now that you’ve seen how to design a dbt data transformation pipeline, the next step involves connecting this data to any number of data analytics and BI solutions to perform analyses, create graphs, and display and share your data with team members and decision-makers.
Integrate Your Data Today!
Try Xplenty free for 14 days. No credit card required.
Final Thoughts on ETLT with Xplenty, Snowflake, and DBT
It would take a dozen or more articles to cover this topic in full, so we hope this guide has given you a general understanding of the what, why, and how behind creating an ETLT data integration strategy with Xplenty/Snowflake/dbt. We also hope that you’ll consider Xplenty for your next data integration project.
As an enterprise-grade, ETL-as-a-service solution, Xplenty can serve as a powerful yet easy-to-use addition to your ETLT pipeline. Whatever your data integration use-case happens to be – whether it’s ETL, ELT, or ETLT, the Xplenty team is always available to help you meet your goals. Contact Xplenty for a free trial now!