Introduction

Migrating analytics workloads to the public cloud has been one of the most important big data trends in recent years—and it shows no sign of slowing down.

According to a 2018 study by IT research company Forrester:

  • 51 percent of large enterprises are analyzing complex data in the public cloud; while
  • 72 percent of large enterprises are doing so within on-premises data centers.

By 2021, however, Forrester predicts that the fates will have reversed:

  • 61 percent of large enterprises will use the public cloud for complex data analysis; while
  • 44 percent of large enterprises will still use on-premises systems.

Of course, before data can be processed in the public cloud, it has to get there in the first place via data migration. Enterprises need a robust, mature data migration solution to deal with challenges such as data silos, increasing data volume and complexity, and security and compliance issues.

Azure Data Factory is a data migration service from the Microsoft Azure cloud computing platform that helps Azure users build ETL pipelines for their enterprise data. But with multiple options and configurations available for Azure Data Factory, which is right for your business? In this article, we’ll explore the question of using ETL with Microsoft and Azure Data Factory for analytic workloads in the public cloud.

Table of Contents

What is ETL?

ETL (which stands for extract, transform, load) is one of the most commonly used paradigms for the data integration process.

The three steps of ETL are:

  • Extract: First, data is extracted from a source location such as a file or database.
  • Transform: Next, the data is transformed from its source format in order to fit the target location’s schema.
  • Load: Finally, the transformed data is loaded into a target location such as a data warehouse, where it can be used for analytics and reporting.

The data you need for your analytics workloads may exist in many disparate forms and locations, both internal and external to your organization. For maximum efficiency, this data needs to be stored in a centralized repository, such as a data warehouse. ETL is a crucial part of the data migration process, making it easier and more efficient to integrate many different data sources.

On a final note: ETL is closely related to ELT, another data integration paradigm. ETL and ELT differ in the order in which they perform the “load” and “transform” stages. In other words, ELT transforms the data once it is already loaded into the data warehouse. ELT allows data professionals to pick and choose the data they want to transform, saving time when ingesting large quantities of unstructured information. 

What is Azure Data Factory?

Azure Data Factory is a fully managed data migration and integration service that enables Microsoft Azure users to bring together data from a variety of sources in the Azure public cloud.

The Azure Data Factory service allows users to integrate both on-premises data in Microsoft SQL Server, as well as cloud data in Azure SQL Database, Azure Blob Storage, and Azure Table Storage.

Once Azure Data Factory collects the relevant data, it can be processed by tools like Azure HDInsight (Apache Hive and Apache Pig). Azure Data Factory automates and orchestrates the entire data integration process from end to end, so that users have a single pane of glass into their ETL data pipelines.

According to Microsoft, Azure Data Factory is “more of an Extract-and-Load (EL) and Transform-and-Load (TL) platform rather than a traditional Extract-Transform-and-Load (ETL) platform.” Azure Data Factory is more focused on orchestrating and migrating the data itself, rather than performing complex data transformations during the migration.

In addition, Azure Data Factory is technically not a full ETL tool on its own: it defines control flows that execute various tasks, which may or may not act upon a data source. Until recently, however, Azure Data Factory did not include support for data flows that are responsible for directly migrating information.

What is SSIS?

Microsoft SSIS (SQL Server Integration Services) is an on-premises data migration and integration tool that is part of the Microsoft SQL Server database software.

SSIS first appeared in SQL Server 2005 as a replacement for Microsoft’s Data Transformation Services (DTS) toolkit. Before the introduction of Azure Data Factory, SSIS was the dominant tool for building data integration and transformation pipelines to and from an SQL Server.

With a wide range of capabilities, SSIS includes features such as:

  • Executing SQL statements
  • Collecting, cleansing, and merging data sources
  • Extracting data from sources such as databases (SQL Server, Oracle, Db2, etc.) and Excel spreadsheets
  • Defining ETL data sources and targets
  • User-friendly graphical tools and wizards

Despite the arrival of Azure Data Factory, SSIS isn’t expected to go away any time soon—you could even say that the two tools have a friendly rivalry. Newer versions of Azure Data Factory include the Integration Runtime, a feature that offers data integration capabilities across different network environments. In particular, this feature allows Azure Data Factory to execute SSIS packages (automated import and export pipelines between different data sources).

What are Mapping Data Flows?

With the shift to the public cloud, Microsoft has had to rethink its ETL and data migration offerings. SSIS is suitable for on-premises and IaaS (infrastructure as a service) workloads, but not for use in the public cloud. 

Mapping Data Flows are a new feature in Azure Data Factory that were made generally available in October 2019. With Mapping Data Flows, Azure Data Factory can become a complete ETL solution, combining both control flows and data flows to migrate information in and out of data warehouses.

By using Mapping Data Flows, Azure customers can build data transformations with an easy-to-use visual interface, without having to write lines of code. These data flows are then executed as activities within Azure Data Factory pipelines.

In the words of Mike Flasko, partner director of product management at Microsoft: “Data Factory now empowers users with a code-free, serverless environment [Mapping Data Flows] that simplifies ETL in the cloud and scales to any data size, no infrastructure management required.”

The convenience of Mapping Data Flows’ WYSIWYG environment offers Azure Data Factory users additional flexibility to develop data pipelines as best fits their needs, whether that means code-first or no-code. The ETL activities supported by Mapping Data Flows include:

  • Join
  • Aggregate
  • Pivot
  • Unpivot
  • Split
  • Lookup
  • Sort

Mapping Data Flows are an important step in resolving the well-documented data science shortage that’s currently plaguing the tech industry. “Citizen data scientists”—non-technical employees who nevertheless need access to data-driven insights—can use Mapping Data Flows to build ETL pipelines that simplify the data integration and transformation process.

Kamil Nowinski, one of Microsoft’s Data Platform MVPs (Most Valuable Professionals), has written about the transition from SSIS to Mapping Data Flows in an informative blog post, helping Azure customers translate their SSIS workflows into the new system.

Microsoft ETL: SSIS or Azure Data Factory's Mapping Data Flows?

With all that said, what’s the best way to do ETL in Azure Data Factory?

Mapping Data Flows are the newest way to perform ETL in Azure Data Factory, but they’re far from the only way. Executing SSIS packages from within Azure Data Factory is still a viable way to maintain your on-premises data workloads, thanks to Azure Data Factory's new Integration Runtime feature.

Both Mapping Data Flows and SSIS dramatically simplify the process of constructing ETL data pipelines. SSIS can run on-premises, in the cloud, or in a hybrid cloud environment, while Mapping Data Flows is currently available for cloud data migration workflows only.

Should you combine SSIS with Azure Data Factory? It depends on the specifics of your situation. Azure Data Factory is a robust tool for large volumes of data in the cloud, while SSIS is more lightweight and better suited for smaller jobs. Also consider whether it’s worth the hassle to use both technologies at the same time, making sure that they aren’t stepping on each others’ toes.

Azure Data Factory Alternatives

Despite its full feature set and positive reception, Azure Data Factory has a few important limitations. Most obviously, Azure Data Factory is largely intended for Azure customers who need to integrate data from Microsoft and Azure sources.

Azure Data Factory does have limited support for data from third-party sources, such as Amazon Redshift, MongoDB, and Salesforce. However, Azure Data Factory's list of connections and integrations is less extensive than other data integration solutions.

Xplenty is an ETL data integration platform that makes it easy to construct pipelines from all your ETL sources into a cloud data warehouse. With a simple drag-and-drop interface and more than 100 pre-built integrations, Xplenty enables you to build powerful, information-rich ETL workflows, so that you can start getting smarter business insights.

According to the business software review website G2, Xplenty has an average rating of 4.4 out of 5 stars. G2 has named Xplenty a "Leader for Fall 2019," as well as the number 1 cloud data integration solution.

CIO and G2 reviewer Stephen F. calls Xplenty an "excellent tool if you are looking for a powerful yet simple to use way to move your data... Their support is incredible. They are incredibly responsive and helpful and the people doing the support know their stuff."

Conclusion

Azure Data Factory is a robust and mature solution for integrating structured, semi-structured, and unstructured data from sources such as Microsoft SQL Server, Azure SQL Database, Azure Blob Storage, and Azure Table Storage. It also integrates well with Microsoft’s BI and analytics solutions, such as Power BI and Azure HDInsight.

While SSIS was once Microsoft’s preferred tool for building ETL data pipelines, it now has stiff competition from Azure Data Factory’s own Mapping Data Flows feature. Weigh the pros and cons of Mapping Data Flows and SSIS to decide which data migration solution is better for your purposes.

If you’re looking for a cloud data integration solution with greater range than Azure Data Factory, give Xplenty a try. The Xplenty data integration platform includes pre-built support for integrations with the most popular data stores and analytics tools, including Azure SQL Database and Azure Blob Storage.

Want to find out how Xplenty can help you build data-driven workflows and get cutting-edge business insights? Get in touch with our team of data integration experts for a chat about your needs and objectives.