Amazon's AWS Glue service is "a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics". So why has Amazon released AWS Glue, and how is it expected to help enterprise users?
Big data is crucial for any forward-thinking organization that wants more valuable business insights to better serve its customers and outperform its competitors. Unfortunately, far too many organizations aren’t capitalizing on the wealth of data that they have at their fingertips. According to a survey by PricewaterhouseCoopers (PwC), two-thirds of companies believe that they are getting “little tangible benefit” or “no benefit whatsoever” from their enterprise data.
In order to simplify the task of enterprise data analytics and reporting, many businesses have chosen to install a data warehouse: a data storage system that collects information from many different sources within the organization. Of course, this still prompts the question of how to get data from far-flung databases into the centralized data warehouse.
The ETL process was designed specifically to transfer data from source databases into a data warehouse. However, the challenges and complexities of ETL can make it hard to implement successfully across all of your enterprise data.
For this reason, Amazon has introduced AWS Glue. In this article, we’ll discuss everything you want to know about AWS Glue: what it is, how it works, reviews of the AWS Glue service, and a comparison with AWS Glue alternatives.
Integrate Your Data Today!
Try Xplenty free for 14 days. No credit card required.
What is ETL?
Extract, transform, load (ETL) is the predominant data integration process for loading information from one or more source databases into a target database or data warehouse. As the name suggests, it consists of three stages or functions:
- Extract: The data is read and extracted from the source database(s) into a staging area.
- Transform: The raw data is validated, checked for any data integrity issues, and transformed so that it matches the target database schema.
- Load: The transformed data is loaded into the target database or data warehouse.
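The three stages above can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: it uses two in-memory SQLite databases as stand-ins for a source system and a warehouse, and the table names (`customers`, `dim_customer`), the column names, and the helper `make_demo_databases` are all hypothetical.

```python
import sqlite3


def make_demo_databases():
    """Create a toy source database and an empty target (hypothetical schemas)."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE customers (id INTEGER, email TEXT, signup_date TEXT)")
    source.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [
            (1, " Ana@Example.com ", "2018-03-14"),
            (2, "not-an-email", "2018-07-01"),
            (3, "lee@example.com", "2019-01-22"),
        ],
    )
    target = sqlite3.connect(":memory:")
    target.execute(
        "CREATE TABLE dim_customer (customer_id INTEGER, email TEXT, signup_year INTEGER)"
    )
    return source, target


def extract(source):
    """Extract: read raw rows from the source into a staging list."""
    return source.execute("SELECT id, email, signup_date FROM customers").fetchall()


def transform(rows):
    """Transform: validate each row and reshape it to match the target schema."""
    cleaned = []
    for customer_id, email, signup_date in rows:
        if not email or "@" not in email:
            continue  # drop rows that fail a basic integrity check
        cleaned.append((customer_id, email.strip().lower(), int(signup_date[:4])))
    return cleaned


def load(target, rows):
    """Load: write the transformed rows into the target table."""
    target.executemany(
        "INSERT INTO dim_customer (customer_id, email, signup_year) VALUES (?, ?, ?)",
        rows,
    )
    target.commit()


def run_etl(source, target):
    """Run the three stages in order and return the number of rows loaded."""
    transformed = transform(extract(source))
    load(target, transformed)
    return len(transformed)
```

Calling `run_etl(*make_demo_databases())` loads the two valid rows and skips the one with a malformed email address. Real ETL tools add much more on top of this (schema discovery, scaling, monitoring), but the extract, transform, load shape stays the same.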
A well-designed ETL system is essential if data warehouses are to unlock the insights contained within databases. ETL tools must address challenges such as correctly transforming the data between source and target, dealing with a wide variety of data sources, and scaling to handle massive volumes of data.
The bad news is that many organizations haven’t been able to address these challenges and get the most out of their ETL implementation.
According to a survey by behavioral analytics company Interana, for example, nearly 70 percent of organizations regularly encounter questions about customer engagement that they’re unable to answer with their current tools.
Seeing an opportunity, services like AWS Glue have stepped in to fill the gaps. So what is AWS Glue exactly, and how does it help with organizations’ ETL challenges?
What is AWS Glue?
As described above, AWS Glue is a fully managed ETL service that aims to take the difficulties out of the ETL process for organizations that want to get more out of their big data. The initial public release of AWS Glue was in August 2017. Since that date, Amazon has continued to actively release updates for AWS Glue with new features and functionality. Some of the most recent AWS Glue updates include:
- Support for Python 3.6 in Python shell jobs (June 2019).
- Support for connecting directly to AWS Glue via a virtual private cloud (VPC) endpoint (May 2019).
- Support for real-time, continuous logging for AWS Glue jobs with Apache Spark (May 2019).
- Support for custom CSV classifiers to infer the schema of CSV data (March 2019).
The arrival of AWS Glue fills a hole in Amazon’s cloud data processing services. Previously, AWS had services for data acquisition, storage, and analysis, yet it was lacking a solution for data transformation.
Under the hood, AWS Glue consists of:
- The AWS Glue Data Catalog, a metadata repository that contains references to data sources and targets that will be part of the ETL process.
- An ETL engine that automatically generates scripts in Python and Scala for use throughout the ETL process.
- A scheduler that can run jobs and trigger events based on time-based and other criteria.
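To make the Data Catalog component more concrete, here is a minimal sketch of reading table schemas out of it from Python. It assumes a Glue client created elsewhere, for example with `boto3.client("glue")`, and a catalog database of your own. The helper name `list_catalog_schemas` is hypothetical and not part of any AWS library.

```python
def list_catalog_schemas(glue, database_name):
    """Return {table_name: [(column_name, column_type), ...]} for one catalog database.

    `glue` is assumed to be a boto3 Glue client, e.g. boto3.client("glue").
    """
    schemas = {}
    request = {"DatabaseName": database_name}
    while True:
        page = glue.get_tables(**request)
        for table in page["TableList"]:
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            schemas[table["Name"]] = [(col["Name"], col["Type"]) for col in columns]
        token = page.get("NextToken")
        if not token:
            return schemas
        # Results may be paginated; pass the token back to fetch the next page.
        request["NextToken"] = token
```

The client is passed in as an argument rather than created inside the function, which keeps the sketch easy to try against a stub object before pointing it at a real AWS account.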
The purpose of AWS Glue is to facilitate the construction of an enterprise-class data warehouse. Information can be moved into the data warehouse from a variety of sources, including transactional databases as well as the Amazon cloud.
According to Amazon, there are many possible use cases for AWS Glue to simplify ETL tasks, including:
- Discovering metadata about your various databases and data stores, and archiving them in the AWS Glue Data Catalog.
- Creating ETL scripts in order to transform, denormalize, and enrich the data while en route from source to target.
- Automatically detecting changes in your database schema and adjusting the service in order to match them.
- Launching ETL jobs based on a particular trigger, schedule, or event.
- Collecting logs, metrics, and KPIs on your ETL operations for monitoring and reporting purposes.
- Handling errors and retrying in order to prevent stalling during the process.
- Scaling resources automatically in order to fit the needs of your current situation.
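As an illustration of the "handling errors and retrying" use case, the sketch below starts a Glue job from Python, polls its state, and starts it again if the run fails. It is a client-side approximation written under assumptions: `glue` is assumed to be a boto3 Glue client created elsewhere, the helper name `run_job_with_retries` and its default values are hypothetical, and this is not a description of how the service retries jobs internally.

```python
import time

# Job run states treated here as "this attempt is over and did not succeed".
FAILED_STATES = {"FAILED", "TIMEOUT", "STOPPED", "ERROR"}


def run_job_with_retries(glue, job_name, max_attempts=3, poll_seconds=30):
    """Start a Glue job and retry on failure; return the ID of the successful run.

    `glue` is assumed to be a boto3 Glue client, e.g. boto3.client("glue").
    """
    for attempt in range(1, max_attempts + 1):
        run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
        while True:
            run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
            state = run["JobRunState"]
            if state == "SUCCEEDED":
                return run_id
            if state in FAILED_STATES:
                break  # give up on this run and start a new attempt
            time.sleep(poll_seconds)  # still starting or running; check again later
    raise RuntimeError(f"Job {job_name!r} did not succeed after {max_attempts} attempts")
```

A fixed polling interval keeps the example short. In practice you might prefer exponential backoff between attempts, or an event-driven notification instead of polling.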
In the next section, we’ll explore some of the features and functionality that AWS Glue has to offer.
AWS Glue: Features and Functionality
The major features of AWS Glue include:
- Serverless computing: AWS Glue is a serverless offering, which means that you don’t have to manually designate a server to run it. Whenever you want to use AWS Glue functionality, Amazon spins up a server for you, and then shuts it down when it’s no longer in use. This automatic provisioning frees you from the task of managing or scaling the infrastructure yourself.
- Apache Spark: AWS Glue is based on the Apache Spark analytics engine for big data processing, and users can write their own ETL scripts for it in Python or Scala.
- Easy development: Users who decide to manually write their ETL code with AWS Glue have access to “developer endpoints”: environments in which you can develop and test your AWS Glue scripts.
- AWS Glue Data Catalog: The AWS Glue Data Catalog is a metadata repository that stores information about all of your data stores and sources, giving you more visibility into your data assets regardless of location.
- Job scheduling: AWS Glue makes the task of scheduling easier by allowing you to start jobs based on an event or a schedule, or completely on-demand.
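The job scheduling feature can be sketched as follows: a time-based trigger that starts an existing job once a day. The example assumes a boto3 Glue client created elsewhere (for instance with `boto3.client("glue")`) and a job that already exists. The helper names, the `-daily` naming convention, and the default hour are hypothetical choices made for illustration.

```python
def build_daily_trigger(job_name, hour_utc=2):
    """Build the request for a trigger that runs `job_name` once a day (UTC)."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    return {
        "Name": f"{job_name}-daily",  # hypothetical naming convention
        "Type": "SCHEDULED",
        # AWS cron fields: minutes, hours, day-of-month, month, day-of-week, year.
        "Schedule": f"cron(0 {hour_utc} * * ? *)",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def schedule_daily_job(glue, job_name, hour_utc=2):
    """Create the trigger through the Glue client and return the trigger's name."""
    response = glue.create_trigger(**build_daily_trigger(job_name, hour_utc))
    return response["Name"]
```

Separating the request builder from the API call makes the schedule easy to inspect or unit test without touching AWS. For on-demand runs, no trigger is needed at all: the job can simply be started directly.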
AWS Glue: Reviews and Alternatives
Since its general availability release in August 2017, AWS Glue seems to have been fairly well-received. On the business software review platform G2 Crowd, AWS Glue has received an average rating of 4.0/5 stars, based on 30 reviews.
“I have been working with AWS Glue for 2 to 3 years. It allows you to locate, move and transform all your data sets across your business. The most interesting thing about AWS Glue is that it's serverless: you can run all your ETL jobs by just pointing Glue to them. You don't need to configure, provision or spin up servers, and you don't need to manage their life cycle.”
However, some users also complain that AWS Glue has a steep learning curve, partially due to the lack of documentation and resources. One reviewer doesn’t hold back, saying:
Another user says that AWS Glue is
AWS Glue Alternative: Xplenty
For those not yet sold on the AWS Glue service, the good news is that it’s far from the only ETL service out there. Companies like Xplenty offer alternatives to AWS Glue for managing and simplifying the data integration process.
The Xplenty platform offers a complete toolkit for constructing data pipelines from start to finish. Everything from simple replication tasks to advanced data preparation and transformation is made possible with Xplenty’s easy-to-use, point-and-click user interface.
Included with the Xplenty platform are integrations with over 100 different popular data stores and SaaS applications: MongoDB, MySQL, Amazon Redshift, PostgreSQL, Google Cloud Platform, Facebook, Salesforce, Jira, Magento, HubSpot, Slack, QuickBooks, and far too many others to list here.
Like AWS Glue, Xplenty drastically simplifies the task of elastically scaling your data integration infrastructure. Increasing or decreasing the number of active nodes is as simple as adjusting a slider up or down.
On the G2 Crowd website, Xplenty has received an average rating of 4.4/5 stars, based on 80 reviews. Thanks to this strong user feedback, Xplenty has been ranked as one of G2 Crowd’s high performers for spring 2019.
Many Xplenty users write positively about the ease of use and support when using Xplenty, which is key when handling the complex ETL process.
According to reviewer Nick G:
Xplenty links to most of the sources and destinations that we need. When there is no native connector, the REST API connector will achieve the result we want, and the support team are always ready to jump in and help if needed. I also like the fact that the support team and comprehensive documentation is often focused on helping you learn achieve the result you want, rather than doing the job for you. This has helped us leverage the learnings for other uses.
Before Xplenty I had almost no experience with the ETL process, or data in general for the most part. Luckily their support team was fantastic and they were willing to walk me step by step through the convoluted mess that is data management.
Since its release in August 2017, AWS Glue has helped many developers and IT professionals reduce the complexity and manual labor involved in the ETL process.
However, the drawbacks of AWS Glue, such as the newness of the service and the difficult learning curve, mean that it’s not the right choice for every situation. Companies that are looking for a more well-established, user-friendly, fully managed ETL solution with strong customer support would do well to check out Xplenty.