AWS Glue: Overview, Review, and Comparison

Amazon's AWS Glue service is "a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics". So why has Amazon released AWS Glue, and how is it expected to help enterprise users?

Big data is crucial for any forward-thinking organization that wants more valuable business insights to better serve its customers and outperform its competitors. Unfortunately, far too many organizations aren’t capitalizing on the wealth of information that they have at their fingertips.

According to Tech Pro Research, 50% of respondents report a lack of tools to feed downstream apps with the right information at the right time, and 44% report lacking the time to sort through it.

To simplify enterprise data analytics and reporting, many businesses have installed a data warehouse: a data storage system that collects information from many sources within the organization. Of course, this still prompts the question of how to get information from far-flung databases into the centralized data warehouse.

The ETL process has been designed specifically to transfer information from its source database into a warehouse. However, the challenges and complexities of ETL can make it hard to implement successfully for all of your enterprise data.

For this reason, Amazon has introduced AWS Glue. In this AWS Glue overview, we’ll discuss everything you want to know about Glue: what it is, how it works, reviews, and a comparison with Glue alternatives.

What is ETL?

Extract, transform, load (ETL) is the predominant data integration process for loading information from one or more source databases into a target database or information warehouse. As the name suggests, it comprises three stages or functions:

Extract: The information is read and extracted from the source database(s) into a staging area.
Transform: The raw information is validated, checked for any data integrity issues, and transformed so that it matches the target database schema.
Load: The transformed information is loaded into the target database or data warehouse.
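
To make the three stages concrete, here is a minimal sketch in Python using only the standard library. The sample data, the customers table, and the function names are illustrative assumptions rather than part of any particular ETL product; a real pipeline would read from a source database instead of an inline string.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for a source database or file.
SOURCE_CSV = """id,name,signup_date,amount
1, Alice ,2019-03-01,10.50
2,Bob,2019-03-02,not_a_number
3,Carol,2019-03-05,7
"""


def extract(raw_csv):
    """Extract: read raw rows from the source into a staging list."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows):
    """Transform: validate each row and shape it to the target schema."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail validation
        clean.append((int(row["id"]), row["name"].strip(), row["signup_date"], amount))
    return clean


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, name TEXT, signup_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the warehouse
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT id, name, amount FROM customers ORDER BY id").fetchall()
print(loaded)  # [(1, 'Alice', 10.5), (3, 'Carol', 7.0)]
```

The row with an unparseable amount is rejected during the transform stage, so only validated rows reach the target table.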

Having a well-designed ETL system is essential for data warehouses to unlock the insights contained within databases. ETL tools must address challenges such as correctly transforming the information between source and target, dealing with a wide variety of sources, and scaling to handle massive volumes of information.

The bad news is that many organizations haven’t been able to address these challenges and get the most out of their ETL implementation. According to a recent survey, 68% of respondents reported their analytics efforts are hampered by information siloing.

Seeing an opportunity, services like AWS Glue have stepped in to fill the gaps. So what is Glue, and how does it help with organizations’ ETL challenges?

AWS Glue Overview

As described above, AWS Glue is a fully managed ETL service that aims to take the difficulties out of the ETL process for organizations that want to get more out of their information. The initial public release of Glue was in August 2017. Since that date, Amazon has continued to release updates with additional features and functionality. Some of the most recent updates include:

Support for Python 3.6 in Python shell jobs (June 2019).
Support for connecting directly to Glue via a virtual private cloud (VPC) endpoint (May 2019).
Support for real-time, continuous logging for jobs with Apache Spark (May 2019).
Support for custom CSV classifiers to infer the schema of CSV data (March 2019).

Glue fills a hole in Amazon’s cloud data processing services. Previously, AWS had services for data acquisition, storage, and analysis, yet it was lacking a solution for data transformation.

Under the hood, Glue has three main components:

A Data Catalog, a metadata repository that contains references to the sources and targets that will be part of the ETL process.
An ETL engine that auto-generates scripts in Python and Scala for use throughout the ETL process.
A scheduler that can run jobs and trigger events based on time-based and other criteria.

The purpose of Glue is to facilitate the construction of an enterprise-class data warehouse. It can move information into the warehouse from a variety of sources, including transactional databases and the Amazon cloud.

According to Amazon, there are many use cases for Glue to simplify ETL tasks, including:

Discovering metadata about your various databases and data stores, and archiving them in the catalog.
Creating ETL scripts to transform, denormalize, and enrich the information while en route from source to target.
Automatically detecting changes in your database schema and adjusting the service to match them.
Launching ETL jobs based on a particular trigger, schedule, or event.
Collecting logs, metrics, and KPIs on your ETL operations for monitoring and reporting purposes.
Handling errors and retrying to prevent stalling during the process.
Scaling resources automatically to fit the needs of your current situation.
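
Two of these use cases, discovering metadata and detecting schema changes, can be illustrated with a small sketch. The code below is plain Python, not Glue's implementation: the catalog dictionary, the type names, and the helper functions are all hypothetical, and the inference rule (try int, then float, else string) is a deliberately simplified assumption.

```python
def infer_type(values):
    """Pick the narrowest type that fits every sampled value."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue  # at least one value did not fit; try the next type
    return "string"


def infer_schema(rows):
    """Infer a column -> type mapping from a sample of records."""
    return {col: infer_type([row[col] for row in rows]) for col in rows[0]}


def diff_schemas(old, new):
    """Report added, removed, and retyped columns between two schemas."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(col for col in shared if old[col] != new[col]),
    }


# Hypothetical catalog: table name -> last known schema (metadata only, no data).
catalog = {"orders": {"order_id": "int", "total": "int"}}

# A fresh sample from the source, which has gained a column and a decimal total.
sample = [
    {"order_id": "1", "total": "19.99", "currency": "USD"},
    {"order_id": "2", "total": "5", "currency": "EUR"},
]

current = infer_schema(sample)
changes = diff_schemas(catalog["orders"], current)
catalog["orders"] = current  # update the catalog to match the source
print(changes)  # {'added': ['currency'], 'removed': [], 'retyped': ['total']}
```

Note that the catalog stores only descriptions of the data, which is what lets a metadata repository stay small while covering many data stores.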

In the next section, we’ll explore some features and functionality that Glue offers.

AWS Glue Review: Overview, Features, and Functionality

The major features of Glue include:

Serverless Computing: It is a serverless offering, so you don’t have to manually designate a server to run it. Whenever you want to use its functionality, Amazon spins up a server for you and then shuts it down when it’s no longer in use. This automatic provisioning frees you from managing or scaling the infrastructure yourself.
Apache Spark: Glue runs on the Apache Spark analytics engine for information processing, and users can write their ETL scripts in Python or Scala.
Easy Development: Users who decide to manually write their ETL code have access to “developer endpoints”: environments in which you can develop and test your scripts.
Data Catalog: The Catalog is a metadata repository that stores information about all of your data stores and sources, giving you more visibility into your critical information regardless of location.
Job Scheduling: Glue makes scheduling easier by allowing you to start jobs based on an event or a schedule, or completely on-demand.
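
As a rough illustration of the three scheduling modes, the sketch below builds the request parameters for each kind of trigger as plain dictionaries. It assumes the parameter names accepted by the boto3 Glue client's create_trigger call and Glue's six-field cron(...) schedule syntax; the trigger and job names are hypothetical, and nothing here contacts AWS.

```python
def scheduled_trigger(name, job_name, cron):
    """Parameters for a trigger that fires on a time-based schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",  # fields: minute hour day-of-month month day-of-week year
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name, job_name, upstream_job):
    """Parameters for a trigger that fires when another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": upstream_job, "State": "SUCCEEDED"}
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def on_demand_trigger(name, job_name):
    """Parameters for a trigger that only fires when started manually."""
    return {"Name": name, "Type": "ON_DEMAND", "Actions": [{"JobName": job_name}]}


# Hypothetical job names; run the extract nightly at 02:00 UTC, then the load after it succeeds.
nightly = scheduled_trigger("nightly-extract", "extract-orders", "0 2 * * ? *")
after_extract = conditional_trigger("load-after-extract", "load-warehouse", "extract-orders")
manual = on_demand_trigger("manual-backfill", "backfill-orders")

for params in (nightly, after_extract, manual):
    print(params["Name"], params["Type"])
```

With boto3 installed and AWS credentials configured, each dictionary could in principle be passed along as boto3.client("glue").create_trigger(**params); treat that call as an assumption to check against the current boto3 documentation.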

Downsides of AWS Glue

While AWS Glue is useful for a variety of use cases, some limitations may make it a poor fit for certain organizations.

Limited Integrations

Integration options with AWS Glue are limited. As an AWS tool, it doesn’t integrate well with other technologies. It has native connectors only for JDBC and S3, which means organizations will need to use other methods to connect non-JDBC data sources.

Requires Specific Skill Set

As a relatively new technology, AWS Glue has a steep learning curve. Implementing it requires expertise in serverless architecture, which is still a new concept to many IT departments. AWS Glue also runs on Apache Spark, so developers must know Spark and a language used to program it, such as Scala or Python.

Limited Database Support

Glue offers limited support for traditional relational database queries. It supports only SQL-type queries, and accomplishing anything beyond them requires significant workarounds.

Insufficient Testing Environment

Glue does not provide a test environment. Developers are forced to test their code on real data. Unfortunately, this can be a slow and tedious process, and live data could be negatively impacted if something were to go wrong.

Insufficient for Real-Time Data Processing

With Glue, all data is staged and processed at once. There is no functionality for incremental syncing from the data source.
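
For readers unfamiliar with the term, incremental syncing is commonly implemented with a "high-water mark": each run remembers the newest timestamp it has seen and only pulls rows changed after it. The sketch below is generic Python, not Glue code; the updated_at column, the sample rows, and the function name are assumptions for illustration.

```python
def incremental_extract(source_rows, last_watermark):
    """Return only rows changed since the last run, plus the new watermark."""
    new_rows = [row for row in source_rows if row["updated_at"] > last_watermark]
    # If nothing changed, keep the old watermark.
    new_watermark = max((row["updated_at"] for row in new_rows), default=last_watermark)
    return new_rows, new_watermark


# Hypothetical source table; ISO 8601 timestamps sort correctly as strings.
source = [
    {"id": 1, "updated_at": "2019-06-01T10:00:00"},
    {"id": 2, "updated_at": "2019-06-02T09:30:00"},
    {"id": 3, "updated_at": "2019-06-03T14:15:00"},
]

# First run: an empty watermark means everything is new (a full load).
first_batch, watermark = incremental_extract(source, "")

# A new row arrives; the second run picks up only that row.
source.append({"id": 4, "updated_at": "2019-06-04T08:00:00"})
second_batch, watermark = incremental_extract(source, watermark)

print(len(first_batch), [row["id"] for row in second_batch])  # 3 [4]
```

Compared with restaging the full dataset on every run, this approach moves far less data, which is why it matters for near-real-time workloads.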

Lack of Documentation

Given its newness, AWS Glue is still an evolving technology. Documentation is limited, which could make using it challenging.

AWS Glue: Reviews and Alternatives

Since its general availability release in August 2017, AWS Glue seems to have been fairly well-received. On the business software review platform G2 Crowd, it has received an average rating of 4.0/5 stars, based on 30 reviews.

According to reviews on G2 Crowd, the positive features of Glue include its ability to simplify the data integration process. Reviewer Alkesh G. says:

I have been working with AWS Glue for 2 to 3 years. It allows you to locate, move and transform all your data sets across your business. The most interesting thing about AWS Glue is that it's serverless: you can run all your ETL jobs by just pointing Glue to them. You don't need to configure, provision or spin up servers, and you don't need to manage their life cycle.

However, some users also complain that AWS Glue has a steep learning curve, partially because of the lack of documentation and resources. One reviewer doesn’t hold back, saying:

The documentation and sample code around AWS Glue is horrible. Usually, I raise a support ticket to resolve my issues.

Another user says that it is

too new and not many tutorials or use cases are mentioned on the web, so it will take some time to use this in production.

AWS Glue Alternative: Integrate.io

For those not yet sold on this service, the good news is that it’s far from the only ETL service out there. Companies like Integrate.io offer alternatives for managing and simplifying the data integration process.

The Integrate.io platform offers a complete toolkit for constructing data pipelines from start to finish. Everything from simple replication tasks to advanced data preparation and transformation is made possible with Integrate.io’s easy-to-use, point-and-click user interface.

Included with the Integrate.io platform are integrations with over 100 different popular data stores and SaaS applications: MongoDB, MySQL, Amazon Redshift, PostgreSQL, Google Cloud Platform, Facebook, Salesforce, Jira, Magento, HubSpot, Slack, QuickBooks, and far too many others to list here.

Integrate.io drastically simplifies elastically scaling your data integration infrastructure. Increasing or decreasing the number of active nodes is as simple as adjusting a slider up or down.

On the G2 Crowd website, Integrate.io has received an average rating of 4.4/5 stars, based on 80 reviews. Thanks to this strong user feedback, Integrate.io has been ranked as one of G2 Crowd’s top performers for spring 2019.

Many Integrate.io users write positively about the ease of use and support when using Integrate.io, which is key when handling the complex ETL process.

According to reviewer Nick G:

Integrate.io links to most of the sources and destinations that we need. When there is no native connector, the REST API connector will achieve the result we want, and the support team are always ready to jump in and help if needed. I also like the fact that the support team and comprehensive documentation is often focused on helping you learn achieve the result you want, rather than doing the job for you. This has helped us leverage the learnings for other uses.

Integrate.io user Lally B. agrees, writing:

Integrate.io has excellent customer service. The team goes above and beyond to work with us to develop our data flows and answer any questions we have about the product in their real-time chat system.

Another user says:

Before Integrate.io I had almost no experience with the ETL process, or data in general for the most part. Luckily their support team was fantastic and they were willing to walk me step by step through the convoluted mess that is data management.

AWS Glue Comparison: How Integrate.io Excels

Out of the box, Integrate.io offers multiple features that will help developers get up and running quickly.

Easy Data Transformations

As a low-code solution, Integrate.io features a drag-and-drop interface to build data transformations. Developers can build transformations such as sort, join, filter, and clone quickly without writing a ton of code. Those who want further customization options can use Integrate.io’s API to connect to other monitoring and reporting systems.

Simple Workflow Creation

Workflows automate the sequencing of tasks based on a set of conditions. Using Integrate.io, developers can set up dependencies between packages. They can then trigger packages automatically based on actions from another package.
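
The idea of sequencing packages by their dependencies can be sketched generically with a topological sort. This is not Integrate.io's API: the package names are hypothetical, and the example simply uses graphlib from the Python standard library (Python 3.9 or later) to compute a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical packages, each mapped to the packages that must finish first.
dependencies = {
    "load_warehouse": {"clean_orders", "clean_customers"},
    "clean_orders": {"extract_orders"},
    "clean_customers": {"extract_customers"},
    "extract_orders": set(),
    "extract_customers": set(),
}

# static_order() yields every package after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())

for package in order:
    print("run", package)
```

Each extract runs before its matching clean step, and load_warehouse always runs last because it depends, directly or indirectly, on every other package.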

Highly Flexible REST API

Integrate.io makes it easy to connect to unique or nonstandard data sources through a REST API connector. There is a virtually limitless variety of integration platforms and data sources to connect.

Robust Data Security and Compliance

Integrate.io adheres to the strictest security standards. Regardless of your industry or vertical, Integrate.io meets all requirements.

SOC 2 Compliance: SOC 2 certification is a standard for third-party service providers to secure private customer data.

Firewall Access Control: By default, Integrate.io’s firewall denies access to all internal systems and external networks. It only grants access through protocols and ports that you specify.

Isolation of Customer Applications: Customer applications are separated from each other by using host-based firewalls.

EU and GDPR Data Privacy Compliant: Integrate.io meets the requirements of the GDPR, one of the toughest data protection regulations in the European Union.

HIPAA and CCPA Compliant: Integrate.io meets the security requirements for protecting sensitive health information under HIPAA and consumer data under CCPA.

Conclusion

For many developers and IT professionals, AWS Glue has reduced the complexity and manual labor involved in the ETL process since its release in August 2017.

However, this AWS Glue comparison highlights the drawbacks, such as the newness of the service and the difficult learning curve, meaning that it’s not the right choice for every situation. Companies that are looking for a more well-established, user-friendly, fully managed ETL solution with strong customer support would do well to check out Integrate.io.

To learn more about whether Integrate.io is right for your organization, follow the Integrate.io blog for the latest news and updates, or get in touch with the Integrate.io team for a consultation.
