Amazon RDS and Integrate.io

How do you integrate data from Amazon RDS (Relational Database Service) with data from other sources such as S3, Redshift, or even MongoDB? The answer is Integrate.io. Our data integration on the cloud lets you join data from various sources and then process it to gain new insights.

What about storing the results back to RDS? No problem; Integrate.io does that as well.

In this tutorial, we'll provide an overview of RDS's and Integrate.io's services and then dive into managing your database instances and connecting the two platforms.

What Is Amazon RDS?

Amazon RDS, also known as the "Relational Database Service," makes it easier to scale your relational databases in the cloud. With an intuitive interface to help you set up and operate your databases, RDS is a cost-efficient solution that can resize to any capacity at a moment's notice. RDS also helps automate the time-consuming backend tasks, like setting up databases, provisioning hardware, patching software, and running backups.

With Amazon RDS, tech teams are able to free up time in their schedule to focus on what really counts, like the applications they are building. RDS gives tech teams faster performance, improved security, higher availability, and the compatibility they need to be able to work efficiently and effectively. Since it's available on several database instance types, your RDS setup can be optimized for performance, memory, or I/O. Plus, you can choose from six familiar engines, including Amazon Aurora, MariaDB, MySQL, Oracle Database, PostgreSQL, and Microsoft SQL Server.

What Is Integrate.io?

Amazon RDS is a powerful tool for setting up and operating your database in the cloud, but what's the best way to get your data into RDS? In truth, Amazon's own data integration system has its flaws. That's why more people are turning to Integrate.io. Integrate.io allows you to seamlessly connect all of your platforms, including Amazon RDS, to all of your business's data sources. From simply replicating data to performing complex preparation and transformation tasks, Integrate.io makes it easy with a point-and-click interface.

With Integrate.io, you can decide how you want to work. You can choose to execute packages using either the Integrate.io and Amazon RDS API or the UI, whichever is easiest for you. Your team can take back their time thanks to the ability to monitor job progress, schedule jobs for the future, and sample data outputs to ensure everything's connected, valid, and working as it should be. After all, your team's data management isn't just about the data, it's about the orchestration of the data flow. Integrate.io ensures that orchestration goes smoothly every day so you can sleep through the night.

What Is ETL?

Integrate.io is an ETL tool, meaning it helps you Extract, Transform, and Load data. Amazon has its own ETL service known as AWS Glue. Choosing the right ETL tool is essential as it will play a critical role in your data integration, allowing businesses to gather data across sources and consolidate it into a centralized location. The right tool will also help various types of data work seamlessly together.

You may use an ETL tool to collect and refine data from various sources and then deliver it to a warehouse, such as Amazon RDS. You can also use an ETL tool to migrate data across sources and tools, helping to improve business intelligence initiatives and broaden your data management strategies.

The first step in the process is the extraction step, which requires you to get data from a number of sources so you can combine and analyze it. The sources may include mobile devices; sales applications; existing databases; Customer Relationship Management (CRM) systems; on-premise, hybrid, and cloud environments; and so on. You can perform the data extraction manually, but it's a time-intensive process that's prone to manual errors.

After extracting data, the next step is to transform it, which is generally considered the most important part of the ETL process. While transforming the data, you can apply various rules and regulations to ensure accessibility. You can also cleanse, standardize, remove duplication, verify, and sort the data. All the transformation processes help improve data integrity and ensure that your transformed data loads into its destination in a ready-to-use state that will be most valuable to your business.

What Is AWS Glue?

Amazon offers its own data integration tool known as AWS Glue. It's a serverless, managed ETL service that helps businesses discover, prepare, and combine their data. However, the AWS Glue service, in particular, is known for having a handful of limitations. First off, Glue is built for batch processing, which means any business dealing with real-time applications and data flow should look elsewhere.

Delayed Data Readiness

Another shortfall comes in the form of the Glue Data Catalogs, which unnecessarily add extra steps to the querying process. The Crawler generates Data Catalogs whenever incoming datasets are processed and those catalogs help you find data and use it in various tools. So long as your catalog tables are in sync, you can move it across various S3 paths and other sources and query it with a tool that supports external schemas. Further, catalogs are organized into Databases and Tables, the latter of which maintains where data is kept, what Serialiser Deserialiser is to be used, and the scheme of the data.

To help you keep track of what data is stored in catalogs, AWS Glue keeps a column index. Unfortunately, if you use CSV files, you have to ensure the columns share an identical order in every CVS file or else Glue will begin putting data into the wrong columns and it will not alert you to any order changes.

The main problem, however, is that Glue catalogs are static and do not get updated automatically when the state of the storage system changes. This means that if you want to write a new partition for an existing data location, it's not available for querying because of Glue. Instead, you'll have to re-run the Crawler to make your data queryable using the Catalog table. This makes for an unnecessary step that delays data readiness.

Costly Startup Time

Users have reported a cold start time of up to 12 minutes per Glue Job. This is common in a serverless environment because the first job will always need to take time to set up the environment, which means subsequent jobs generally start quickly since they are reusing the environment. However, Amazon Glue comes with some unique caveats. First off, AWS will tear down the environment after about 10–12 minutes of non-use. Additionally, the environments are specific to each Glue Job. Multiple Glue Jobs cannot use the same environment.

With this in mind, you'll need to consider how the cold start time will impact your efficiency. If you have five Glue Jobs you want to run, for example, you can anticipate about 50 minutes as AWS allocates resources to build an environment for each job. This is no small factor, especially if you anticipate regularly running Glue Jobs and/or running many of them. Meanwhile, Glue gives you little control over the job environment, which can further slow things down. Glue uses DPU as the processing unit, where 1 DPU is 4 vCPU and 16GB RAM, and each DPU has 2 executors. This default configuration isn't ideal for many use cases.

Default Quotas

Previously referred to as "limits," Amazon has its own default quotas you'll need to consider and workaround. Many of these quotas vary across AWS regions and some can be increased, while others cannot. Moreover, these quotas also apply to Amazon IAM policy, or the Identity and Access Management system. Note that setting up the AWS identity and access management system properly is essential to using integrations. You'll need to use your Service Quotas console within AWS to increase any quotas that apply to your account.

Fortunately, these quotas rarely impact developer usage, at least not permanently. AWS states, "Requests up to the maximum allowed increase are automatically approved and are completed within a few minutes." Of course, if you foresee needing to exceed the maximum allowed increases for your work, that's something you should plan for well in advance and consult with AWS about. Also, remember that some web services quotas simply can't be changed at all.

Using Amazon RDS and Integrate.io

As you go through this tutorial, you'll notice the first step under "Connecting to RDS" requires you to configure your security policy in a way that enables Integrate.io to make a connection. In Amazon RDS, security groups are used to control the access of any inbound traffic for a DB instance. Amazon RDS enables the use of three types of security groups: the Amazon Virtual Private Cloud (VPC) security group, the DB security groups, and the EC2-Classic security groups. Each one controls network traffic for different setups across your database engines. Any tools you use must be setup to get past Amazon's firewall.

VPC Security Group: These control access to DB instances and EC2 instances inside Amazon VPC.
DB Security Group: These control access to EC2-Classic DB instances that are not in a VPC.
EC2-Classic Security Group: These control access to an Amazon EC2 instance.

To use Amazon RDS with Integrate.io, you'll need to configure a VPC security group in the appropriate manner. One of the key aspects is that you configure your database instance so Public Accessibility is set to Yes. You can confirm this within your AWS management console. The right network access control settings are fundamental to connecting RDS with Integrate.io and any other service.

With the most basic security settings, you'll simply need to provide Integrate.io access to the IP address and port of your database instance. You can further simplify things by using an IP address range to control access. Depending upon your preferences, you can also set up Amazon RDS to only accept SSL encrypted connections. You can find complete configuration instructions in the Integrate.io Knowledge Base. Integrate.io has both a written and video tutorial available using the RDS MySQL engine as an example, but these instructions will work with any Amazon RDS DB instance.

You can configure and re-configure your parameter groups and authentication rules at any time within the RDS console.

Connecting Integrate.io to RDS

First of all we need to connect to the database running on RDS before we can extract or store data:

Log in to the AWS account and then go to the RDS section.
Configure your RDS security policy so that Integrate.io will be able to connect to it. For full details, please see the Integrate.io knowledge base.
Open the relevant RDS instance and navigate to Databases (left-hand sidebar) > Connectivity & Security (top menu item). Copy the endpoint URL and port, and then make sure to set Public accessibility to Yes.
Navigate to Databases (left-hand sidebar) > Configuration (top menu item) and take note of the DB name and Master username.
Log in to Integrate.io and click the lightning bolt icon (⚡) in the top left corner to navigate to the Connections menu. In the top right corner, click the New Connection button.
Choose the connection type, either MySQL or PostgreSQL. Fill in the relevant fields and click Test Connection. If this doesn’t work, check your inputs. If the inputs are correct, check the security settings.
Click the green Create Connection button to save the connection.
Repeat the process if your destination database is different from the source database.

Integrating Data From RDS

In your Integrate.io dashboard, find the package icon in the left-hand sidebar. Open a new package and give it a name. Select the Dataflow type, leave the template option blank, and then press Create package.
While looking at your new package, click + Add component and select Database from the Source column.
Edit the component you just added by clicking on it. Select the database connection and fill in the corresponding fields. In step three (Select input columns), you can use the Select all button in the top right corner. Click Save.
Add additional sources such as MongoDB or Cloud Storage for S3.
Use a join component to integrate data from both sides.
Click Save to finish editing.

For an example of integrating a relational database with MongoDB, please see “How to Integrate MongoDB with Relational Databases”

Storing Data in RDS

Click the + button on the last component in the chain and select Database under Destinations.
Click the component.
Select the database connection (it may be different from the source connection) and fill in the fields.
On step three, Map input to target table columns, click the Auto-fill button in the top right.
Click Save to finish editing.

Get More Out of Your Data

With Integrate.io, you can easily combine data from Amazon RDS with data from other sources, process it and store the results back to RDS for further querying. Schedule a demo and see what it can do with your data.

big data integration

Integrate Amazon RDS With Other Data Sources

Table of Contents

What Is Amazon RDS?

What Is Integrate.io?

What Is ETL?