MongoDB is a popular non-relational (NoSQL) database. It is document-oriented, distributed, and known for its scalability and flexibility. In this post, we'll demonstrate how you can use MongoDB in your ETL pipelines with Integrate.io. To start, let's briefly discuss why and when you'd want to use MongoDB over a relational database.

Table of Contents

  1. How MongoDB Differs From Relational Databases
  2. Benefits of Using Integrate.io with MongoDB
  3. Using MongoDB in an Integrate.io Pipeline
  4. Implementing the Pipeline
  5. Conclusion

How MongoDB Differs From Relational Databases

In a relational database, we model our data in terms of tables, columns, and relationships among them. In MongoDB, information is instead represented as a collection of documents. These documents are like JSON objects and, by default, do not enforce any structure, which allows a lot of flexibility in data modeling.

One popular rule of thumb for deciding document structure in MongoDB is "what's used together should be stored together". We can store one-to-many (or many-to-many) relationships in array fields in the same document. This "embedding" approach makes queries highly performant and avoids expensive JOIN operations. Because the documents are self-contained, MongoDB can also offer distributed-database features such as sharding, synchronization, and replication, which enable scalability across multiple machines.
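To make the embedding idea concrete, here is a minimal pymongo sketch (the database, collection, and field names are illustrative assumptions, not from any particular schema):

from pymongo import MongoClient

# Connect to a local MongoDB instance (hostname and port are assumptions).
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# One self-contained document: the customer's orders live in an embedded
# array field instead of a separate "orders" table.
db.customers.insert_one({
    "name": "Jane Doe",
    "email": "jane@example.com",
    "orders": [
        {"id": 1001, "total_price": 100},
        {"id": 1002, "total_price": 250},
    ],
})

# Reading the customer back returns the orders too -- no JOIN required.
doc = db.customers.find_one({"email": "jane@example.com"})
print(doc["orders"])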

Of course, this way of representing data requires more storage space, but the benefits may outweigh the cost depending on your use case.

Benefits of Using Integrate.io with MongoDB

There are several reasons why a business would choose to use Integrate.io with MongoDB:

  • Support for both directions of data flow: With Integrate.io, you can use a MongoDB connection in your pipeline both as a data source and as a data destination.
  • Compatibility with other components: You can compose intricate pipelines by combining MongoDB components with existing Integrate.io features. For example, you can migrate data from another database to MongoDB, transform the data along the way, pull data from other Integrate.io integrations like Salesforce, augment it by calling external APIs, and schedule the pipeline to run on incremental data.
  • Security of data: Integrate.io supports SSL for database connections, so your data is transmitted over secured protocols. Moreover, the database resides in your premises, so you own your data.
  • Support of commonly used Mongo data types: Integrate.io supports all the commonly used MongoDB data types, including strings, number types (int, float, double), and BSON (binary JSON). 

In the next section, we'll see in detail how you can use MongoDB in your Integrate.io pipelines.

Using MongoDB in an Integrate.io Pipeline

To use any database in Integrate.io, there are two steps involved: setting up a database connection and adding the database component to the pipeline.

Setting up a MongoDB Connection

To use MongoDB in any of the Integrate.io pipelines, you'll first add a MongoDB connection in the Integrate.io dashboard. The MongoDB connection in Integrate.io supports SSL, which means that data transmission in and out of MongoDB happens over a secured protocol. 

Like most database services, MongoDB can be set up in two ways: managed or self-hosted. The self-hosted version involves installing the MongoDB binaries yourself on a machine, whereas the managed version gives you an out-of-the-box setup that allows you to connect to the database right away. The managed offering from the MongoDB team is called MongoDB Atlas. It comes with features like auto-scaling, monitoring, alerting, backups, and integration with other tools like Data Lake.
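As a rough sketch of what connecting to an Atlas cluster looks like from code (the hostname, user, and password below are placeholders; the mongodb+srv scheme also requires the dnspython package):

from pymongo import MongoClient

# Atlas connection strings use the mongodb+srv:// scheme, and TLS is
# enabled by default on Atlas clusters. Credentials here are placeholders.
uri = "mongodb+srv://myUser:myPassword@cluster0.example.mongodb.net/"
client = MongoClient(uri)

# Quick connectivity check: the ping command returns {'ok': 1.0} on success.
print(client.admin.command("ping"))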

Using the Database Components 

Integrate.io provides both a source component and a destination component for MongoDB. This means you can load data from and save data to a MongoDB connection in your pipeline. These components, combined with other Integrate.io features such as select transformations, filter transformations, and curl functions, provide a lot of flexibility in how you can shape the data.

The source and destination components support the major MongoDB data types. The following table shows how these data types map to the data types native to Integrate.io:

MongoDB Data Type      Integrate.io Data Type
32-bit Integer         Integer
64-bit Integer         Long
Double                 Double
Boolean                Boolean
String                 String
Object                 Map
Array                  Bag

The support for the Object and Array data types is quite helpful. This data is stored as BSON (Binary JSON), which makes advanced configurations such as indexing and query optimization possible.
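For instance, once an Array field is stored as real BSON rather than a stringified JSON blob, fields inside it can be indexed and queried directly. A small pymongo sketch, assuming a customers collection with an embedded orders array:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder host
customers = client["shop"]["customers"]

# Indexing a field inside an embedded array creates a multikey index.
customers.create_index("orders.id")

# This lookup can now use the index instead of scanning the collection.
match = customers.find_one({"orders.id": 29393})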

Example Use Case

An online commerce business maintains information about its customers in a relational database table (we'll use a MySQL database for this example). The company wants to perform analytics on this data to calculate various metrics for each customer. To do that, they first have to add customer order data, which is available to them through a RESTful API. Once the order data is combined with the customer information, it can be transformed according to the requirements and ingested into a MongoDB collection. This collection can then serve as the data source for any analytics tool or engine.

To get the order details for a customer, we need to query the /orders resource endpoint via the POST HTTP method and provide the customer's email in the request payload. The JSON response returned by the API for the "Orders" endpoint for a particular customer looks something like this:

{
  "orders": [
    ...
    {
      "id": 29393,
      "total_price": 100,
      "tax_lines": [
        // These objects themselves are nested and so on.
      ],
      ...
    },
    {
      ...
    },
    ...
  ]
}

Here's how the complete response looks. 
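Outside of Integrate.io, the same lookup could be sketched in Python like this (the base URL and lack of authentication are assumptions for illustration; the /orders endpoint and email payload come from the description above):

import requests

# Hypothetical base URL; the real endpoint and auth scheme will differ.
API_URL = "https://api.example.com/orders"

def fetch_orders(email: str) -> dict:
    """POST the customer's email to /orders and return the JSON response."""
    response = requests.post(API_URL, json={"email": email}, timeout=30)
    response.raise_for_status()
    return response.json()

orders = fetch_orders("jane@example.com")["orders"]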

Implementing the Pipeline

Now that our requirements are defined, let's see how we'd implement such a pipeline in Integrate.io.

Adding the MongoDB Connection

To set up a MongoDB connection in the Integrate.io dashboard, make sure you have access to the following details from MongoDB:

  • Hostname
  • Username and password
  • Database name

For this example, we will use the details of a free cluster set up on MongoDB Atlas. If you're using Atlas, you can find the hostname by clicking the Connect button in the Clusters tab of your dashboard.

Next, go to the Database Access tab for the user and password details. Create a new user, and take note of the username and password. Enter these values in the New Connection configuration in Integrate.io. Be sure to check the "Connect using SSL" option so that Integrate.io can connect to the Atlas cluster securely. By default, an Atlas cluster comes with a database named myFirstDatabase. If you want to use another database, you can create one via the Mongo shell. Finally, you can test the connection before saving to make sure everything works.
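Note that MongoDB creates databases (and collections) lazily, so there is no explicit CREATE DATABASE step; the first write brings them into existence. A quick pymongo illustration with placeholder names and credentials:

from pymongo import MongoClient

# Placeholder Atlas URI; substitute your own cluster and credentials.
client = MongoClient("mongodb+srv://myUser:myPassword@cluster0.example.mongodb.net/")

# The "analytics" database and "customers" collection appear on first write.
client["analytics"]["customers"].insert_one({"bootstrap": True})
print(client.list_database_names())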

Integrate.io also supports some advanced configurations like specifying Read Preference, Authentication Database, and connecting directly to replica set members of the cluster. You can learn more about them in the Allowing Integrate.io access to MongoDB docs.

Adding a Connection to the Relational Database

The relational database acting as the data source needs its own connection, configured separately. For this example, we will use a locally hosted MySQL database (exposed through an ngrok tunnel).

Setting up the ETL Pipeline

Now that we have data connections in place, we can start working on the pipeline.

Create a new dataflow package from the dashboard and give it a descriptive name like "Customers and Orders to MongoDB."

The pipeline implementation has three components: a MySQL database source, a Select transformation, and a MongoDB database destination.

Let's take a look at each of these components in detail:

MySQL database source component: In the Input connection setting, we select the connection we created earlier. Then we provide the table name (customers in our case). After that, Integrate.io automatically pulls up the schema with a data preview. After selecting the fields we want to use, giving them an alias, and verifying their data types, we can save it.

Select transformation component: In the select component, we'll add an expression that uses Integrate.io's Curl method to fetch the order details for a customer and assign the response body to a field.

Since it might be inconvenient to construct the expression in the fields inside the dialog box, we can open the expression editor by clicking the pencil icon on the left and build the expression there. In the editor, we construct the JSON body for the request with the TOMAP and ToJson functions.

Database Destination component: Next, we want to ingest all of the data into a MongoDB collection. Here, the operation type is specified as "Merge with existing data using update and insert"; you can check our docs to understand in detail what this operation does. The columns to write to the database can be populated using the Auto-fill option. We need to select the columns comprising the primary key or composite key that uniquely identifies an entry in the collection. If we want to store any of the map or bag fields as BSON instead of a string, we can check the BSON checkbox.
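Conceptually, this merge behaves like an upsert keyed on the chosen identity columns. As a rough pymongo analogy (collection and field names are assumptions), it corresponds to:

from pymongo import MongoClient

customers = MongoClient("mongodb://localhost:27017")["shop"]["customers"]  # placeholders

record = {
    "email": "jane@example.com",  # the unique key column in this sketch
    "name": "Jane Doe",
    "orders": [{"id": 29393, "total_price": 100}],
}

# Update the document matching the key if it exists; insert it otherwise.
customers.update_one(
    {"email": record["email"]},
    {"$set": record},
    upsert=True,
)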

Upon completion, the pipeline chains these three components together: the MySQL source, the Select transformation, and the MongoDB destination.

Pipeline in Action

Once implemented, we can validate and then execute the package by creating a job. The job status can be monitored from the Integrate.io dashboard.

Once the job is finished, we can verify that the data was saved correctly by running queries through a MongoDB shell.
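Equivalently, a couple of quick checks from Python with pymongo (the collection and field names below are assumptions) confirm that the documents arrived with their nested structure intact:

from pymongo import MongoClient

# Placeholder Atlas URI and database name.
db = MongoClient("mongodb+srv://myUser:myPassword@cluster0.example.mongodb.net/")["analytics"]

# How many customer documents landed in the collection?
print(db.customers.count_documents({}))

# Spot-check one document: "orders" should be a real BSON array, not a string.
doc = db.customers.find_one({"email": "jane@example.com"})
print(type(doc["orders"]), doc["orders"][:1])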

Conclusion

In this post, we discussed using MongoDB to store data and walked through the steps to add MongoDB to an Integrate.io pipeline. We connected to a managed MongoDB instance set up on Atlas over a secured protocol, combined additional data fetched from a REST API, and ingested it into MongoDB using its native data types (arrays, objects, strings, etc.).

We touched only on basic functionality in this post, but there is much more you can do with MongoDB in Integrate.io. For more complex use cases, refer to our Knowledge Base. If you have not yet tried Integrate.io for yourself, schedule a call with our Support team to find out if it's a good fit for your use case.