Integrating data from MongoDB and a relational database sounds like a major headache. On one hand, you have a schemaless NoSQL database containing JSON objects, and on the other, an SQL database with a fully defined schema. How can you easily integrate them? With Xplenty’s data integration on the cloud, of course!

In this tutorial, we will use Xplenty to perform a naive sentiment analysis by integrating data from MongoDB and a relational database. Our MongoDB database on MongoHQ stores a collection of tweets about Big Data. Our RDBMS is a MySQL database running on Amazon RDS, which is loaded with SentiWordNet, a publicly available sentiment dictionary.

For more information on Xplenty's native MongoDB connector, visit our Integration page.

Integrate Your Data Today!

Try Xplenty free for 14 days. No credit card required.


The Data

Tweets are stored as JSON objects on MongoDB:

  {
    "created_at":"Tue Jul 02 09:33:13 +0000 2013",
    "text":"@tomassino how is the taste of #mongodb and #hadoop for breakfast?",
    "source":"<a href=\"\" rel=\"nofollow\">Twitter for Android</a>",
    "user":{
      "name":"Markus Schmidberger",
      "location":"Munich, Bavaria, Germany"
    }
  }

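To make the structure concrete, here is a minimal Python sketch of pulling the fields the dataflow uses out of such a document. The dict below is a hypothetical stand-in for a document fetched with a MongoDB driver; the `split()` call is only a rough approximation of Xplenty's TOKENIZE + Flatten step.

```python
# Hypothetical tweet document standing in for a MongoDB query result.
tweet = {
    "created_at": "Tue Jul 02 09:33:13 +0000 2013",
    "text": "@tomassino how is the taste of #mongodb and #hadoop for breakfast?",
    "user": {"name": "Markus Schmidberger", "location": "Munich, Bavaria, Germany"},
}

text = tweet["text"]
user = tweet["user"]["name"]
keywords = text.lower().split()  # rough stand-in for Flatten(TOKENIZE(text))
```

After this step, each keyword can be treated as its own record, which is exactly what the join against the sentiment dictionary needs.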
SentiWordNet is stored as a MySQL table:

 a  00001740  0.125  0  able#1  (usually followed by 'to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"
 a  00002098  0  0.75  unable#1  (usually followed by 'to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"
 a  00002312  0  0  dorsal#2 abaxial#1  facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"

The schema:

  1. POS

  2. ID

  3. PosScore

  4. NegScore

  5. SynsetTerms

  6. Gloss
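As a quick illustration of how a raw SentiWordNet row maps onto that six-field schema, here is a hedged Python sketch; the row string is an abbreviated version of the first sample line above.

```python
# One tab-separated SentiWordNet row (gloss abbreviated for brevity).
row = "a\t00001740\t0.125\t0\table#1\t(usually followed by 'to') having the necessary means or skill"

# Split the row into the six schema fields.
pos, synset_id, pos_score, neg_score, terms, gloss = row.split("\t")
entry = {
    "POS": pos,
    "ID": synset_id,
    "PosScore": float(pos_score),
    "NegScore": float(neg_score),
    "SynsetTerms": terms.split(),  # an entry may hold several terms
    "Gloss": gloss,
}
```

Note that SynsetTerms can hold several space-separated terms (e.g. `dorsal#2 abaxial#1`), which is why the dataflow later flattens them into one record per keyword.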

Integrating MongoDB with RDBMS

The following Xplenty dataflow loads data from both MongoDB and the RDBMS, splits the tweets into keywords, joins those keywords against the sentiment dictionary, calculates a sentiment score per tweet, and finally stores the results.

mongodb and rdbms integration dataflow

  1. mongo_source - loads Twitter data from MongoDB; in this case, from MongoHQ. Please see our previous post on how to connect Xplenty with MongoHQ.

    mongodb source

  2. mongo_select - removes unnecessary fields and splits each tweet's text into keywords using Flatten(TOKENIZE(text)). TOKENIZE splits the text into a bag of words, and Flatten turns that bag into one record per keyword.

    mongodb select

  3. mysql_source - loads sentiment data from the relational database. The green button at the top right-hand side automatically detects the schema.

    rdbms source

  4. mysql_select - removes irrelevant fields and cleans the sentiment keywords using Flatten(TOKENIZE(REPLACE(synset_terms,'#\\d+',''))). The REPLACE function strips the '#' and sense number suffix from each keyword (e.g. changes 'unable#1' to 'unable'). Since one dictionary entry may contain several terms, TOKENIZE and Flatten then split the text into one record per keyword.

    rdbms select

  5. join_keywords - joins the data from both sources by keyword. A left join is used so that tweet keywords that do not appear in the sentiment dictionary still survive the join.

    join mongodb with rdbms

  6. aggregate_sentiment - calculates positive and negative scores per tweet. The user is kept only for reference.

    calculate positive and negative score per tweet

  7. selectsentimentscore - calculates the final sentiment score by subtracting the negative score from the positive score.

    calculate final score

  8. output_s3 - stores results on Xplenty’s Amazon S3 account.

    store output
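The eight steps above can be sketched end to end in plain Python. This is a hedged, minimal stand-in for the Xplenty dataflow, not its actual implementation: the two tweets and two dictionary rows below are hypothetical sample data, `split()` approximates TOKENIZE + Flatten, and `dict.get` with a zero default plays the role of the left join.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for the MongoDB tweet collection (mongo_source)
# and the MySQL SentiWordNet table (mysql_source).
tweets = [
    {"id": 1, "user": "alice", "text": "big data is able and great"},
    {"id": 2, "user": "bob", "text": "unable to love hadoop today"},
]
sentiwordnet = [
    # (PosScore, NegScore, SynsetTerms)
    (0.125, 0.0, "able#1"),
    (0.0, 0.75, "unable#1"),
]

# mysql_select: strip the '#<sense number>' suffix and flatten the terms.
scores = {}
for pos_score, neg_score, terms in sentiwordnet:
    for term in re.sub(r"#\d+", "", terms).split():
        scores[term] = (pos_score, neg_score)

# mongo_select + join_keywords + aggregate_sentiment:
# tokenize each tweet, left-join keywords against the dictionary
# (missing keywords score 0), and sum scores per tweet.
totals = defaultdict(lambda: [0.0, 0.0])
for tweet in tweets:
    for word in tweet["text"].split():
        pos_score, neg_score = scores.get(word, (0.0, 0.0))
        totals[tweet["id"]][0] += pos_score
        totals[tweet["id"]][1] += neg_score

# Final select: sentiment score = positive score - negative score.
final = {tid: p - n for tid, (p, n) in totals.items()}
```

With this sample data, the first tweet ends up slightly positive (it contains "able") and the second negative (it contains "unable"), mirroring how the real dataflow scores each tweet before writing the results to S3.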


Below is a distribution chart of sentiments for Big Data tweets. As you can see, the distribution centers fairly evenly around the neutral score of 0, with sentiments skewing slightly more positive than negative. Get a free Xplenty account and start integrating your data sources.