Processing JSON Data on the Cloud

Processing JSON Data on the Cloud

We’ve processed plenty of JSON data in our blog – from Tweets to GitHub commits – but we’ve never really discussed how to process JSON with Xplenty’s data integration on the cloud. Let's do so right now.

What is JSON?

Just so that we’re all on the same page, JSON, or JavaScript Object Notation, is the most common format for sending object data as strings. Here’s an example Twitter object in JSON format:

{
    "created_at": "Tue Jul 02 09:37:57 +0000 2013",
    "id": 351998165886566400,
    "id_str": "351998165886566400",
    "text": "5 raisons pour passer à #BigData http://t.co/rKdN4Y9Kmm",
    "source": "web",
    "truncated": false,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "favorite_count": 0,
    "entities": {
        "hashtags": [{
            "text": "BigData",
            "indices": [24,
            32]
        }],
        "symbols": [],
        "urls": [{
            "url": "http://t.co/rKdN4Y9Kmm",
            "expanded_url": "http://ow.ly/mz4bZ",
            "display_url": "ow.ly/mz4bZ",
            "indices": [33,
            55]
        }],
        "user_mentions": []
    },
...

JSON is commonly used to send data between servers and web applications, for example, from Gmail servers directly to your browser. It’s a schema-less format and it’s leaner, meaner, and way more popular than its XML brother. MongoDB, for instance, supports JSON right off the bat.

Processing JSON Data with Xplenty

Handling JSON data usually requires extra DB support or custom code. However, we like to KISS (Keep It Simple, Stupid) here at Xplenty and give everybody the power to process JSON data without DB installations or going back to college for a programming course.

Let’s start at the source. When loading JSON data via Xplenty, the source type should be set to “JSON” in the source component. The fields and types can then be filled in automatically by clicking the green auto-detect schema button at the top right. Here’s an example for loading Twitter data in JSON from our public S3 bucket:

json source

Now that we have the JSON data, it can be referenced in the rest of the dataflow:

  1. JSON data type — if the field is already mapped to a JSON data type, use the hashtag followed by the field name in single quotes. e.g., retweeted_status#'id'. Note that it returns a string. If it contains some other data type, casting may be necessary to convert it to the right type, e.g. (int)user#'followers_count'.

  2. JSON string — if the field contains a JSON object as a string, and you would like to access the object’s properties, it needs to be mapped to a hashmap via the JsonStringToMap function, or JsonStringToBag in case of an array, e.g. JsonStringToMap(urls)#'expanded_url' or JsonStringToBag(entities#'urls'). To access a deeply nested JSON property in a string, you can use the JsonExtract function with the relevant JSONPath, e.g. JsonExtract(retweeted_status, '$.user.screen_name').

  3. JSON array — if the field is an array type, or the JsonStringToBag function has been used, and you want to process each array item, wrap it with the Flatten function, e.g. Flatten(JsonStringToBag(entities#'urls')). Other functions for handling bags are also available.

Let’s look at an example that puts it all together.

JSON Example Dataflow

The following dataflow analyzes Twitter data to find out which URLs mentioned in tweets gained the most potential exposure:

json dataflow

  1. tweets_source — loads the data (see screenshot above).

  2. select_json_fields — selects JSON fields and extracts URLs via the Flatten(JsonStringToBag(entities#'urls')) functions.

    select

  3. select_final_fields — converts the followers field to an integer and retrieves the expanded URLs via the JsonExtract(urls, '$.expanded_url') function.

    select

  4. aggregate_followers — sums up the total exposure for each URL, counts the number of tweets that contain them, and also collects the user names of the users who tweeted them.

    aggregate

  5. sort_by_follwers — sorts the data by exposure in descending order.

    sort

  6. store_results — stores the output back to Amazon S3.

    store

Results

We ran the dataflow using Big Data tweets that we collected and found the most highly exposed URLs. You can process your own JSON data, whether it’s from Twitter, MongoDB, or anywhere else. Just sign up with Xplenty for free and get started right away.


Integrate Your Data Today!

Try Xplenty free for 7 days. No credit card required.