Mining Dark Data without Hadoop

Dark data does not come from the dark side of the Force—it’s just data that you’ve collected but haven’t used yet, such as social data or call center logs. How can you process this dark data? Well, the Plotting Success blog suggests mining it with Hadoop. According to them, Hadoop is great for the job because it doesn’t require you to organize the information and can also handle unstructured data such as text, images, audio, and video.

Hadoop is definitely a great solution for processing dark data, but what if you don’t know how to use it? Hadoop requires you to buy new hardware, provide expert maintenance, and hire developers to program MapReduce jobs. Luckily, there is an alternative—Xplenty.

Xplenty lets you mine dark data without any Hadoop hassle. You don’t need to buy servers or write any code—Xplenty’s visual designer allows you to create dataflows in your browser and run jobs with a few clicks. Don’t take our word for it—see for yourself.

Mining Social Dark Data with Xplenty

As mentioned above, social networks are one of the sources of dark data. For instance, if your sales are on the decline (hypothetically, of course), you can gather tweets that mention your company name and perform sentiment analysis to assess the situation.

Let’s do just that. We’ll run a naive sentiment analysis on tweets that we gathered during the Black Friday weekend in November 2012. The analysis will let us know how people were feeling by the minute during that blackest of Fridays.

The Data

Tweets are stored as JSON objects in files on our private Amazon S3 account:

{
  "geo": null,
  "text": "going #blackfriday shopping while coming down with #bronchitis = not the best decision ever",
  "created_at": "Fri Nov 23 13:32:54 +0000 2012",
  "in_reply_to_status_id_str": null,
  "coordinates": null,
  "id_str": "271969576315666433",
  "retweeted": false,
  "in_reply_to_user_id_str": null,
  "in_reply_to_screen_name": null,
  "source": "web",
  "entities": {
    "urls": [],
    "hashtags": [{
      "text": "blackfriday",
      "indices": [6, 18]
    }, {
      "text": "bronchitis",
      "indices": [51, 62]
    }],
    "user_mentions": []
  },
  "in_reply_to_user_id": null,
  "in_reply_to_status_id": null
}
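
Outside of Xplenty, the two fields our analysis needs can be pulled from a tweet object with a few lines of Python. This is a minimal standard-library sketch (the sample tweet is abbreviated from the object above):

```python
import json
from datetime import datetime

tweet_json = '''{
  "text": "going #blackfriday shopping while coming down with #bronchitis = not the best decision ever",
  "created_at": "Fri Nov 23 13:32:54 +0000 2012"
}'''

tweet = json.loads(tweet_json)

# Twitter's created_at format: weekday, month, day, time, UTC offset, year
created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
keywords = tweet["text"].lower().split()

print(created.strftime("%H:%M"))  # 13:32
print(keywords[:2])               # ['going', '#blackfriday']
```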

SentiWordNet is stored as a CSV file:

a   00001740    0.125   0   able#1  (usually followed by 'to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"
a   00002098    0   0.75    unable#1    (usually followed by 'to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"
a   00002312    0   0   dorsal#2 abaxial#1  facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"

The schema:

  1. POS
  2. ID
  3. PosScore
  4. NegScore
  5. SynsetTerms
  6. Glossary
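
For reference, here's a minimal Python sketch (standard library only) that splits one tab-separated SentiWordNet entry into the six schema fields and strips the sense numbers, such as '#1', from the synset terms:

```python
import re

# One tab-separated SentiWordNet entry, fields per the schema above.
line = ("a\t00002098\t0\t0.75\tunable#1\t"
        "(usually followed by 'to') not having the necessary means or skill or know-how")

pos_tag, synset_id, pos_score, neg_score, synset_terms, gloss = line.split("\t")

# Strip the '#<sense number>' suffix: 'unable#1' -> 'unable'
keywords = re.sub(r"#\d+", "", synset_terms).split()

print(keywords, float(pos_score), float(neg_score))  # ['unable'] 0.0 0.75
```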


Here's the dataflow we designed in Xplenty to mine the dark data:

  1. tweets_source—loads tweets from our Amazon S3 bucket. Note that the green button at the top right can be used to auto-detect the schema and that the data is in JSON format.

    cloud storage source

  2. tweets_select—removes irrelevant fields and separates tweets into keywords using the Flatten(TOKENIZE(text)) functions. Also converts Twitter’s date/time string to a datetime data type with the ToDate(created_at, 'EEE MMM dd HH:mm:ss Z yyyy') function (Joda time formatting).


  3. sentiment_keywords_source—loads the sentiment dictionary from Amazon S3.


  4. select_sentiment_keywords—removes unnecessary fields and cleans sentiment keywords using the Flatten(TOKENIZE(REPLACE(synset_terms,'#\\d+',''))) functions. The REPLACE function removes hashtags and numbers from the keywords (e.g., changes ‘unable#1’ to ‘unable’). Since one database entry may contain several conjugations of the same keyword, the Flatten and TOKENIZE functions split the text by keyword.


  5. join_keywords—joins data from both sources by keyword. A left join is used so that tweet keywords that don't appear in the sentiment dictionary are still available after the join.

    join keywords

  6. aggregate_sentiment_score—calculates positive and negative scores per tweet. The date_time field is also kept for later use.


  7. select_score—calculates the final sentiment score by subtracting the negative score from the positive score. It also converts the date/time to time using the ToString(date_time, 'HH:mm') function.


  8. aggregate_date—counts the number of tweets and calculates the total and average sentiment score per minute.


  9. sort_results—sorts the results by time.


  10. output_s3—stores results on Xplenty’s Amazon S3 account.

    cloud storage destination
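
The steps above can be sketched end to end in plain Python. This is a naive illustration with made-up sample tweets and a four-word sentiment dictionary, not the actual Xplenty dataflow: it tokenizes each tweet, looks up each keyword's positive and negative scores (missing keywords count as zero, mirroring the left join), and aggregates count, total, and average sentiment per minute:

```python
import re
from collections import defaultdict
from datetime import datetime

# Toy inputs standing in for the two Amazon S3 sources (hypothetical sample data).
tweets = [
    {"text": "unable to find parking, worst black friday ever",
     "created_at": "Fri Nov 23 13:32:54 +0000 2012"},
    {"text": "able to grab a great deal, happy shopper",
     "created_at": "Fri Nov 23 13:32:10 +0000 2012"},
]
# keyword -> (PosScore, NegScore), as produced by select_sentiment_keywords
sentiment = {"able": (0.125, 0.0), "unable": (0.0, 0.75),
             "happy": (0.8, 0.0), "worst": (0.0, 0.9)}

per_minute = defaultdict(lambda: [0, 0.0])  # time -> [tweet count, total score]
for tweet in tweets:
    dt = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    minute = dt.strftime("%H:%M")
    # join_keywords + aggregate_sentiment_score + select_score:
    # per-tweet score = sum of (positive - negative) over its keywords
    score = 0.0
    for word in re.findall(r"\w+", tweet["text"].lower()):
        pos, neg = sentiment.get(word, (0.0, 0.0))  # left join: missing words score 0
        score += pos - neg
    per_minute[minute][0] += 1
    per_minute[minute][1] += score

# aggregate_date + sort_results: count, total and average score per minute
for minute in sorted(per_minute):
    count, total = per_minute[minute]
    print(minute, count, total, total / count)
```

Both sample tweets fall in the same minute, so the sketch prints a single row with the tweet count, total score, and average score, just as the real dataflow emits one row per minute.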


As the dark data analysis shows, sentiment toward Black Friday was slightly negative throughout the day. This analysis could provide useful insights into why everyone was so negative and why the average sentiment suddenly spiked at 17:17. Either way, Xplenty can help you mine dark data easily, without knowing anything about Hadoop.

Integrate Your Data Today!

Get a 7-day free trial. No credit card necessary.