The 2014 Twitter Olympics: Analyzing Sochi Tweets With Hadoop-as-a-Service

(Image by { QUEEN YUNA }, Some rights reserved)

The Sochi 2014 Winter Olympics were the most celebrated winter games on the net. Their official website had more visitors in one night than Vancouver 2010 had during the entire games, and in the first five days alone there were 2.2 million tweets mentioning the #Sochi2014 hashtag. Following our own post on integrating social data with Big Data, we decided to jump on the bobsled and perform a Twitter analysis of the Sochi 2014 Winter Olympics using our very own Xplenty Hadoop-as-a-Service.


The data was collected via a prepaid package on DataSift from February 13, 2014 up to the closing ceremony on February 23, 2014. There were two interruptions, so a bit of data may have been lost. Only tweets with the following keywords and hashtags were collected: sochi, sochi olympics, winter olympics 2014, #Sochi, #WPXIOlympics, #WinterGames, #Sochi2014 and #cnnsochi. In total, we collected 3 GB of data, or 2,245,301 Sochi tweets. The data was processed via the Xplenty Hadoop-as-a-Service platform using a free sandbox cluster. Results can be accessed here as well as in the embedded spreadsheets below.

Sentiment Analysis

Did people like the Sochi Winter Olympics? There was quite a lot of debate before the event about Russia’s political policies as well as #SochiProblems, including the famous toilets. We filtered out the 1,347,185 tweets that were in English and compared them against a dictionary of positive, neutral, and negative keywords. It’s a naive way of doing sentiment analysis and not an exact science, but it can paint some kind of picture of what’s going on.

The results:

  1. Positive - 444,307 tweets - 33%

  2. Neutral - 99,856 tweets - 7%

  3. Negative - 139,289 tweets - 10%

Where did the remaining 50% or so go? 663,733 tweets in English (about 49%) had an unknown sentiment because they didn’t match any of the dictionary keywords.
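For readers who want to try this kind of dictionary matching themselves, here is a minimal Python sketch. It is not the Xplenty job used for this post: the keyword lists are tiny hypothetical stand-ins, and the tie-breaking rule (positive, then neutral, then negative) is an assumption, since the post does not say how tweets matching several lists were handled.

```python
import re

# Hypothetical keyword dictionaries; the real lists are not published in this post.
SENTIMENT_WORDS = {
    "positive": {"love", "great", "awesome", "win"},
    "neutral": {"watch", "today", "event"},
    "negative": {"hate", "awful", "fail", "broken"},
}


def classify(tweet):
    """Label a tweet by the dictionary it matches most often, or 'unknown'."""
    words = set(re.findall(r"[a-z']+", tweet.lower()))
    hits = {label: len(words & vocab) for label, vocab in SENTIMENT_WORDS.items()}
    if max(hits.values()) == 0:
        return "unknown"  # no dictionary keyword matched at all
    # Assumption: ties go to the first label in dictionary order.
    return max(hits, key=hits.get)


def shares(counts):
    """Convert raw counts per label into whole-number percentages."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}
```

Feeding the counts reported above into `shares` reproduces the rounded percentages: 33% positive, 7% neutral, 10% negative, and 49% unknown.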

Top Keywords

What were the top keywords in tweets about Sochi? We eliminated stop words such as rt, a, about, etc. and ran an analysis.

The obvious winners:

  1. Sochi2014 - 1,117,837 matches

  2. Sochi - 927,809 matches

  3. Olympics - 403,986 matches

The countries most mentioned were Canada, followed by Russia and the USA. Coincidentally, these were the countries with the largest number of qualifying athletes, although the US wasn’t among the top three nations by medal ranking (Russia, Norway, and Canada, with the US coming in fourth).

Besides the usual suspects like gold and hockey, Pussy and Riot ranked pretty high with 51,443 and 50,645 matches respectively (I guess the former was used in other contexts). It came as no surprise, since two members of the group were released from prison on February 18, 2014 and protested in Sochi, where they were flogged by Cossacks. That’s why we saw a Pussy Riot spike on that date, with 15,246 tweets mentioning Pussy Riot. Either way, they beat Putin, who only had 25,324 matches in Sochi tweets.
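The keyword count itself is simple to sketch in Python. The version below is only an illustration, not the actual Xplenty package: the stop-word list is a hypothetical sample (the post only names rt, a, and about), and it assumes case-insensitive matching and counts every occurrence of a word rather than once per tweet.

```python
import re
from collections import Counter

# Illustrative stop words; only "rt", "a", and "about" are named in the post.
STOP_WORDS = {"rt", "a", "about", "the", "in", "at"}


def top_keywords(tweets, n=3):
    """Return the n most frequent non-stop words across all tweets."""
    counts = Counter()
    for tweet in tweets:
        # \w+ drops the leading '#', so "#Sochi2014" is counted as "sochi2014".
        for word in re.findall(r"\w+", tweet.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)
```

For example, `top_keywords(["RT #Sochi2014 hockey gold", "Sochi2014 hockey tonight"], n=2)` ranks hockey and sochi2014 with two matches each.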

Popular Sports

To find out which sport was most popular, we made a keyword list of the 15 Winter Olympic events, including variations (hockey, icehockey, ice hockey, etc.), and checked how many matches they had in the collected tweets.

The winners:

  1. Hockey - 198,692 matches

  2. Ski - 64,949 matches

  3. Curling - 58,108 matches

Which kind of skiing did that refer to? Hard to say. What we do know is that in these cases the keyword ski appeared by itself, without other keywords from the sports list, so it couldn’t have been alpine skiing, for instance, which had 2,064 matches.
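Matching a sport together with its spelling variations can be sketched as follows. The variant table is a hypothetical excerpt rather than the full list of 15 events, and two choices are assumptions of this sketch: longer variants are tried first so that "alpine skiing" is not also counted as plain "ski", and each sport is counted at most once per tweet.

```python
import re
from collections import Counter

# Hypothetical excerpt of the sports list; the full 15-event list is not reproduced here.
SPORT_VARIANTS = {
    "hockey": ["ice hockey", "icehockey", "hockey"],
    "alpine skiing": ["alpine skiing", "alpine ski"],
    "ski": ["ski"],
    "curling": ["curling"],
}

# Reverse lookup from each spelling variant to its canonical sport name.
VARIANT_TO_SPORT = {
    variant: sport
    for sport, variants in SPORT_VARIANTS.items()
    for variant in variants
}

# Longest variants first, with word boundaries so "ski" does not match inside "skiing".
PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(v) for v in sorted(VARIANT_TO_SPORT, key=len, reverse=True))
    + r")\b"
)


def count_sports(tweets):
    """Count how many tweets mention each sport, using any of its variants."""
    counts = Counter()
    for tweet in tweets:
        found = {VARIANT_TO_SPORT[m] for m in PATTERN.findall(tweet.lower())}
        counts.update(found)
    return counts
```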

Most Tweets

Wonder who went crazy for Sochi? We checked out which users wrote the most Sochi tweets.

The winners:

  1. sochi2014newsen - 3,928 tweets

  2. Torontonia - 3,541 tweets

  3. Sochi_2014_News - 2,280 tweets

Despite the tweeting madness, these users had a decent though not massive following of 28, 581, and 233 followers respectively.

(Image by Atos International, Some rights reserved)

Top Retweets

This analysis was a bit trickier since no retweet data was received from DataSift, so we extracted tweets that contain RT followed by @username. We then summed up the follower counts of the users who retweeted these tweets in order to calculate how much potential exposure the original users may have gained by being retweeted.

The winners:

  1. NBCOlympics - 572 retweets - 4,298,451 potential exposures

  2. Sochi2014 - 358 retweets - 1,946,200 potential exposures

  3. NHLonNBCSports - 253 retweets - 3,352,886 potential exposures

Ranking users by exposure we got different results:

  1. AP_Sports - 20,263,219 exposures - 182 retweets

  2. guardian_sport - 6,861,787 exposures - 29 retweets

  3. NBCOlympics - 4,298,451 exposures - 572 retweets

We went on to calculate who got the most exposures per retweet:

  1. guardian_sport - 236,613 potential exposures/retweet

  2. AP_Sports - 111,336 potential exposures/retweet

  3. USATODAYsports - 79,936 potential exposures/retweet
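The retweet extraction and the exposure arithmetic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the job that produced the numbers above: the input is assumed to be (tweet text, follower count of the retweeting user) pairs, only the first "RT @username" in a tweet is credited, and the function names are made up for this example.

```python
import re
from collections import defaultdict

# "RT" followed by a space and "@username", as described in the text above.
RT_PATTERN = re.compile(r"\bRT @(\w+)")


def retweet_stats(tweets):
    """tweets: iterable of (text, followers_of_retweeting_user) pairs."""
    stats = defaultdict(lambda: {"retweets": 0, "exposure": 0})
    for text, followers in tweets:
        match = RT_PATTERN.search(text)
        if match:
            entry = stats[match.group(1)]  # the user being retweeted
            entry["retweets"] += 1
            entry["exposure"] += followers  # potential exposure gained
    return dict(stats)


def exposure_per_retweet(stats):
    """Average potential exposure gained per retweet for each original user."""
    return {user: s["exposure"] / s["retweets"] for user, s in stats.items()}
```

As a check on the arithmetic, 6,861,787 exposures over 29 retweets gives roughly 236,613 per retweet for guardian_sport, and 20,263,219 over 182 gives roughly 111,336 for AP_Sports.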

Top Locations

This analysis determined where in the world most tweets about the Sochi Winter Olympics were coming from.

The winners:

  1. United States - 31,896 tweets

  2. Canada - 13,804 tweets

  3. Россия (Russia) - 13,543 tweets

Looking at the results, the topmost line was actually blank because most of the tweets didn’t contain any geolocation data. So, we decided to include user location data in our analysis. This is an open text box that users can fill with whatever they want.

The winners:

  1. Canada - 35,234 tweets

  2. United States - 35,220 tweets

  3. Toronto - 21,231 tweets

These results were pretty messy since users wrote their location differently (e.g. UK, United Kingdom, or London). It gives a slightly better idea of what went on, though 1,847 tweets were by users who listed their location as "Earth", 795 tweets were by users from "Anywhere the News Happens", and 16 from "@ the bar toastin to good life".
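One way such free-text locations could be tidied up is with an alias table that maps each spelling to a country. We did not do this for the results above. The sketch below is purely illustrative, with a hypothetical alias table and function names, and anything it does not recognize is simply labeled "Unknown".

```python
from collections import Counter

# Hypothetical alias table; a real one would need far more entries.
LOCATION_ALIASES = {
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
    "london": "United Kingdom",
    "canada": "Canada",
    "toronto": "Canada",
    "usa": "United States",
    "united states": "United States",
}


def normalize_location(raw):
    """Map a free-text user location to a country name, or 'Unknown'."""
    return LOCATION_ALIASES.get(raw.strip().lower(), "Unknown")


def count_by_country(locations):
    """Count tweets per normalized country."""
    return Counter(normalize_location(loc) for loc in locations)
```

With this table, "UK", "London", and "United Kingdom" all land in the same bucket, while "Earth" falls into "Unknown".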
