GitHub has recently announced its third annual Data Challenge—a competition involving visuals and prose celebrating GitHub’s data. Contestants have produced a variety of cool projects in the last two years, and we’re really keen to see what they will come up with next. Heck, we’re also entering the competition ourselves!

What if you have a killer idea for GitHub’s Data Challenge but no money, servers, or programmers at your disposal? We have the solution—you can sign up for Xplenty for free, process the data in our visual editor, and run it on a cluster without any installations or code. Let’s look at a sample project to see how it’s done.

Processing GitHub Data

The Data

GitHub provides several sources of data: the API, the GitHub Archive, Google BigQuery, and GHTorrent. We accessed the GitHub Archive and loaded all data collected since January 2012 into our public Amazon S3 directory at s3://

The data arrives as JSON objects with 18 possible event types—from new commits to adding project members. Here’s a sample object:

    "type": "PushEvent",
    "repo": {
      "id": 3055800,
      "url": "",
      "name": "knowledge-point/tinypm-backup"
    "created_at": "2012-01-01T00:00:09Z",
    "payload": {
      "ref": "refs/heads/master",
      "push_id": 55756268,
      "commits": [{
        "sha": "ad9010cbf0ecfd252c873ea7530342291f3e574b",
        "author": {
          "email": "",
          "name": "Knowledge Point"
        "url": "",
        "message": "backup at Sun Jan  1 00:00:02 UTC 2012"
      "head": "ad9010cbf0ecfd252c873ea7530342291f3e574b",
      "size": 1
    "actor": {
      "login": "kp-backup",
      "id": 1287779,
      "url": "",
      "avatar_url": "",
    "gravatar_id": "b55e2cae26595d21039ad1bc05db5950"
    "public": true,
    "id": "1508512236"

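Since each archive file is just newline-delimited JSON, a few lines of Python are enough to get a feel for the structure. The one-line sample below is abridged to a handful of fields for illustration:

```python
import json
from datetime import datetime

# Abridged single-event sample; real archive files hold one JSON object per line.
raw = '{"type": "PushEvent", "created_at": "2012-01-01T00:00:09Z", "payload": {"size": 1}}'

event = json.loads(raw)
# created_at is an ISO-8601 UTC timestamp; reduce it to a date for daily rollups.
day = datetime.strptime(event["created_at"], "%Y-%m-%dT%H:%M:%SZ").date()
print(event["type"], day)  # PushEvent 2012-01-01
```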

The following dataflow processes GitHub Archive data from July 2014 and determines how many commits were made per language per day:

GitHub dataflow

  1. Github_archive — loads GitHub data from Amazon S3. Note that the path wildcard loads all files from July 2014 and that the JSON source type is selected. All the fields are filled in automatically by clicking the green button at the top right.

  2. extract_data — selects the relevant fields and gets rid of the rest. Hashtags are used to select JSON properties, e.g. repo#'language'. The exact datetime is converted to a date-only format via the ToString(created_at, 'yyyy-MM-dd') function.

  3. filterpushevents — as its name suggests, keeps only push events (commits) via the event_type alias set in the previous component.

  4. aggbylang_day — counts commits per language per day.

  5. sortbydaybylanguage — sorts the output by date and language.

  6. target_analysis — stores the results back to S3.

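The steps above can be sketched in plain Python as well. This is only an illustration of the pipeline's logic, not what Xplenty runs under the hood: the field names and the choice to count each entry in the payload's commits list are assumptions based on the sample event.

```python
import json
from collections import Counter

def commits_per_language_per_day(lines):
    """Replicate the dataflow: filter push events, count commits, sort."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("type") != "PushEvent":            # filterpushevents
            continue
        day = event["created_at"][:10]                  # ToString(created_at, 'yyyy-MM-dd')
        language = event.get("repo", {}).get("language") or "Unknown"
        # aggbylang_day: count the commits carried by each push event
        counts[(day, language)] += len(event.get("payload", {}).get("commits", []))
    return sorted(counts.items())                       # sortbydaybylanguage

# Two hypothetical archive lines: one push with two commits, one non-push event.
sample = [
    '{"type": "PushEvent", "created_at": "2014-07-01T10:00:00Z",'
    ' "repo": {"language": "JavaScript"}, "payload": {"commits": [{}, {}]}}',
    '{"type": "WatchEvent", "created_at": "2014-07-01T11:00:00Z", "repo": {}}',
]
print(commits_per_language_per_day(sample))
# [(('2014-07-01', 'JavaScript'), 2)]
```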

We opened the results in Excel and inserted a pivot table to get the top 10 languages by commits per day (there are dozens of languages on GitHub). Finally, we created a stacked line graph that shows the results:

GitHub July 2014 Top 10 Languages by Commits per Day
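If Excel isn't your thing, the same top-10 rollup can be done in a few lines of Python. The (day, language, commits) rows below are made-up numbers standing in for the dataflow's output:

```python
from collections import defaultdict

# Hypothetical (day, language, commits) rows mirroring the pipeline's output.
rows = [
    ("2014-07-01", "JavaScript", 900),
    ("2014-07-01", "Java", 700),
    ("2014-07-02", "JavaScript", 950),
    ("2014-07-02", "CSS", 400),
]

# Sum commits per language across all days, then keep the ten biggest.
totals = defaultdict(int)
for day, language, commits in rows:
    totals[language] += commits

top10 = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top10)  # [('JavaScript', 1850), ('Java', 700), ('CSS', 400)]
```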

JavaScript, Java, and CSS were the most frequently committed languages in July 2014. The number of commits also clearly varied with the day of the week, so we created another pivot table and a bar chart that break down July’s commits by weekday. Programmers didn’t commit much code on Saturdays but were really productive on Tuesdays and Wednesdays:

GitHub July 2014 Commits by Language per Day of the Week
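The weekday rollup behind that chart is a one-liner per date. Here is a sketch with made-up daily totals (the real chart aggregates the full July output):

```python
from datetime import date

# Hypothetical daily commit totals keyed by ISO date.
commits_by_day = {"2014-07-05": 120, "2014-07-08": 480, "2014-07-09": 470}

by_weekday = {}
for day_str, commits in commits_by_day.items():
    y, m, d = map(int, day_str.split("-"))
    weekday = date(y, m, d).strftime("%A")          # e.g. "Saturday"
    by_weekday[weekday] = by_weekday.get(weekday, 0) + commits

print(by_weekday)
# {'Saturday': 120, 'Tuesday': 480, 'Wednesday': 470}
```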

Inspired to create your own GitHub data project? Cool! Sign up for Xplenty for free and start processing GitHub data right away. We’re here if you need any help.