GitHub, You Got Issues: An Analysis of Issues on GitHub in 2013

Everybody has issues, and so do users and repositories on GitHub. That’s why we decided to answer this year’s GitHub Data Challenge by heading where developers fear to tread and analyzing GitHub issues in 2013.

We asked the following questions:

  1. How did issues distribute over time?
  2. Which users had the most issues?
  3. Which repositories had the most issues?
  4. Which repositories were the fastest and slowest in closing issues?

We performed our analysis by loading data from the GitHub Archive, processing it with Xplenty’s data integration on the cloud and then analysing the results in Excel. For further details, please see the last section in this post. Our processed data and scripts are all available in a GitHub repository.

Note – the data for this analysis only included issue events which happened on GitHub in 2013. Issues which were opened/closed/reopened before or after 2013 and didn’t have any relevant activity in 2013 were not part of this analysis. Issue comments weren’t included either. Also, certain data, such as issue titles and labels, were not available in the GitHub Archive.

Without further ado, here are the results of our analysis.

Issues over time

GitHub Issue Events 2013

In total, there were 4,626,942 issue events on GitHub in 2013: 2,776,006 (60%) opened events, 1,778,477 (38%) closed events and 72,459 (2%) reopened events.
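As a quick sanity check on the arithmetic above, the following sketch recomputes the total and the rounded shares from the three event counts quoted in this post. The variable names are ours and purely illustrative.

```python
# Event counts as quoted in the post.
counts = {"opened": 2_776_006, "closed": 1_778_477, "reopened": 72_459}

# Total number of issue events and each action's share, rounded to a whole percent.
total = sum(counts.values())
shares = {action: round(100 * n / total) for action, n in counts.items()}

print(total)
print(shares)
```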

GitHub Issue Events per Day 2013

The above chart of issue events per day in 2013 shows the following:

  1. More issues were opened than closed on most days.
  2. Issue activity changed over the days of the week.
  3. Activity peaked on 10 February when 27,521 issues were opened and on 9 September when 17,832 issues were closed.

Let’s start with the last point. What happened on 10 February and 9 September last year that sent people into a frenzy? Did any specific users or repositories have a lot of issues? We analysed the data and found the answers.

On 10 February 2013 a user called rsdnru single-handedly opened 21,097 issues. This user was suspiciously related to repositories where most of the issues were opened that day – rsdn/nemerle with 10,164 issues, rsdn/RsdnFormatter with 5,472 issues, and rsdn/avalon with 5,461 issues. On 9 September there was intense activity over at the Khan/khan-exercises repository. In fact, the KhanBugz user alone was responsible for most of the action that day by closing 11,267 issues.

Who are these mysterious users? Are they man or machine? rsdnru doesn’t have a GitHub profile anymore, but the rsdn repository is still up and running. As it turns out, it’s the Russian Software Developer Network. Deleting a user causes all its issues to be deleted too, so whatever happened that day will continue to remain an enigma. As for KhanBugz and khan-exercises, they belong to the famous Khan Academy which provides free online courses. KhanBugz’s issues look well formatted, so maybe they have been submitted via a form by the Khan Academy students.

Taking a look at the big picture, an average of 7,605 issues were opened each day, 4,873 were closed, and 199 were reopened. Looking at the average number of issue events per weekday clearly shows that users have the fewest issues on Saturdays and the most issues on Tuesdays:

GitHub Average Issue Events per Weekday 2013
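For readers who want to reproduce a per-weekday average from daily totals, here is a minimal sketch. It is not the dataflow we ran; the function name and the sample counts are made up for illustration and are not the 2013 figures.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def average_per_weekday(daily_counts):
    """Map weekday name -> mean number of events on that weekday."""
    totals = defaultdict(int)   # sum of events per weekday
    days = defaultdict(int)     # number of dates seen per weekday
    for day, count in daily_counts.items():
        name = WEEKDAYS[day.weekday()]
        totals[name] += count
        days[name] += 1
    return {name: totals[name] / days[name] for name in totals}

# Made-up daily counts for illustration only.
sample = {
    date(2013, 1, 1): 100,   # a Tuesday
    date(2013, 1, 5): 20,    # a Saturday
    date(2013, 1, 8): 120,   # a Tuesday
    date(2013, 1, 12): 40,   # a Saturday
}
averages = average_per_weekday(sample)
print(averages)
```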

Reviewing the number of issues per month, November was the most active month in 2013 while January was the least active. Maybe folks were getting their issues out before the start of the holiday season?

GitHub Issue Events per Month 2013

Users got issues

399,421 users had some kind of issue event on GitHub in 2013. Most of them had very few though: about 41% of the above users only opened/closed/reopened one issue in 2013, over 84% no more than 10 and over 98% no more than 100.

Maximum Issue Events / Users

But some users had tons of issues. Remember KhanBugz? This user opened and closed more issues than anybody else in 2013, making it the GitHub issues heavyweight champion of the year. The reopened issues chart is ruled by sbezborotest, although its issues look automated judging by their titles (“testing 123”). This makes prock-fife the real number one reopener of 2013.

Opened Issues Top 5 Users 2013

KhanBugz              125,139
antoniovazquezblanco   55,996
cichockimc             23,123
rsdnru                 21,098
sageb0t                15,084

Closed Issues Top 5 Users 2013

KhanBugz         68,209
trel             10,041
sageb0t           9,227
ericvaandering    8,137
turesheim         5,212

Reopened Issues Top 5 Users in 2013

sbezborotest  1,592
prock-fife      638
ivanov007       144
jfelchner       142
Avangard        129

Repositories of issues

268,980 repositories had issue activity in 2013. Just like users, most repositories didn’t have many issues: over 27% had just a single issue event, over 78% had no more than 10 issue events and about 97% had no more than 100.

Maximum Issue Events / Repositories

The Wrath of Khan is not over yet – the Khan/khan-exercises repository topped the opened and closed issues charts. sbezborotest/test was back at it again in the reopened issues chart, so fifengine/fifengine was the real “winner” here with the most reopened issues in 2013.

Opened Issues Top 5 Repositories 2013

Khan/khan-exercises  115,631
pulWifi/pulWifi       55,968
cichockimc/zapier     23,113
sageb0t/testsage      14,332
rsdn/nemerle          10,272

Closed Issues Top 5 Repositories 2013

Khan/khan-exercises          70,455
sageb0t/testsage              8,705
turesheim/eclipse-utilities   5,205
sbezborotest/test             4,898
artworx/webhook-test          3,709

Reopened Issues Top 5 Repositories 2013

sbezborotest/test         1,593
fifengine/fifengine         638
glasklart/hd                498
owncloud/core               271
Atlantiss/AtlantissCore     184

Taking Care of Issues

How long did it take to take care of issues on GitHub? We found out that a lot of issues were closed immediately after they were opened. Therefore, we decided to keep only issues that took at least 30 seconds to close, a reasonable time for a human to look at an issue and act on it.

It wasn’t enough. A lot of repositories opened and closed no more than ten issues throughout the year. Even repositories that handled more than 100 issues included a lot of test repositories that no longer exist. We wanted the numbers to tell us something about big ongoing projects, so finally we kept only repositories that opened and closed more than 1,000 issues in 2013.
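The two filters described above can be sketched as follows. This is an illustration rather than our actual Xplenty dataflow: the function name and input shape are hypothetical, and applying the 1,000-issue threshold after the 30-second filter is an assumption.

```python
from collections import defaultdict

MIN_SECONDS = 30      # ignore issues closed faster than a human plausibly could
MIN_ISSUES = 1000     # keep repositories with more than this many closed issues
SECONDS_PER_DAY = 86_400

def average_closing_days(closed_issues, min_seconds=MIN_SECONDS,
                         min_issues=MIN_ISSUES):
    """closed_issues: iterable of (repo, seconds_to_close) pairs.

    Returns repo -> average closing time in days, for repositories that
    still have more than min_issues issues after the time filter.
    """
    per_repo = defaultdict(list)
    for repo, seconds in closed_issues:
        if seconds >= min_seconds:          # first filter: at least 30 seconds
            per_repo[repo].append(seconds)
    return {
        repo: sum(times) / len(times) / SECONDS_PER_DAY
        for repo, times in per_repo.items()
        if len(times) > min_issues          # second filter: big projects only
    }

# Made-up data; the threshold is lowered so the tiny sample passes.
sample = [("a/x", 5), ("a/x", 86_400), ("a/x", 172_800), ("b/y", 60)]
result = average_closing_days(sample, min_issues=1)
print(result)
```

In the sample, the 5-second issue is dropped by the first filter and b/y is dropped by the second, leaving a/x with an average of 1.5 days.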

And the winner is … turesheim/eclipse-utilities which took care of 5,189 issues in an average of 1.88 days per issue! But wait a minute, something’s not right – a lot of the issues there seemed automated and “Inadvertently created by Zapier sync” as their title proclaimed. Hmm. The repository that took second place, MrNukealizer/SCII-External-Maphack, also had a lot of auto-reported issues, crash issues in this case, although they were quickly closed by, probably, a human. So, maybe woothemes/woocommerce should be the real winner since they worked hard to take care of real issues. Congrats!

The slowest repositories to take care of issues may not have any excuse. Down at makoto-unity/unity-doc-script people had to wait an average of about 121 days for their 1,661 issues to close. Actually, a lot of these issues look auto-reported, which doesn’t really help. Interesting to see the famous angular.js coming in third by taking care of 1,354 issues in an average time of about 39 days per issue. JavaScript issues must be harder to solve.

Fastest Issue Closing Repositories 2013

Repository                         Closed Issues  Average Closing Time (days)
turesheim/eclipse-utilities                5,189                         1.88
MrNukealizer/SCII-External-Maphack         1,047                         5.59
woothemes/woocommerce                      1,665                         7.13
twbs/bootstrap                             2,435                         7.77
yiisoft/yii2                               1,075                         8.27

Slowest Issue Closing Repositories 2013 (by days)

Repository                     Closed Issues  Average Closing Time (days)
makoto-unity/unity-doc-script          1,661                       121.12
mozilla/rust                           2,207                        52.76
angular/angular.js                     1,354                        39.02
lefnire/habitrpg                       1,347                        36.71
unity3d-jp/unity-doc-script            1,091                        35.19

Collecting, processing and analyzing the data

We got the data by downloading the GitHub archive with a script and then uploading the files to our public Amazon S3 directory at s3://xplenty.public.s3.amazonaws.com/github_archive/. Considering the large quantities of data, we used Xplenty to process it and then analysed the results in Excel.
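Our download script is not reproduced here, but a sketch of the first step, listing one archive file per hour of 2013, might look like the following. The URL pattern is an assumption based on how GitHub Archive names its hourly dumps; check it against the archive's own documentation before relying on it. The function name is hypothetical.

```python
from datetime import datetime, timedelta

# Assumed naming scheme for hourly dumps: <date>-<hour>.json.gz, with the
# hour not zero-padded. Verify against the GitHub Archive documentation.
URL_TEMPLATE = "https://data.gharchive.org/{day}-{hour}.json.gz"

def hourly_archive_urls(year):
    """Return one URL per hour of the given year, in chronological order."""
    current = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    urls = []
    while current < end:
        urls.append(URL_TEMPLATE.format(day=current.strftime("%Y-%m-%d"),
                                        hour=current.hour))
        current += timedelta(hours=1)
    return urls

urls = hourly_archive_urls(2013)
print(len(urls), urls[0], urls[-1])
```

The sketch only builds the list of URLs; downloading and uploading to S3 are left out.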

In most cases, the data was processed as follows:

  1. The data was loaded from Amazon S3. Since all the filenames were in date format, a wildcard in the file path was used to load data only for 2013.
    cloud storage source
  2. Only IssuesEvent records were kept.
    filter
  3. The relevant JSON fields were selected. To get the number of issues opened/closed/reopened as columns, we used the select component with case conditions that set 1 or 0 accordingly for each action. The aggregate component then summed them up by the relevant group (day, issue id, user id, or repository id).
    select
  4. The data was aggregated.
    aggregate
  5. The results were stored back to Amazon S3.
    cloud storage destination
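For readers without Xplenty, the filter, select and aggregate steps above can be approximated in a few lines of Python. This is a sketch, not our dataflow: the flat record layout (type, action, user, day) is a simplification we assume for illustration and is not the GitHub Archive schema.

```python
from collections import defaultdict

def count_issue_actions(events, group_key):
    """Count opened/closed/reopened issue events per group (day, user, repo...)."""
    counts = defaultdict(lambda: {"opened": 0, "closed": 0, "reopened": 0})
    for event in events:
        # Filter step: keep only issue events.
        if event["type"] != "IssuesEvent":
            continue
        # Select step: turn the action into 1/0 indicator values, one per column.
        # Aggregate step: sum the indicators within each group.
        row = counts[event[group_key]]
        for name in row:
            row[name] += 1 if event["action"] == name else 0
    return dict(counts)

# Made-up events for illustration only.
events = [
    {"type": "IssuesEvent", "action": "opened", "user": "alice", "day": "2013-02-10"},
    {"type": "IssuesEvent", "action": "closed", "user": "alice", "day": "2013-02-11"},
    {"type": "PushEvent", "user": "bob", "day": "2013-02-10"},
    {"type": "IssuesEvent", "action": "opened", "user": "bob", "day": "2013-02-10"},
]
per_user = count_issue_actions(events, "user")
per_day = count_issue_actions(events, "day")
print(per_user)
print(per_day)
```

Changing the group key is all it takes to switch between the per-day, per-user and per-repository views.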

Below are specific dataflows that we used to process the data.

Issues per day

issues per day

Issues per user

issues per user

The extra select_final_fields component generated user URLs on GitHub.

Issues per repository

issues per repository

Closing time

cloud storage source

This dataflow was slightly more complex:

  1. The clone component split the dataflow in two where one side matched ‘open’ actions and the other ‘closed’.
  2. Both sides were joined by issue number, thus providing the relevant opening and closing datetime in the same row.
  3. The time difference was calculated using a select component and the SecondsBetween function.
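The same split, join and time-difference logic can be sketched in Python as below. It is an approximation under stated assumptions: the record fields are hypothetical, the action values are assumed to be "opened" and "closed", the timestamp format is assumed to be ISO 8601 in UTC, and we key on repository plus issue number because issue numbers repeat across repositories.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"   # assumed timestamp layout

def seconds_to_close(events):
    """Return (repo, number) -> seconds between the opened and closed events."""
    opened, closed = {}, {}
    for event in events:
        key = (event["repo"], event["number"])
        when = datetime.strptime(event["created_at"], TIMESTAMP_FORMAT)
        # "Clone" step: route each event to the opened or the closed side.
        if event["action"] == "opened":
            opened[key] = when
        elif event["action"] == "closed":
            closed.setdefault(key, when)   # keep the first close seen
    # Join step: only issues present on both sides get a duration.
    return {key: (closed[key] - opened[key]).total_seconds()
            for key in opened.keys() & closed.keys()}

# Made-up events for illustration only.
events = [
    {"repo": "a/x", "number": 1, "action": "opened", "created_at": "2013-03-01T12:00:00Z"},
    {"repo": "a/x", "number": 1, "action": "closed", "created_at": "2013-03-01T12:00:45Z"},
    {"repo": "a/x", "number": 2, "action": "opened", "created_at": "2013-03-02T08:00:00Z"},
]
durations = seconds_to_close(events)
print(durations)
```

Issue 2 in the sample was never closed, so it drops out of the join, just as issues without a closing event in 2013 dropped out of our analysis.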

Summary

Exploring GitHub issue events in 2013 revealed several insights: More issues were opened than closed; there weren’t many issues on Saturdays; November was full of issues; bots have issues too; Khan Academy has lots of issues; and some repositories took a good part of the year to deal with their issues. We were happy to eat our own dog food and process the data with Xplenty to help shed new light on GitHub.

