Everybody has issues, and so do users and repositories on GitHub. That’s why we decided to answer this year’s GitHub Data Challenge by heading where developers fear to tread and analyzing GitHub issues in 2013.
We asked the following questions:
- How did issues distribute over time?
- Which users had the most issues?
- Which repositories had the most issues?
- Which repositories were the fastest and slowest in closing issues?
We performed our analysis by loading data from the GitHub Archive, processing it with Xplenty’s data integration on the cloud and then analysing the results in Excel. For further details, please see the last section in this post. Our processed data and scripts are all available in a GitHub repository.
Note – the data for this analysis only included issue events which happened on GitHub in 2013. Issues which were opened/closed/reopened before or after 2013 and didn’t have any relevant activity in 2013 were not part of this analysis. Issue comments weren’t included either. Also, certain data, such as issue titles and labels, were not available in the GitHub Archive.
Without further ado, here are the results of our analysis.
Issues over time
In total, there were 4,626,942 issue events on GitHub in 2013: 2,776,006 (60%) opened events, 1,778,477 (38%) closed events and 72,459 (2%) reopened events.
The above chart plots how many issue events happened on each day in 2013 and shows the following:
- More issues were opened than closed on most days.
- Issue activity changed over the days of the week.
- Activity peaked on 10 February when 27,521 issues were opened and on 9 September when 17,832 issues were closed.
Let’s start with the last point. What happened on 10 February and 9 September last year that sent people into a frenzy? Did any specific users or repositories have a lot of issues? We analysed the data and found the answers.
On 10 February 2013 a user called rsdnru single-handedly opened 21,097 issues. This user was suspiciously related to repositories where most of the issues were opened that day – rsdn/nemerle with 10,164 issues, rsdn/RsdnFormatter with 5,472 issues, and rsdn/avalon with 5,461 issues. On 9 September there was intense activity over at the Khan/khan-exercises repository. In fact, the KhanBugz user alone was responsible for most of the action that day by closing 11,267 issues.
Who are these mysterious users? Are they man or machine? rsdnru doesn’t have a GitHub profile anymore, but the rsdn repository is still up and running. As it turns out, it’s the Russian Software Developer Network. Deleting a user causes all its issues to be deleted too, so whatever happened that day will continue to remain an enigma. As for KhanBugz and khan-exercises, they belong to the famous Khan Academy which provides free online courses. KhanBugz’s issues look well formatted, so maybe they have been submitted via a form by the Khan Academy students.
Taking a look at the big picture, an average of 7,605 issues were opened each day, 4,873 were closed, and 199 were reopened. Looking at the average number of issue events per weekday clearly shows that users have the fewest issues on Saturdays and the most issues on Tuesdays:
Reviewing the number of issues per month, November was the most active month in 2013 while January was the least active. Maybe folks were getting their issues out before the start of the holiday season?
Users got issues
399,421 users had some kind of issue event on GitHub in 2013. Most of them had very few though: about 41% of the above users only opened/closed/reopened one issue in 2013, over 84% no more than 10 and over 98% no more than 100.
But some users had tons of issues. Remember KhanBugz? This user opened and closed more issues than anybody else in 2013, making it the GitHub issues heavyweight champion of the year. The reopened issues chart is topped by sbezborotest, although that user's issues look automated considering their titles (“testing 123”). Setting that account aside, prock-fife was the number one reopener of 2013.
Opened Issues Top 5 Users 2013
Closed Issues Top 5 Users 2013
Reopened Issues Top 5 Users in 2013
Repositories of issues
268,980 repositories had issue activity in 2013. Just like users, most repositories didn’t have many issues: over 27% had just a single issue event, over 78% had no more than 10 issue events and about 97% had no more than 100.
The Wrath of Khan is not over yet – the Khan/khan-exercises repository topped the opened and closed issues charts. sbezborotest/test was back at it again in the reopened issues chart, so fifengine/fifengine was the real “winner” here with the most reopened issues in 2013.
Open Issues Top 5 Repositories 2013
Closed Issues Top 5 Repositories 2013
Reopened Issues Top 5 Repositories 2013
Taking Care of Issues
How long did it take to take care of issues on GitHub? We found that a lot of issues were closed immediately after they were opened. Therefore, we decided to keep only issues that took at least 30 seconds to close, a reasonable time for a human to take a look at them and then take action.
It wasn’t enough. A lot of repositories opened and closed no more than ten issues throughout the year. Even the repositories that handled more than 100 issues included a lot of test repositories that no longer exist. We wanted the numbers to tell us something about big ongoing projects, so finally we kept only repositories that opened and closed more than 1,000 issues in 2013.
And the winner is … turesheim/eclipse-utilities which took care of 5,189 issues in an average of 1.88 days per issue! But wait a minute, something’s not right – a lot of the issues there seemed automated and “Inadvertently created by Zapier sync” as their title proclaimed. Hmm. The repository that took second place, MrNukealizer/SCII-External-Maphack, also had a lot of auto-reported issues, crash issues in this case, although they were quickly closed by, probably, a human. So, maybe woothemes/woocommerce should be the real winner since they worked hard to take care of real issues. Congrats!
Fastest Issue Closing Repositories 2013
| Repository | Closed Issues | Average Closing Time (days) |
Slowest Issue Closing Repositories 2013 (by days)
| Repository | Closed Issues | Average Closing Time (days) |
Collecting, processing and analyzing the data
We got the data by downloading the GitHub Archive with a script and then uploading the files to our public Amazon S3 directory (s3://xplenty.public.s3.amazonaws.com/github_archive/). Considering the large quantity of data, we used Xplenty to process it and then analysed the results in Excel.
In most cases, the data was processed as follows:
- The data was loaded from Amazon S3. Since all the filenames were in date format, a wildcard in the file path was used to load data only for 2013.
- Only events of type IssuesEvent were kept.
- The relevant JSON fields were selected. To get the number of issues opened/closed/reopened as columns, we used the select component with case conditions that set 1 or 0 accordingly for each action. The aggregate component then summed them up by the relevant group (day, issue id, user id, or repository id).
- The data was aggregated.
- The results were stored back to Amazon S3.
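For readers without Xplenty, the steps above can be approximated in plain Python. This is only a minimal sketch, not the dataflow we actually ran: it assumes each archive line is a JSON event with `type`, `created_at` and `payload.action` fields, it treats the first ten characters of `created_at` as the day, and the sample records and the function name `count_issue_events_per_day` are made up for illustration.

```python
import json
from collections import defaultdict

# Hypothetical sample records; the field names are an assumption about the
# archive format, and real files hold one JSON event per line.
SAMPLE_LINES = [
    '{"type": "IssuesEvent", "created_at": "2013-02-10T08:00:00Z", "payload": {"action": "opened"}}',
    '{"type": "IssuesEvent", "created_at": "2013-02-10T09:30:00Z", "payload": {"action": "opened"}}',
    '{"type": "IssuesEvent", "created_at": "2013-02-10T11:00:00Z", "payload": {"action": "closed"}}',
    '{"type": "PushEvent", "created_at": "2013-02-10T12:00:00Z", "payload": {}}',
    '{"type": "IssuesEvent", "created_at": "2013-09-09T07:15:00Z", "payload": {"action": "reopened"}}',
]

ACTIONS = ("opened", "closed", "reopened")


def count_issue_events_per_day(lines):
    """Keep IssuesEvent records, flag each action as 1/0, and sum per day."""
    totals = defaultdict(lambda: {name: 0 for name in ACTIONS})
    for line in lines:
        event = json.loads(line)
        # Filter step: skip everything that is not an issue event.
        if event.get("type") != "IssuesEvent":
            continue
        action = event.get("payload", {}).get("action")
        day = event["created_at"][:10]  # assumed YYYY-MM-DD prefix
        # Select step: one 1/0 column per action, like the case conditions.
        flags = {name: int(action == name) for name in ACTIONS}
        # Aggregate step: sum the flags for the group (here, the day).
        for name, flag in flags.items():
            totals[day][name] += flag
    return dict(totals)


if __name__ == "__main__":
    for day, counts in sorted(count_issue_events_per_day(SAMPLE_LINES).items()):
        print(day, counts)
```

Grouping by user or repository instead of day would only change the key used in `totals`.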
Below are specific dataflows that we used to process the data.
Issues per day
Issues per user
The extra selectfinalfields component generated user URLs on GitHub.
Issues per repository
This dataflow was slightly more complex:
- The clone component split the dataflow in two where one side matched ‘open’ actions and the other ‘closed’.
- Both sides were joined by issue number, thus providing the relevant opening and closing datetime in the same row.
- The time difference was calculated using a select component and the
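The split-and-join idea in this dataflow can be illustrated with a small Python sketch. It is an approximation under stated assumptions rather than the actual Xplenty components: the event tuples and repository name are invented, the join key is the pair of repository and issue number (an assumption, since issue numbers repeat across repositories), only the first open and first close of each issue are used, and the 30-second threshold mirrors the filter described earlier in this post.

```python
from datetime import datetime

# Hypothetical rows: (repository, issue number, action, ISO timestamp).
EVENTS = [
    ("acme/app", 1, "opened", "2013-03-01T10:00:00"),
    ("acme/app", 1, "closed", "2013-03-03T10:00:00"),
    ("acme/app", 2, "opened", "2013-03-05T09:00:00"),
    ("acme/app", 2, "closed", "2013-03-05T09:00:10"),  # closed after 10 seconds
    ("acme/app", 3, "opened", "2013-04-01T00:00:00"),  # never closed
]

SECONDS_PER_DAY = 86400


def average_closing_days(events, min_seconds=30):
    """Pair opened/closed events per issue and average the gap per repository."""
    opened, closed = {}, {}
    # "Clone" step: route each event to the opened side or the closed side.
    for repo, number, action, timestamp in events:
        when = datetime.fromisoformat(timestamp)
        key = (repo, number)
        if action == "opened":
            opened.setdefault(key, when)  # keep the first open only
        elif action == "closed":
            closed.setdefault(key, when)  # keep the first close only
    durations = {}
    # "Join" step: only issues present on both sides get a closing time.
    for key in opened.keys() & closed.keys():
        seconds = (closed[key] - opened[key]).total_seconds()
        if seconds >= min_seconds:  # drop issues closed almost instantly
            durations.setdefault(key[0], []).append(seconds / SECONDS_PER_DAY)
    return {repo: sum(days) / len(days) for repo, days in durations.items()}


if __name__ == "__main__":
    print(average_closing_days(EVENTS))
```

With the sample rows, only issue 1 survives the join and the 30-second filter, so the repository's average is its two-day closing time.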
Exploring GitHub issue events in 2013 revealed several insights: More issues were opened than closed; there weren’t many issues on Saturdays; November was full of issues; bots have issues too; Khan Academy has lots of issues; and some repositories took a good part of the year to deal with their issues. We were happy to eat our own dog food and process the data with Xplenty to help shed new light on GitHub.