July 01, 2020
Who are the top companies contributing to open source? This blog post looks at how CodersRank used publicly shared data to answer this question, and how they created a series of data visualization videos.
The boom of open source software brought a change in technology that shaped the world as we know it today.
Open source exists thanks to the hard work of dedicated programmers and developers. It has become the foundation of cloud computing, software-as-service, next generation databases, mobile devices, the consumer internet, and more.
We, at CodersRank, are great admirers of open source. Almost every one of us contributes to open source regularly, and we sometimes work on a project or two together.
In this blog post, we’ll introduce you to a video that we created. This video gives you a visual representation of companies that contributed the most to open source since 2012. If you find this data interesting, then you’d probably love to know the methodology behind it. We’ll show you exactly how we gathered the data and then how we gave it a visual spin.
The video, “Top Companies Contributing to Open Source | 2012-2019”, is part of a series of data visualization videos that we came up with at the end of 2019. We made these out of curiosity, after realizing that you could see certain trends forming if you put together some of the publicly shared data.
Haven’t seen the video? Here it is:
This will be a quick overview of our method – please see the actual code used further down this page.
In measuring the contributions we only considered the commits. We know that there are many other ways to contribute to a project not just commits but in this particular case we wanted to focus on the commits.
We relied on the email addresses of the authors. The second part of the email is usually the company’s domain.
There are around 2.4B public commits in GitHub (since 2011) and we had to analyze each and every one of them to answer this question. Thankfully, not manually!
Using the GitHub API to extract that amount of data would have been impossible. Thanks to the GitHub Archive Project, all the public GitHub events are stored in a publicly available BigQuery database. Using SQL to extract data made the process easy and painless.
After we counted the commits for each company, the data needed to be cleaned. First, we excluded email providers like gmail, hotmail, yandex etc. Then, we excluded a few more, as there were some cases where the commits were made by bots.
We used Flourish to create the videos. The data had to be converted into a format that is acceptable for Flourish (i.e.: handling months with no data from a given company).
The payload column is what we needed here, since it contained the email address. In our example it is a454492e42fd9810e577ebee548c7e59bd883bca@live.com.au. GitHub hashed the first part of the email, but we didn’t need that anyway, because we were only curious about the domain-level information.
The query to count the commits/domain name looked like this:
## pre-2015 API
CREATE TEMP FUNCTION
json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return JSON.parse(json).map(x=>JSON.stringify(x));
""";
WITH
export_domains AS(
SELECT
DATE_TRUNC(DATE(created_at), month) AS month,
emails,
ARRAY(
SELECT
REGEXP_EXTRACT(x, "@(.*)")
FROM
UNNEST(emails) x
WHERE
REGEXP_EXTRACT(x, "@(.*)") IS NOT NULL) AS domains
FROM (
SELECT
* EXCEPT(array_commits),
ARRAY(
SELECT
JSON_EXTRACT_SCALAR(x,
'$[1]')
FROM
UNNEST(array_commits) x) emails
FROM (
SELECT
created_at,
json2array(JSON_EXTRACT(payload,
'$.shas')) array_commits
FROM
`githubarchive.day.20130101`
WHERE
type='PushEvent' )))
SELECT
month,
flattened_domains AS email_domain,
COUNT(flattened_domains) AS domain_count
FROM (
SELECT
month,
flattened_domains
FROM
export_domains
CROSS JOIN
UNNEST(export_domains.domains) AS flattened_domains )
GROUP BY
month,
email_domain
ORDER BY
month,
domain_count DESC
After 2015, the format of the payload changed a bit and required a slightly different query:
## post-2015 API
CREATE TEMP FUNCTION
json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return JSON.parse(json).map(x=>JSON.stringify(x));
""";
WITH
export_domains AS(
SELECT
DATE_TRUNC(DATE(created_at), month) AS month,
emails,
ARRAY(
SELECT
REGEXP_EXTRACT(x, "@(.*)")
FROM
UNNEST(emails) x
WHERE
REGEXP_EXTRACT(x, "@(.*)") IS NOT NULL) AS domains
FROM (
SELECT
* EXCEPT(array_commits),
ARRAY(
SELECT
JSON_EXTRACT_SCALAR(x,
'$.author.email')
FROM
UNNEST(array_commits) x) emails
FROM (
SELECT
created_at,
json2array(JSON_EXTRACT(payload,
'$.commits')) array_commits
FROM
`githubarchive.day.20150102`
WHERE
type='PushEvent' )))
SELECT
month,
flattened_domains AS email_domain,
COUNT(flattened_domains) AS domain_count
FROM (
SELECT
month,
flattened_domains
FROM
export_domains
CROSS JOIN
UNNEST(export_domains.domains) AS flattened_domains )
GROUP BY
month,
email_domain
ORDER BY
month,
domain_count DESC
The result looked like this:
Row | month | email_domain | domain_count |
1 | 2015-01-01 | gmail.com | 131357 |
2 | 2015-01-01 | users.noreply.github.com | 8802 |
3 | 2015-01-01 | python.org | 5786 |
4 | 2015-01-01 | hotmail.com | 4942 |
5 | 2015-01-01 | fhda.edu | 3888 |
6 | 2015-01-01 | yahoo.com | 3216 |
7 | 2015-01-01 | etudes.org | 2736 |
8 | 2015-01-01 | qq.com | 1955 |
9 | 2015-01-01 | sly.mn | 1908 |
10 | 2015-01-01 | foothill.edu | 1848 |
The heavy lifting was done by BigQuery. We exported the results into a .csv file and used the good old Jupyter Notebooks to clean up the data.
As you can see in the example result, not surprisingly, the first one was gmail.com. Our next task was to remove the email providers from the list.
We used a GitHub contribution of the most popular email domains for the cleanup: https://gist.github.com/tbrianjones/5992856/.
And we also added some other blacklisted domains (excluded_domains.txt):
.(none)
91177308-0d34-0410-b5e6-96231b3b80d8
samo-laptop.(none)
dd0e9695-b195-4be7-bd10-2dea1a65a6b6
ubuntu.(none)
b8fc166d-592f-0410-95f2-cb63ce0dd405
b9a71923-0436-4b27-9f14-aed3839534dd
b2dd03c8-39d4-4d8f-98ff-823fe69b080e
0b4bb1d4-4e5a-0410-9cc4-b2b747904278
709f56b5-9817-0410-a4d7-c38de5d9e867
iki.fi
Gmail.com
none
example.com
1a063a9b-81f0-0310-95a4-ce76da25c4cd
localhost.localdomain
localhost
localhost.(none)
home
b8457f37-d9ea-0310-8a92-e5e31aec5664
li7-202.members.linode.com
g
users.noreply.github.com
us.door43.org
mailinator.com
smullindesign.com
review.openstack.org
nyarlabo.com
boston.com
li.gugod.org
niob.xnis.de
sly.mn
kazer.org
recoil.org
tsaousis.gr
rituwall.com
cbrese.com
renovateapp.com
scrapers.everypolitician.org
Load BigQuery results
from tqdm.notebook import tqdm
import pandas as pd
import numpy as np
tqdm.pandas()
df = pd.read_csv("./email_domains_large.csv")
Merge the list of domains we want to exclude:
free_providers = list()
with open("./free_email_provider_domains.txt", "r") as f:
for line in f.readlines():
free_providers.append(line.strip())
excluded_emails = list()
with open("./excluded_domains.txt", "r") as fe:
for line in fe.readlines():
excluded_emails.append(line.strip())
free_providers = free_providers + excluded_emails
Add a new column to the dataset, whether the domain is a free email providers’ domain
df["free"] = df["email_domain"].progress_apply(lambda x: x in free_providers)
Create a list without the email providers
df_filtered = df[(~df["free"])].copy()
Add a row counter and limit the data to those domains that appear at least once among the top 30. This will make the final dataframe smaller and easier to handle.
rn = list()
for _, df_tmp in df_filtered.groupby("month"):
t = list(range(df_tmp.shape[0]))
rn += t
df_filtered["rn"] = rn
domains = np.unique(df_filtered[df_filtered["rn"] <= 30]["email_domain"])
As it was mentioned before, we used Flourish to create the video. In some cases there were empty months (the company didn’t have any commits) and Flourish expected the columns to be months not companies. So we had to make this transformation too.
df_final = df_filtered[df_filtered["email_domain"].apply(lambda x: x in domains)].copy()
df_final["month"] = pd.to_datetime(df_final["month"])
date_range = pd.date_range(np.min(df_final["month"]), np.max(df_final["month"]), freq="MS")
temp_df_list = list()
for _, repo_data in df_final.groupby("email_domain"):
df_temp = pd.DataFrame()
df_temp["month"] = date_range
df_temp = df_temp.merge(repo_data, on="month", how="left")
df_temp.fillna(method="ffill", inplace=True)
df_temp["email_domain"].fillna(method="bfill", inplace=True)
df_temp.fillna(0, inplace=True)
temp_df_list.append(df_temp)
df_full_data = pd.concat(temp_df_list, ignore_index=True).sort_values(["month", "domain_count_rolling"], ascending=[True, False])
df_chart_race_final = pd.DataFrame()
df_chart_race_final["email_domain"] = list(df_full_data["email_domain"].unique())
for current_month, monthly_data in df_full_data.groupby("month"):
month_name = current_month.strftime("%Y-%m")
df_temp = df_full_data[["month", "email_domain", "domain_count_rolling"]].query("month == @current_month")
df_chart_race_final = df_chart_race_final.merge(
df_temp.drop("month", axis=1).rename(index=int,
columns={"domain_count_rolling": month_name}),
on="email_domain", how="left")
The full notebook can be found here.
More and more companies are recognizing the importance of open source software development and are committed to support it.
We hope that visualizing just a slice of the data that these amazing men and women generate round the clock is a way to acknowledge their hard work. Thanks to them and the millions of hours they invested to build open source products, we get to use our everyday apps and software seamlessly.
Thank a developer today!
Adrienn Tordai
Growth Marketer / Pizza Enthusiast @CodersRank. I love the Blue Jays, books, and The Office. Tell Elon I said hi. Always waiting for a Steam sale. | CodersRank: Our goal is supporting DEVELOPERS’ growth by their always up to date, professional CodersRank profile