logo
Community

Research Programs

BlogForum
Back to blog
How we made a cool data visualization video using just BigQuery and Flourish

July 01, 2020

Top Companies Contributing to Open Source
byAdrienn TordaiinTips

Who are the top companies contributing to open source? This blog post looks at how CodersRank used publicly shared data to answer this question, and how they created a series of data visualization videos.

The boom of open source software brought a change in technology that shaped the world as we know it today.

Open source exists thanks to the hard work of dedicated programmers and developers. It has become the foundation of cloud computing, software-as-service, next generation databases, mobile devices, the consumer internet, and more. 

undraw open source
Source: undraw.co

We, at CodersRank, are great admirers of open source. Almost every one of us contributes to open source regularly, and we sometimes work on a project or two together. 

In this blog post, we’ll introduce you to a video that we created. This video gives you a visual representation of companies that contributed the most to open source since 2012. If you find this data interesting, then you’d probably love to know the methodology behind it. We’ll show you exactly how we gathered the data and then how we gave it a visual spin. 

Video: Top Companies Contributing to Open Source

The video, “Top Companies Contributing to Open Source | 2012-2019”, is part of a series of data visualization videos that we came up with at the end of 2019. We made these out of curiosity, after realizing that you could see certain trends forming if you put together some of the publicly shared data. 

Haven’t seen the video? Here it is:

Behind-the-scenes of our method

Companies contributing to open source - Two black men using laptop
Source: blackillustrations.com

This will be a quick overview of our method – please see the actual code used further down this page.

Measuring contributions

In measuring the contributions we only considered the commits. We know that there are many other ways to contribute to a project not just commits but in this particular case we wanted to focus on the commits.

Defining contributing authors

We relied on the email addresses of the authors. The second part of the email is usually the company’s domain.

Assigning commits to a company

There are around 2.4B public commits in GitHub (since 2011) and we had to analyze each and every one of them to answer this question. Thankfully, not manually!

Using the GitHub API to extract that amount of data would have been impossible. Thanks to the GitHub Archive Project, all the public GitHub events are stored in a publicly available BigQuery database. Using SQL to extract data made the process easy and painless.

Cleaning up the data

After we counted the commits for each company, the data needed to be cleaned. First, we excluded email providers like gmail, hotmail, yandex etc. Then, we excluded a few more, as there were some cases where the commits were made by bots.

Converting the results to the expected format

We used Flourish to create the videos. The data had to be converted into a format that is acceptable for Flourish (i.e.: handling months with no data from a given company).

Implementation

Step 1: get the commits/domains

Top companies contributing to open source - PushEvent in BigQuery-  get the commits/domains

The payload column is what we needed here, since it contained the email address. In our example it is a454492e42fd9810e577ebee548c7e59bd883bca@live.com.au. GitHub hashed the first part of the email, but we didn’t need that anyway, because we were only curious about the domain-level information. 

The query to count the commits/domain name looked like this:

## pre-2015 API
CREATE TEMP FUNCTION
 json2array(json STRING)
 RETURNS ARRAY<STRING>
 LANGUAGE js AS """
         return JSON.parse(json).map(x=>JSON.stringify(x));
       """;
WITH
 export_domains AS(
 SELECT
   DATE_TRUNC(DATE(created_at), month) AS month,
   emails,
   ARRAY(
   SELECT
     REGEXP_EXTRACT(x, "@(.*)")
   FROM
     UNNEST(emails) x
   WHERE
     REGEXP_EXTRACT(x, "@(.*)") IS NOT NULL) AS domains
 FROM (
   SELECT
     * EXCEPT(array_commits),
     ARRAY(
     SELECT
       JSON_EXTRACT_SCALAR(x,
         '$[1]')
     FROM
       UNNEST(array_commits) x) emails
   FROM (
     SELECT
       created_at,
       json2array(JSON_EXTRACT(payload,
           '$.shas')) array_commits
     FROM
       `githubarchive.day.20130101`
     WHERE
       type='PushEvent' )))
SELECT
 month,
 flattened_domains AS email_domain,
 COUNT(flattened_domains) AS domain_count
FROM (
 SELECT
   month,
   flattened_domains
 FROM
   export_domains
 CROSS JOIN
   UNNEST(export_domains.domains) AS flattened_domains )
GROUP BY
 month,
 email_domain
ORDER BY
 month,
 domain_count DESC

After 2015, the format of the payload changed a bit and required a slightly different query:

## post-2015 API
CREATE TEMP FUNCTION
 json2array(json STRING)
 RETURNS ARRAY<STRING>
 LANGUAGE js AS """
         return JSON.parse(json).map(x=>JSON.stringify(x));
       """;
WITH
 export_domains AS(
 SELECT
   DATE_TRUNC(DATE(created_at), month) AS month,
   emails,
   ARRAY(
   SELECT
     REGEXP_EXTRACT(x, "@(.*)")
   FROM
     UNNEST(emails) x
   WHERE
     REGEXP_EXTRACT(x, "@(.*)") IS NOT NULL) AS domains
 FROM (
   SELECT
     * EXCEPT(array_commits),
     ARRAY(
     SELECT
       JSON_EXTRACT_SCALAR(x,
         '$.author.email')
     FROM
       UNNEST(array_commits) x) emails
   FROM (
     SELECT
       created_at,
       json2array(JSON_EXTRACT(payload,
           '$.commits')) array_commits
     FROM
       `githubarchive.day.20150102`
     WHERE
       type='PushEvent' )))
SELECT
 month,
 flattened_domains AS email_domain,
 COUNT(flattened_domains) AS domain_count
FROM (
 SELECT
   month,
   flattened_domains
 FROM
   export_domains
 CROSS JOIN
   UNNEST(export_domains.domains) AS flattened_domains )
GROUP BY
 month,
 email_domain
ORDER BY
 month,
 domain_count DESC

The result looked like this:


Row
month email_domain domain_count
1 2015-01-01 gmail.com 131357
2 2015-01-01 users.noreply.github.com 8802
3 2015-01-01 python.org5786
4 2015-01-01 hotmail.com4942
5 2015-01-01 fhda.edu 3888
6 2015-01-01 yahoo.com3216
7 2015-01-01 etudes.org 2736
8 2015-01-01 qq.com 1955
9 2015-01-01 sly.mn 1908
10 2015-01-01 foothill.edu 1848

Step 2: exclude email providers

The heavy lifting was done by BigQuery. We exported the results into a .csv file and used the good old Jupyter Notebooks to clean up the data.

As you can see in the example result, not surprisingly, the first one was gmail.com. Our next task was to remove the email providers from the list. 

We used a GitHub contribution of the most popular email domains for the cleanup: https://gist.github.com/tbrianjones/5992856/

And we also added some other blacklisted domains (excluded_domains.txt):

.(none)

91177308-0d34-0410-b5e6-96231b3b80d8

samo-laptop.(none)

dd0e9695-b195-4be7-bd10-2dea1a65a6b6

ubuntu.(none)

b8fc166d-592f-0410-95f2-cb63ce0dd405

b9a71923-0436-4b27-9f14-aed3839534dd

b2dd03c8-39d4-4d8f-98ff-823fe69b080e

0b4bb1d4-4e5a-0410-9cc4-b2b747904278

709f56b5-9817-0410-a4d7-c38de5d9e867

iki.fi

Gmail.com

none

example.com

1a063a9b-81f0-0310-95a4-ce76da25c4cd

localhost.localdomain

localhost

localhost.(none)

home

b8457f37-d9ea-0310-8a92-e5e31aec5664

li7-202.members.linode.com

g

users.noreply.github.com

us.door43.org

mailinator.com

smullindesign.com

review.openstack.org

nyarlabo.com

boston.com

li.gugod.org

niob.xnis.de

sly.mn

kazer.org

recoil.org

tsaousis.gr

rituwall.com

cbrese.com

renovateapp.com

scrapers.everypolitician.org

Step-by-step code walkthrough

Load BigQuery results

from tqdm.notebook import tqdm
 
import pandas as pd
import numpy as np
 
tqdm.pandas()
 
df = pd.read_csv("./email_domains_large.csv")

Merge the list of domains we want to exclude:

free_providers = list()
with open("./free_email_provider_domains.txt", "r") as f:
    for line in f.readlines():
        free_providers.append(line.strip())
excluded_emails = list()
with open("./excluded_domains.txt", "r") as fe:
    for line in fe.readlines():
        excluded_emails.append(line.strip())
free_providers = free_providers + excluded_emails

Add a new column to the dataset, whether the domain is a free email providers’ domain

df["free"] = df["email_domain"].progress_apply(lambda x: x in free_providers)

Create a list without the email providers

df_filtered = df[(~df["free"])].copy()

Add a row counter and limit the data to those domains that appear at least once among the top 30. This will make the final dataframe smaller and easier to handle.

rn = list()
for _, df_tmp in df_filtered.groupby("month"):
    t = list(range(df_tmp.shape[0]))
    rn += t
df_filtered["rn"] = rn
domains = np.unique(df_filtered[df_filtered["rn"] <= 30]["email_domain"])

Step #3: format data

As it was mentioned before, we used Flourish to create the video. In some cases there were empty months (the company didn’t have any commits) and Flourish expected the columns to be months not companies. So we had to make this transformation too.

Step-by-step code walkthrough

df_final = df_filtered[df_filtered["email_domain"].apply(lambda x: x in domains)].copy()
df_final["month"] = pd.to_datetime(df_final["month"])
date_range = pd.date_range(np.min(df_final["month"]), np.max(df_final["month"]), freq="MS")
 
temp_df_list = list()
for _, repo_data in df_final.groupby("email_domain"):
 
    df_temp = pd.DataFrame()
    df_temp["month"] = date_range
    df_temp = df_temp.merge(repo_data, on="month", how="left")
    df_temp.fillna(method="ffill", inplace=True)
    df_temp["email_domain"].fillna(method="bfill", inplace=True)
    df_temp.fillna(0, inplace=True)
    temp_df_list.append(df_temp)
 
df_full_data = pd.concat(temp_df_list, ignore_index=True).sort_values(["month", "domain_count_rolling"], ascending=[True, False])

df_chart_race_final = pd.DataFrame()
df_chart_race_final["email_domain"] = list(df_full_data["email_domain"].unique())
 
for current_month, monthly_data in df_full_data.groupby("month"):
    month_name = current_month.strftime("%Y-%m") 
    
    df_temp = df_full_data[["month", "email_domain", "domain_count_rolling"]].query("month == @current_month")
    df_chart_race_final = df_chart_race_final.merge(
                df_temp.drop("month", axis=1).rename(index=int, 
                                             columns={"domain_count_rolling": month_name}), 
                on="email_domain", how="left")

The full notebook can be found here.

Final word

Black Man _ Woman Using Mobile Phone A
Source: blackillustrations.com

More and more companies are recognizing the importance of open source software development and are committed to support it. 

We hope that visualizing just a slice of the data that these amazing men and women generate round the clock is a way to acknowledge their hard work. Thanks to them and the millions of hours they invested to build open source products, we get to use our everyday apps and software seamlessly. 

Thank a developer today!

About

Adrienn Tordai
Growth Marketer / Pizza Enthusiast @CodersRank. I love the Blue Jays, books, and The Office. Tell Elon I said hi. Always waiting for a Steam sale. | CodersRank: Our goal is supporting DEVELOPERS’ growth by their always up to date, professional CodersRank profile

open source

Recent Posts

November 22, 2024

Top Image-to-Text Conversion Tools Every Developer Should Know

See post

November 22, 2024

Beyond Coding: Developing Soft Skills for a Well-Rounded Developer Career

See post

ocr

October 29, 2024

How OCR Helps in Text Extraction From Multiple Images at Once?

See post

Contact us

Swan Buildings (1st floor)20 Swan StreetManchester, M4 5JW+441612400603community@developernation.net
HomeCommunityDN Research ProgramPanel ProgramBlog

Resources

Knowledge HubPulse ReportReportsForumEventsPodcast
Code of Conduct
SlashData © Copyright 2024 |All rights reserved
Cookie Policy |Privacy Policy