An image depicting the final network mapping. 14 concentric circles with lines running between them.

The project has been split into different sections for discussion of the data collection, approach, analysis, interesting results, and more - see the contents on the left to jump to a section.

In general, communities form around common interests and bring people together. Online communities can take many shapes and forms in today's interconnected world. More specifically, communities may allow people to interact in social groups, and contribute to shared experiences.

Developer communities are no different: groups of like-minded individuals come together to learn, ask questions, debug, and build programs for a variety of purposes, using a variety of languages. These communities exist on sites such as Reddit, Stack Overflow, and Google Groups, to name a few. Developer and programming communities can take a macro approach (focusing on what good programming looks like, accessibility, and security across programming concepts in general) or a micro approach (usually focused on a specific language). These micro communities have interesting nuances of their own.

For this project, we explored the interactions between different programming communities on one particular platform: Reddit. We wanted to learn more about the micro communities and what they can tell us about online developer and programming communities. We also wanted to identify influencers and topics of interest in selected subreddits and to find similarities between those subreddits. To that end, our research questions were:

  1. Which subreddits have common users, who do they interact with, and who are important users in the network?
  2. What kind of posts (questions, news, etc.) do users post on each subreddit?

Scope, Data, and Preprocessing

We collected data from the 14 subreddits listed below using the Pushshift API. In total, we collected 312,851 records for the date range 1/1/2021–6/30/2021.

C_Programming COBOL Fortran Haskell HTML HTML5 JavaScript LaTeX learnjava learnpython LISP MATLAB perl Rlanguage

Next, we preprocessed the data by removing any records that were missing or incomplete (for example, records with “deleted” authors) and splitting the remaining records into separate edge and node data sheets for network analysis.
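The preprocessing step can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the record fields (`author`, `subreddit`, `parent_author`) are hypothetical, and the sketch assumes an edge means "user A replied to user B", which the write-up does not specify.

```python
from collections import Counter

# Hypothetical record layout: "parent_author" is None for a top-level post.
records = [
    {"author": "alice", "subreddit": "learnpython", "parent_author": None},
    {"author": "bob", "subreddit": "learnpython", "parent_author": "alice"},
    {"author": "[deleted]", "subreddit": "learnpython", "parent_author": "alice"},
    {"author": "carol", "subreddit": "perl", "parent_author": "[deleted]"},
    {"author": "alice", "subreddit": "perl", "parent_author": "carol"},
    {"author": None, "subreddit": "perl", "parent_author": "bob"},
]

# Values treated as a missing or deleted author (an assumption).
DELETED = {None, "", "[deleted]"}


def clean(records):
    """Drop records with a missing/deleted author or a missing subreddit."""
    return [
        r for r in records
        if r.get("author") not in DELETED and r.get("subreddit")
    ]


def build_nodes_and_edges(records):
    """Return (nodes, edges).

    nodes maps user -> number of records authored;
    edges maps (replier, parent_author) -> number of replies.
    """
    nodes = Counter()
    edges = Counter()
    for r in records:
        nodes[r["author"]] += 1
        parent = r.get("parent_author")
        # Skip edges to deleted users and self-replies.
        if parent not in DELETED and parent != r["author"]:
            edges[(r["author"], parent)] += 1
    return nodes, edges


cleaned = clean(records)
nodes, edges = build_nodes_and_edges(cleaned)
```

The two counters correspond to the node sheet and the edge sheet; each could be written out as a CSV file for a network analysis tool.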

We focused on aggregate rather than temporal analysis for the project. The initial batch of languages was based on languages we found in syllabi at George Mason University, but in general the languages were picked more or less at random, out of interest. We also made sure to include languages across the spectrum from new to mature. We selected the most active subreddit for each language and excluded “dead” languages (e.g., Algol, APL).


Analysis and Findings

Commonalities among Subreddit Communities

Below is the final network graph, which covers all 14 subreddits.

An image depicting the final network mapping. 14 concentric circles with lines running between them.

We initially wanted to see what else we could learn about our users, mainly through centrality measures such as PageRank. We identified each community within the map and labeled it accordingly. The Python and JavaScript subreddits account for a large number of the interactions between users.

Within each community, users are arranged in concentric circles moving outward from the center. Users in the center interact with that subreddit often, and those farther out do so less. Those right on the edge of the circle often have only one post or comment (extremely low engagement).

Programming Subreddit Descriptive Data

C_Programming

  • Created: 3/27/2008
  • Subscribers (Jul 2021): 112,000
  • Daily Comments: 78
  • Total Posts: 4,278
  • Users Cross-posting: 727

COBOL

  • Created: 6/5/2009
  • Subscribers (Jul 2021): 2,294
  • Daily Comments: 3
  • Total Posts: 133
  • Users Cross-posting: 36

Fortran

  • Created: 9/29/2009
  • Subscribers (Jul 2021): 5,989
  • Daily Comments: 11
  • Total Posts: 264
  • Users Cross-posting: 99

Haskell

  • Created: 1/25/2008
  • Subscribers (Jul 2021): 67,417
  • Daily Comments: 65
  • Total Posts: 2,068
  • Users Cross-posting: 320

HTML

  • Created: 9/5/2009
  • Subscribers (Jul 2021): 34,545
  • Daily Comments: 4
  • Total Posts: 1,284
  • Users Cross-posting: 271

HTML5

  • Created: 9/22/2009
  • Subscribers (Jul 2021): 39,348
  • Daily Comments: 1
  • Total Posts: 524
  • Users Cross-posting: 203

JavaScript

  • Created: 1/25/2008
  • Subscribers (Jul 2021): 1,720,113
  • Daily Comments: 57
  • Total Posts: 7,008
  • Users Cross-posting: 618

LaTeX

  • Created: 3/4/2008
  • Subscribers (Jul 2021): 35,613
  • Daily Comments: 5
  • Total Posts: 1,706
  • Users Cross-posting: 388

learnjava

  • Created: 1/27/2011
  • Subscribers (Jul 2021): 116,403
  • Daily Comments: 53
  • Total Posts: 3,341
  • Users Cross-posting: 528

learnpython

  • Created: 10/2/2009
  • Subscribers (Jul 2021): 571,031
  • Daily Comments: 525
  • Total Posts: 24,213
  • Users Cross-posting: 1,360

LISP

  • Created: 1/25/2008
  • Subscribers (Jul 2021): 33,291
  • Daily Comments: 43
  • Total Posts: 824
  • Users Cross-posting: 182

MATLAB

  • Created: 8/15/2009
  • Subscribers (Jul 2021): 43,046
  • Daily Comments: 18
  • Total Posts: 2,095
  • Users Cross-posting: 301

perl

  • Created: 1/25/2008
  • Subscribers (Jul 2021): 15,149
  • Daily Comments: 7
  • Total Posts: 501
  • Users Cross-posting: 76

Rlanguage

  • Created: 2/11/2011
  • Subscribers (Jul 2021): 26,494
  • Daily Comments: 8
  • Total Posts: 1,389
  • Users Cross-posting: 251

We realized that the size of each subreddit played an important role in the conversation. The number of posts and subscribers differs between the subreddits. The cards above show the number of subscribers, comments per day, total posts within the date range of our dataset, and the number of users who cross-posted to another subreddit. JavaScript was by far the largest, and COBOL the smallest, in terms of subscribers. Additionally, the Python subreddit (learnpython) averaged more comments per day than any other subreddit.

From the network analysis we conducted, Python also had the largest number of posts and cross-posts. Having established that users were cross-posting across subreddits, our next step was to determine the top cross-posted subreddits for the users of each subreddit.

One of our hypotheses was about cross-posting between subreddits based on the nature of the language. For example, HTML and JavaScript are often used together for front-end web design, so it would make sense that there would be shared users here.

However, what we discovered was that Python seems to be the strongest link between all the subreddits: it has a high PageRank score on our graph and also the largest number of users bridging their own subreddit and the Python subreddit.

Top 3 Connected Subreddits to Each Programming Language Subreddit

We found some interesting data to back up the claim of Python's predominance. The list above shows the top 3 other subreddits that users from a specific subreddit posted to. So, for C_Programming, users cross-posted most to Python, JavaScript, and finally Java.
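One way to compute such a top-3 ranking is sketched below. It assumes the data has been reduced to hypothetical (author, subreddit) pairs and counts, for each subreddit, how many of its users also posted in each other subreddit. The user names and the tie-breaking rule (alphabetical) are illustrative choices, not taken from the project.

```python
from collections import Counter, defaultdict

# Hypothetical (author, subreddit) pairs, one per post or comment.
posts = [
    ("u1", "C_Programming"), ("u1", "learnpython"),
    ("u2", "C_Programming"), ("u2", "learnpython"), ("u2", "JavaScript"),
    ("u3", "C_Programming"), ("u3", "JavaScript"), ("u3", "learnpython"),
    ("u4", "C_Programming"), ("u4", "learnjava"),
    ("u5", "COBOL"), ("u5", "learnpython"),
    ("u6", "C_Programming"),
]


def top_cross_posted(posts, k=3):
    """For each subreddit, list the k other subreddits sharing the most users."""
    # Collect the set of subreddits each user has posted in.
    subs_by_user = defaultdict(set)
    for author, sub in posts:
        subs_by_user[author].add(sub)

    # shared[a][b] = number of users active in both a and b.
    shared = defaultdict(Counter)
    for subs in subs_by_user.values():
        for a in subs:
            for b in subs:
                if a != b:
                    shared[a][b] += 1

    # Sort by descending user count, then by name so ties are deterministic.
    return {
        sub: [name for name, _ in
              sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for sub, counts in shared.items()
    }


result = top_cross_posted(posts)
print(result["C_Programming"])  # ['learnpython', 'JavaScript', 'learnjava']
```

Users who post in only one subreddit (such as `u6` here) contribute nothing to the counts, so the ranking reflects cross-posting users only.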

Users in 8 of our 14 subreddits cross-posted most to Python. In fact, Python was ranked within the top 3 of every subreddit we analyzed. Our hypothesis based on reviewing this data is that it speaks to how ubiquitous Python has become in programming today. It is taught in schools, bootcamps, and self-learning courses; has a relatively simple learning curve compared to other languages; is used widely in industry; and interfaces well with other languages.

On closer analysis, we found that much of the discussion in more mature language subreddits such as COBOL, Fortran, and LISP involved troubleshooting how to get Python functionalities to interface with their legacy code/hardware, in addition to language migration queries.

Influencer Profiles

We calculated several centrality measures for the users in the network, including betweenness, closeness, and PageRank. Of our top influencers, we found several that had interesting profiles. Overall, they interact with users often, helping them get their code working correctly and providing additional resources that would otherwise not have been available to those users. Some of these top influencers also link to content on other sites, which suggests some awareness of cross-posting across sites as well.
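To illustrate how a PageRank score identifies influential users, here is a minimal power-iteration sketch on a toy reply graph. The edge list, the damping factor of 0.85, and the convergence settings are assumptions for illustration; the write-up does not say which tool or parameters were used, and graph libraries such as NetworkX provide these measures ready-made.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Compute PageRank by power iteration.

    edges: iterable of (source, target) pairs, e.g. "source replied to target".
    Returns a dict mapping node -> score; scores sum to 1.
    """
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        # Each node passes its rank equally along its outgoing links.
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy graph: three users reply to "helper", who replies back to one of them.
toy_edges = [
    ("a", "helper"), ("b", "helper"), ("c", "helper"),
    ("helper", "a"), ("c", "b"),
]
scores = pagerank(toy_edges)
top_user = max(scores, key=scores.get)
```

In this toy graph `top_user` is `"helper"`, the user who receives the most replies from others.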

We found one user with an extremely high score for betweenness and PageRank, a high measure for closeness, and the largest degree of all our users. This user, as it turned out, was actually a bot known as AutoModerator, often used by subreddits as a management tool. The intention of spreading shared community guidelines and rules across multiple communities seems to work - if we needed to spread a message across all 14 subreddits, this user would be our pick.

What are users posting about? Our Topic Modeling Methodology

A word cloud representing the most common words across subreddits.
A word cloud representing the most common words across subreddits with further processing.

For our topic modeling attempt, we used Latent Dirichlet Allocation (LDA), which sorts the text into a fixed number of bins to produce general grouped topics. Based on trial and error and empirical observation of several subreddits, we expected about four topics: Tech Help, General Language, Jobs and Advice, and Off-Topic. Because LDA is unsupervised, we reasoned that it might be better at finding hidden subjects or other patterns that we would not have tagged manually under a different approach.
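For readers unfamiliar with how LDA sorts text into a fixed number of topics, the sketch below fits a tiny model by collapsed Gibbs sampling. It is a from-scratch illustration, not the implementation used in the project (the write-up does not name one), and the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs: list of token lists.
    Returns (doc_topic, topic_word): doc_topic[d][k] counts the tokens of
    document d assigned to topic k; topic_word[k] counts words in topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assign = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove this token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # Weight of each topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Toy corpus: two "tech help" posts and two "jobs" posts.
docs = [
    "error traceback fix code error".split(),
    "code bug fix traceback".split(),
    "job interview resume career job".split(),
    "career advice resume interview".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(topic_word):
    print(k, [w for w, _ in words.most_common(3)])
```

The number of topics is fixed up front, which is why the expected topic count matters: the sampler will always fill exactly that many bins, whether or not the data naturally has that many themes.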

4 generalized word clouds for the topics generated across all the data using LDA.

Overall Findings

Overall, it is difficult to generalize subreddits, even within the same domain.

  1. High-level language communities seem to be the most popular.
  2. A few users are very active.
  3. New languages are more commonly discussed.
  4. Topics were specific to each subreddit with some overlap.
  5. Even with overlap, the proportion of posts in similar topics varied considerably.