Introduction

Reddit and Stackoverflow are very useful datasets, the same goes for Wikipedia, and it's very interesting to have the ability to write queries that run on large datasets simply because you're able to search for things in ways you couldn't before with just a search box.

Setting up the infrastructure required to load/collect and index data, designing the database schema and importing data are steps that require quite some time and effort (this can quickly become a fulltime job, and in the case of large databases may even require a dedicated team that only takes care of this process).

If this is already taken care of and available through a service such as BigQuery (that even comes with public datasets already loaded), you can start analyzing/exploring the data right away.

Google BigQuery & Reddit data

In order to use BigQuery you need to create a project in the development console and the open one of the tables in the public dataset.

Without signing up to a free trial, or without signing up for a paid plan BigQuery will let you run a certain number of search queries per day, but even with that limitation in place, this is still quite useful.

I thought about finding subreddits where users feel more excited about the posts we're interested in than what they usually see on that subreddit. However, that caused the most relevant subs to fall beneath the average (probably because users of the most relevant sub are not so excited about posts on the most relevant topic in that subreddit).

Another way to rank them is: higher ranking for subreddits with low posts/subscribers ratio since the time a post stays on the first page is longer (because it doesn't get pushed down so fast by new posts) and it gets viewed more (since there are many subscribers).

SELECT
F.*, CONCAT("http://reddit.com/r/", F.display_name) AS sub_url
FROM (
  -- delta_over_average measures how much above the subreddit average
  SELECT 
  E.display_name      AS display_name,
  E.all_posts_cnt     AS all_posts_cnt,
  E.matched_posts_cnt AS matched_posts_cnt,
  E.matched_avg_vote  AS matched_avg_vote,
  E.all_avg_vote      AS all_avg_vote,
  E.subscribers       AS subscribers,
  (E.all_posts_cnt/E.subscribers) AS posts_subscribers_ratio,
  (((E.matched_avg_vote - E.all_avg_vote)/E.all_avg_vote)*100.0) AS delta_over_average
  FROM (
    SELECT
    subs.display_name AS display_name,
    matching.cnt AS matched_posts_cnt,
    subreddit_counts.cnt AS all_posts_cnt,
    matching.matched_avg_vote as matched_avg_vote,
    subreddit_counts.all_avg_vote as all_avg_vote,
    subs.subscribers as subscribers
    FROM (
      -- find all subreddits that have posts matching the keywords
      -- and stats for each of those subreddits
      SELECT subreddit, COUNT(id) AS cnt, AVG(ups - downs) AS matched_avg_vote
      FROM [fh-bigquery:reddit_posts.full_corpus_201509]
      WHERE REGEXP_MATCH(
              LOWER(selftext), 
              '(postgis|qgis|openstreetmap|spatial query|spatial index)'
            )
      GROUP BY subreddit
    ) matching
    JOIN ( 
      -- all subreddits and stats for each
      SELECT
      subreddit,
      COUNT(id) as cnt,
      AVG(ups - downs) AS all_avg_vote
      FROM [fh-bigquery:reddit_posts.full_corpus_201509]
      GROUP BY subreddit
    ) subreddit_counts ON matching.subreddit = subreddit_counts.subreddit
    JOIN (
         -- all subreddits with a minimum number of subscribers
         SELECT
         *
         FROM [fh-bigquery:reddit.subreddits_201509]
         WHERE subscribers > 100
    ) subs
    -- make sure the post is from that subreddit
    ON subreddit_counts.subreddit = subs.display_name
  ) E
  WHERE
  E.matched_posts_cnt > 10
  ---- uncommenting this would only select subreddits where the matching posts
  ---- are voted better than the average post
  ---- for that subreddit
  -- AND E.matched_avg_vote >= E.all_avg_vote
) F
WHERE F.posts_subscribers_ratio > 0 AND F.all_posts_cnt > 50
-- ORDER BY F.posts_subscribers_ratio ASC;
ORDER BY F.delta_over_average DESC;

Something more useful would also use some insight from Reddit's hot formula but I'm looking for a low cost solution (on top of this impediment, you would also need the vote data table in the reddit dataset, which is currently unavailable in the sample bigquery datasets, however.. a full data dump is available here).

There's two problems that this query does not address:

  • subreddits whose description matches the keywords
  • subreddits that have posts whose comments match the keywords

Stackexchange Data explorer & Stackoverflow data

Stackoverflow is currently the largest Q&A website for programming-related questions. It also gives users the ability to write SQL queries that run on its datasets. That's quite interesting because you can write queries and search for questions in different ways.

For example, I was interested in how big tech companies use certain technologies or what sort of problems they're solving. So I wrote a query for this:

SELECT
TOP 1000
U.profile_url,
U.reputation,
PL.PostId as [Post Link],
*
FROM (
    SELECT
    TOP 800
    ('http://stackoverflow.com/users/'+CAST(id AS nvarchar)) as profile_url,
    reputation,
    id
    FROM users
    WHERE
    (
      LOWER(aboutme) LIKE '%at google%' OR
      LOWER(aboutme) LIKE '%at facebook%' OR
      LOWER(aboutme) LIKE '%at microsoft%' OR
      LOWER(aboutme) LIKE '%ex-google%'
    ) AND
    reputation > 1000
    -- ORDER BY reputation DESC
  ) U
JOIN Posts P
ON P.OwnerUserId = U.id
JOIN PostLinks PL
ON PL.PostId = P.Id
JOIN PostTags PT
ON PT.PostId = P.Id
JOIN Tags T
ON T.Id = PT.TagId AND T.TagName IN
('postgis', 'gis', 'openstreetmap');