Querying online datasets
Introduction
Reddit and Stackoverflow are very useful datasets, the same goes for Wikipedia, and it's very interesting to have the ability to write queries that run on large datasets simply because you're able to search for things in ways you couldn't before with just a search box.
Setting up the infrastructure required to load/collect and index data, designing the database schema and importing data are steps that require quite some time and effort (this can quickly become a fulltime job, and in the case of large databases may even require a dedicated team that only takes care of this process).
If this is already taken care of and available through a service such as BigQuery (that even comes with public datasets already loaded), you can start analyzing/exploring the data right away.
Google BigQuery & Reddit data
In order to use BigQuery you need to create a project in the development console and the open one of the tables in the public dataset.
Without signing up to a free trial, or without signing up for a paid plan BigQuery will let you run a certain number of search queries per day, but even with that limitation in place, this is still quite useful.
I thought about finding subreddits where users feel more excited about the posts we're interested in than what they usually see on that subreddit. However, that caused the most relevant subs to fall beneath the average (probably because users of the most relevant sub are not so excited about posts on the most relevant topic in that subreddit).
Another way to rank them is: higher ranking for subreddits with low posts/subscribers ratio since the time a post stays on the first page is longer (because it doesn't get pushed down so fast by new posts) and it gets viewed more (since there are many subscribers).
SELECT F.*, CONCAT("http://reddit.com/r/", F.display_name) AS sub_url FROM ( -- delta_over_average measures how much above the subreddit average SELECT E.display_name AS display_name, E.all_posts_cnt AS all_posts_cnt, E.matched_posts_cnt AS matched_posts_cnt, E.matched_avg_vote AS matched_avg_vote, E.all_avg_vote AS all_avg_vote, E.subscribers AS subscribers, (E.all_posts_cnt/E.subscribers) AS posts_subscribers_ratio, (((E.matched_avg_vote - E.all_avg_vote)/E.all_avg_vote)*100.0) AS delta_over_average FROM ( SELECT subs.display_name AS display_name, matching.cnt AS matched_posts_cnt, subreddit_counts.cnt AS all_posts_cnt, matching.matched_avg_vote as matched_avg_vote, subreddit_counts.all_avg_vote as all_avg_vote, subs.subscribers as subscribers FROM ( -- find all subreddits that have posts matching the keywords -- and stats for each of those subreddits SELECT subreddit, COUNT(id) AS cnt, AVG(ups - downs) AS matched_avg_vote FROM [fh-bigquery:reddit_posts.full_corpus_201509] WHERE REGEXP_MATCH( LOWER(selftext), '(postgis|qgis|openstreetmap|spatial query|spatial index)' ) GROUP BY subreddit ) matching JOIN ( -- all subreddits and stats for each SELECT subreddit, COUNT(id) as cnt, AVG(ups - downs) AS all_avg_vote FROM [fh-bigquery:reddit_posts.full_corpus_201509] GROUP BY subreddit ) subreddit_counts ON matching.subreddit = subreddit_counts.subreddit JOIN ( -- all subreddits with a minimum number of subscribers SELECT * FROM [fh-bigquery:reddit.subreddits_201509] WHERE subscribers > 100 ) subs -- make sure the post is from that subreddit ON subreddit_counts.subreddit = subs.display_name ) E WHERE E.matched_posts_cnt > 10 ---- uncommenting this would only select subreddits where the matching posts ---- are voted better than the average post ---- for that subreddit -- AND E.matched_avg_vote >= E.all_avg_vote ) F WHERE F.posts_subscribers_ratio > 0 AND F.all_posts_cnt > 50 -- ORDER BY F.posts_subscribers_ratio ASC; ORDER BY F.delta_over_average DESC;
Something more useful would also use some insight from Reddit's hot formula but I'm looking for a low cost solution (on top of this impediment, you would also need the vote data table in the reddit dataset, which is currently unavailable in the sample bigquery datasets, however.. a full data dump is available here).
There's two problems that this query does not address:
- subreddits whose description matches the keywords
- subreddits that have posts whose comments match the keywords
Stackexchange Data explorer & Stackoverflow data
Stackoverflow is currently the largest Q&A website for programming-related questions. It also gives users the ability to write SQL queries that run on its datasets. That's quite interesting because you can write queries and search for questions in different ways.
For example, I was interested in how big tech companies use certain technologies or what sort of problems they're solving. So I wrote a query for this:
SELECT TOP 1000 U.profile_url, U.reputation, PL.PostId as [Post Link], * FROM ( SELECT TOP 800 ('http://stackoverflow.com/users/'+CAST(id AS nvarchar)) as profile_url, reputation, id FROM users WHERE ( LOWER(aboutme) LIKE '%at google%' OR LOWER(aboutme) LIKE '%at facebook%' OR LOWER(aboutme) LIKE '%at microsoft%' OR LOWER(aboutme) LIKE '%ex-google%' ) AND reputation > 1000 -- ORDER BY reputation DESC ) U JOIN Posts P ON P.OwnerUserId = U.id JOIN PostLinks PL ON PL.PostId = P.Id JOIN PostTags PT ON PT.PostId = P.Id JOIN Tags T ON T.Id = PT.TagId AND T.TagName IN ('postgis', 'gis', 'openstreetmap');