In this post I ramble about failing to find new niches, skills, and new kinds of business areas in the data I'm messing with.

First I made a small table in my database, like this:

column name    type
id             Integer
name           String
json           JSONB
ts             Integer

I used JSONB [1] as the data type to store the weekly word counts, since I wasn't sure whether I'd count 2-word grams or 1-word grams. I settled on just counting single words.

To make the word cloud I take two consecutive weeks and compute a score for each word that emphasizes its growth, so a word shows up larger if it had a big increase over the previous week.
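To make the scoring idea concrete, here is a minimal sketch of one way a week-over-week growth score could be computed. The function name `growth_scores`, the `smoothing` parameter, and the sample counts are all made up for illustration; this is not the exact formula I used, just a smoothed ratio that behaves the way described above.

```python
from collections import Counter

def growth_scores(prev_week, this_week, smoothing=1.0):
    """Score each word in this_week by its relative growth over prev_week.

    Hypothetical scoring: a ratio of smoothed counts. The smoothing term
    avoids dividing by zero for words that did not appear last week.
    """
    scores = {}
    for word, count in this_week.items():
        before = prev_week.get(word, 0)
        scores[word] = (count + smoothing) / (before + smoothing)
    return scores

# Made-up weekly counts, shaped like the per-week JSON payloads.
prev_week = Counter({"python": 40, "docker": 10, "excel": 30})
this_week = Counter({"python": 44, "docker": 32, "excel": 29, "rust": 5})

scores = growth_scores(prev_week, this_week)
# Words ordered from biggest to smallest growth; the score would drive font size.
top = sorted(scores, key=scores.get, reverse=True)
```

With a ratio like this, a word that is brand new this week can outrank everything else even with a tiny count, which is probably one reason a growth-based cloud ends up noisy.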

Someone recommended normalizing the words with nltk.stem.snowball.EnglishStemmer and nltk.wordnet.morphy. This led to fewer words (about 125,000 unique words), but a lot of them still made no sense since they weren't actual words (they should've been checked against a dictionary/lexicon).
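The reason normalization shrinks the vocabulary is that several surface forms collapse onto one key and their counts get summed. A rough sketch of that merge step, assuming the normalizer is any function from word to word: `merge_counts` and `toy_normalize` are hypothetical names, and `toy_normalize` is only a stand-in so the example runs without extra packages (a real stemmer's `stem` method could be passed in its place).

```python
from collections import Counter

def merge_counts(counts, normalize):
    """Collapse word counts onto their normalized forms, summing duplicates."""
    merged = Counter()
    for word, count in counts.items():
        merged[normalize(word)] += count
    return merged

def toy_normalize(word):
    # Stand-in for a real stemmer (assumption): lowercase the word and
    # strip a single trailing "s" from words longer than three letters.
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

raw = Counter({"Servers": 3, "server": 5, "Database": 2, "databases": 4})
merged = merge_counts(raw, toy_normalize)
# Four raw keys collapse into two normalized keys.
```

Note that this only merges variants; it does nothing to remove strings that were never words in the first place.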

Anyway, this is what the word cloud looked like (I used d3-cloud to draw it):

What a mess ..

I was thinking of grabbing a larger lexicon of the English language, so I could at least check words against it to make sure they're valid and not just random strings (there are a lot of misspellings in the data).
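The lexicon check itself would be simple. A small sketch, assuming the lexicon is already loaded into a Python set of lowercase words; `split_by_lexicon` and the sample data are hypothetical, and I never ran this on the real dataset.

```python
def split_by_lexicon(counts, lexicon):
    """Split word counts into words found in the lexicon and words that are not."""
    known, unknown = {}, {}
    for word, count in counts.items():
        # Membership test is case-insensitive; the lexicon holds lowercase words.
        target = known if word.lower() in lexicon else unknown
        target[word] = count
    return known, unknown

# Tiny made-up lexicon and counts, just to show the shape of the result.
lexicon = {"server", "database", "manager"}
counts = {"server": 8, "databse": 1, "manager": 3, "kubernetes": 6}

known, unknown = split_by_lexicon(counts, lexicon)
```

The catch is visible even in this toy data: the unknown bucket mixes plain misspellings with technology names that a general English lexicon would not contain.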

I think the technology names (libraries/frameworks) would be in D2 without D1.

Eventually I found such a lexicon here, except it's from 2013 and things have evolved since then. I didn't use the lexicon.

The problem with a word cloud is that you try to cram a lot of information into it: how many words can you really fit in a 1000x1000 rectangle? (The author of d3-cloud warns about this by linking to this article.) The layout algorithm also uses randomized positions when it tries to fit your words in, so every time you build the word cloud you get a different one, even if your data is the same.

I'll look some more into this but without the word clouds.

Footnotes:

[1] The cool part about Postgres having the JSONB data type is that you're not forced into a fixed structure for your dataset, but you can still put GIN indexes on JSONB columns, and Postgres gives you functions to operate on that JSON data.