I was recently looking into various ways of embedding unlabeled, high-dimensional data in 2 dimensions for visualization. A wide variety of methods have been proposed for this task. This review paper from 2009 contains nice references to many of them (PCA, Kernel PCA, Isomap, LLE, Autoencoders, etc.). If you have Matlab available, the Dimensionality Reduction Toolbox has nice implementations of many of these methods. Scikit-learn also has a brief section on Manifold Learning along with implementations.
Long story short, I’ve implemented t-SNE in JS, released it as tsnejs on Github, and created a small demo that uses the library to visualize the top Twitter accounts based on what they talk about. In this post, I thought it might be fun to document a small 1-day project like this from beginning to end. This also gives me an opportunity to describe some of my project toolkit, which others might find useful.
First, take a look at the final demo. To create this demo I found the top 500 most followed accounts on Twitter, downloaded 200 of their tweets and then measured differences in what they tweet about. These differences are then fed to t-SNE to produce a 2-dimensional visualization, where nearby people tweet similar things. Fun!
Fetching top tweeps
We first have to identify the top 500 tweeps. I googled “top twitter accounts” and found http://twitaholic.com/, which lists them out. However, the accounts are embedded in the webpage and we need to extract them in a structured format. For this, I love a recent YC startup Kimono; I use it extensively to scrape structured data from websites. It lets you click the elements of interest (the Twitter handles in this case) and extracts them out in JSON. Easy as pie!
Now we have a list of top 500 tweeps and we’d like to obtain their tweets to get an idea about what they tweet about. My library of choice for this task is Tweepy. Their documentation is quite terrible but if you browse the source code things seem relatively simple. Here’s an example call to get 200 tweets for a given user:
tweets = tweepy.Cursor(api.user_timeline, screen_name=user).items(200)
We iterate this over all users, extract the tweet text, and dump it all into files, one per account. I had to be careful with two annoyances in the process:
- Twitter puts severe rate limits on API calls, so this actually took several hours to collect, with every call wrapped up in a try/except block to survive the occasional error, and
- The returned text is in Unicode, which leads to trouble if you try to write it straight to a file.
One solution for the second annoyance is to use the codecs library:
import codecs
codecs.open(filename, 'w', 'utf-8').write(tweet_texts)
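To make the retry logic concrete, here is a minimal sketch of the collection loop. The helper name `collect_with_retries` and the backoff parameters are mine, not part of tweepy; the tweepy call in the comment is the one from above:

```python
import time

def collect_with_retries(fetch, users, max_retries=3, wait=1.0):
    """Call fetch(user) for each user, retrying on exceptions
    (e.g. Twitter rate-limit errors), and collect the results."""
    texts = {}
    for user in users:
        for _ in range(max_retries):
            try:
                texts[user] = fetch(user)
                break
            except Exception:
                time.sleep(wait)  # back off before retrying
        else:
            texts[user] = ""  # give up on this user after max_retries
    return texts

# With tweepy, fetch might look like:
# fetch = lambda u: "\n".join(
#     t.text for t in tweepy.Cursor(api.user_timeline, screen_name=u).items(200))
```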
Oh, and let’s also grab and save the Twitter profile pictures, which we’ll use in the visualization. An example for one user might be:
import urllib  # yes I know this is deprecated
userobj = api.get_user(screen_name=user)
urllib.urlretrieve(userobj.profile_image_url, imgname)  # save image to disk
I should mention that I write a lot of quick and dirty Python code in IPython Notebooks, which I very warmly recommend. If you’re writing all your Python in text editors, you’re seriously missing out.
Quantifying Tweep differences
We now have 500 tweeps and their 200 most recent tweets concatenated in 500 files. We’d now like to find who tweets about similar things. Scikit-learn is very nice for quick NLP tasks like this. In particular, we load up all the files and create a 500-element array where every element is one user’s 200 concatenated tweets. Then we use the TfidfVectorizer class to extract all words and bigrams from the text data and turn every user’s language into one tfidf vector. This vector is a fingerprint of the language each person uses. Here’s how we can wire this up:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=2, stop_words='english',
                             strip_accents='unicode', lowercase=True,
                             ngram_range=(1, 2), norm='l2', smooth_idf=True,
                             sublinear_tf=False, use_idf=True)
X = vectorizer.fit_transform(user_language_array)
D = -(X * X.T).todense()  # "distance" matrix: negative dot products between tfidf vectors
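A note on why the negative dot product works as a distance here: since the tfidf rows are l2-normalized (norm='l2'), the dot product between two rows is exactly their cosine similarity, so negating it gives a distance-like score. A quick pure-Python illustration (helper names are mine):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([1.0, 2.0, 0.0])
b = l2_normalize([2.0, 4.0, 0.0])  # same direction as a
c = l2_normalize([0.0, 0.0, 5.0])  # orthogonal to a

dot(a, b)  # ≈ 1.0: identical language fingerprints
dot(a, c)  # 0.0: completely unrelated fingerprints
```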
In the above, user_language_array is the 500-element array that holds the concatenated tweets. The TfidfVectorizer class looks through all tweets and takes note of all words (unigrams) and word bigrams (i.e. sequences of two words). It builds a dictionary out of all unigrams/bigrams and essentially counts up how often every person uses each one. The tfidf vectors are returned stacked up as rows inside X, which has size 500 x (number of unigrams/bigrams in the dictionary).
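To make the unigram/bigram idea concrete, here is a tiny pure-Python sketch of the kind of tokenization the vectorizer performs internally (an illustrative helper of mine, not sklearn’s actual analyzer):

```python
def unigrams_and_bigrams(text):
    """Split text into words (unigrams) and adjacent word pairs (bigrams)."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

unigrams_and_bigrams("Deep learning is fun")
# → ['deep', 'learning', 'is', 'fun', 'deep learning', 'learning is', 'is fun']
```

The vectorizer then counts how often each user emits each of these tokens, and reweights the counts by inverse document frequency.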
The Visualization parts
We now create an .html file and import jQuery (as always) and d3js, which I like to use for any kind of plotting. We load up the JSON that stores our distances and usernames with jQuery, and use d3js to initialize the SVG element that will hold all the users. For starters, we plot the users at random positions, but we will soon arrange them with t-SNE so that similar users cluster nearby. Inspect the code on the demo page (Ctrl+U) to see the jQuery and d3js parts. In the code, there are a few things I like to use:
- I like to use Google Fonts to get prettier-than-default fonts. Here, for example I’m importing Roboto, and then using it in the CSS.
- Next, we see an import of syntaxhighlighter code which dynamically highlights code on your page.
- Then we see Google tracking JS code, which lets me track statistics for the website on Google Analytics.
- I didn’t use Bootstrap on this website because it’s very small and simple, but normally I would, since it makes your website work nicely on mobile right away.
Finally we get to the meat! We need to arrange the users in our d3js plot so that similar users appear nearby. The t-SNE cost function was described in this 2008 paper by van der Maaten and Hinton. Similar to many other methods, we set up two distance metrics in the original and the embedded space and minimize their difference. In t-SNE in particular, the original space distance is based on a Gaussian distribution and the embedded space is based on the heavy-tailed Student-t distribution. The KL-divergence formulation has the nice property that it is asymmetric in how it penalizes distances between the two spaces:
- If two points are close in the original space, there is a strong attractive force between the points in the embedding
- Conversely, if the two points are far apart in the original space, the algorithm is relatively free to place these points around.
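For reference, the cost minimized is the KL divergence between the pairwise affinities P (Gaussian-based, in the original space) and Q (Student-t-based, in the embedding), as given in the 2008 paper:

```latex
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} \propto \left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}
```

The asymmetry of the bullets above falls out of this formula: a large $p_{ij}$ paired with a small $q_{ij}$ is penalized heavily, while a small $p_{ij}$ contributes almost nothing no matter where the embedded points end up.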
The meat of the tsnejs library is in costGrad(), which computes the cost function and the gradient of the objective. The correct implementation of this function is double-checked with a numerical gradient check. To compute the embedding, the demo calls the library’s step() function repeatedly (in a setInterval() call), and we plot the solution as it gets computed.
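The gradient-check idea is worth spelling out: you compare the analytic gradient against a finite-difference approximation, and the two should agree closely if costGrad-style code is correct. Here is a generic sketch (tsnejs itself is JavaScript; the function names and the stand-in objective below are made up for illustration):

```python
def analytic_grad(x):
    """Hand-derived gradient of f(x) = sum(x_i^2), a stand-in objective."""
    return [2.0 * xi for xi in x]

def numerical_grad(f, x, eps=1e-5):
    """Central-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += eps
        xm = list(x); xm[i] -= eps
        grad.append((f(xp) - f(xm)) / (2 * eps))
    return grad

f = lambda x: sum(xi * xi for xi in x)
x = [0.5, -1.2, 3.0]
# If the analytic gradient is implemented correctly, the two
# gradients agree to within roughly eps**2 plus rounding error.
```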
Phew! Final result, again: t-SNE demo.
I hope some of the references were useful. If you use tsnejs to embed some of your data, let me know!
Bonus: Word Embedding t-SNE Visualization
I created another demo, this time to visualize word vector embeddings. Head over here to see it. The word embeddings are trained as described in this ACL 2012 paper.
The (unsupervised) objective function makes it so that words that are interchangeable (i.e. occur in very similar surrounding contexts) end up close in the embedding. This comes across in the visualization!