Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.
They are a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.
In this post, you will discover the word embedding approach for representing text data.
After completing this post, you will know:
- What the word embedding approach for representing text is and how it differs from other feature extraction methods.
- That there are 3 main algorithms for learning a word embedding from text data.
- That you can either train a new embedding or use a pre-trained embedding on your natural language processing task.
Let’s get started.
This post is divided into 3 parts; they are:
- What Are Word Embeddings?
- Word Embedding Algorithms
- Using Word Embeddings
What Are Word Embeddings?
A word embedding is a learned representation for text where words that have the same meaning have a similar representation.
It is this approach to representing words and documents that may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.
One of the benefits of using dense and low-dimensional vectors is computational: the majority of neural network toolkits do not play well with very high-dimensional, sparse vectors. … The main benefit of the dense representations is generalization power: if we believe some features may provide similar clues, it is worthwhile to provide a representation that is able to capture these similarities.
— Page 92, Neural Network Methods in Natural Language Processing, 2017.
Word embeddings are in fact a class of techniques where individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector and the vector values are learned in a way that resembles a neural network, and hence the technique is often lumped into the field of deep learning.
Key to the approach is the idea of using a dense distributed representation for each word.
Each word is represented by a real-valued vector, often tens or hundreds of dimensions. This is contrasted to the thousands or millions of dimensions required for sparse word representations, such as a one-hot encoding.
associate with each word in the vocabulary a distributed word feature vector … The feature vector represents different aspects of the word: each word is associated with a point in a vector space. The number of features … is much smaller than the size of the vocabulary
— A Neural Probabilistic Language Model, 2003.
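To make the contrast between sparse and dense representations concrete, here is a minimal sketch in plain Python. The five-word vocabulary, the dimension of 3, and the helper names (`one_hot`, `embed`, `one_hot_times_table`) are assumptions made for illustration, and the random numbers only stand in for values that a real model would learn.

```python
import random

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Sparse representation: one dimension per vocabulary word."""
    vec = [0.0] * len(vocab)
    vec[word_to_index[word]] = 1.0
    return vec

# Dense representation: a (vocabulary size x dim) table of real values.
# Random values stand in for learned ones in this sketch.
random.seed(0)
dim = 3
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(dim)] for _ in vocab]

def embed(word):
    """Look up the dense vector for a word."""
    return embedding_table[word_to_index[word]]

def one_hot_times_table(word):
    """Multiplying a one-hot vector by the table selects a single row."""
    hot = one_hot(word)
    return [
        sum(hot[i] * embedding_table[i][j] for i in range(len(vocab)))
        for j in range(dim)
    ]

print(one_hot("cat"))   # length grows with the vocabulary
print(embed("cat"))     # length is fixed at dim
```

The one-hot vector has as many entries as there are words in the vocabulary, whereas the dense vector stays at the chosen dimension no matter how large the vocabulary becomes.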
The distributed representation is learned based on the usage of words. This allows words that are used in similar ways to result in having similar representations, naturally capturing their meaning. This can be contrasted with the crisp but fragile representation in a bag of words model where, unless explicitly managed, different words have different representations, regardless of how they are used.
There is deeper linguistic theory behind the approach, namely the “distributional hypothesis” by Zellig Harris, which can be summarized as: words that appear in similar contexts will have similar meanings. For more depth, see Harris’ 1954 paper “Distributional structure”.
This notion of letting the usage of the word define its meaning can be summarized by an oft repeated quip by John Firth:
You shall know a word by the company it keeps!
— Page 11, “A synopsis of linguistic theory 1930-1955”, in Studies in Linguistic Analysis 1930-1955, 1962.
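The idea that usage defines meaning can be illustrated with a small sketch. It does not learn an embedding; it only counts the neighbors of each word in a made-up toy corpus and compares the resulting count vectors with cosine similarity. The corpus, the window size, and the function names are all assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict

# Made-up toy corpus for illustration only.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cats",
    "stocks fell on monday",
]

def context_counts(sentences, window=1):
    """Count, for every word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

counts = context_counts(corpus)
print(cosine(counts["cat"], counts["dog"]))     # shared contexts, high similarity
print(cosine(counts["cat"], counts["stocks"]))  # no shared contexts
```

In this toy corpus “cat” and “dog” keep the same company, so their context vectors are close, while “cat” and “stocks” share no neighbors at all.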
Word Embedding Algorithms
Word embedding methods learn a real-valued vector representation for a predefined, fixed-size vocabulary from a corpus of text.
The learning process is either joint with the neural network model on some task, such as document classification, or is an unsupervised process, using document statistics.
This section reviews three techniques that can be used to learn a word embedding from text data.
1. Embedding Layer
An embedding layer, for lack of a better name, is a word embedding that is learned jointly with a neural network model on a specific natural language processing task, such as language modeling or document classification.
It requires that document text be cleaned and prepared such that each word is one-hot encoded. The size of the vector space is specified as part of the model, such as 50, 100, or 300 dimensions. The vectors are initialized with small random numbers. The embedding layer is used on the front end of a neural network and is fit in a supervised way using the Backpropagation algorithm.
… when the input to a neural network contains symbolic categorical features (e.g. features that take one of k distinct symbols, such as words from a closed vocabulary), it is common to associate each possible feature value (i.e., each word in the vocabulary) with a d-dimensional vector for some d. These vectors are then considered parameters of the model, and are trained jointly with the other parameters.
— Page 49, Neural Network Methods in Natural Language Processing, 2017.
The one-hot encoded words are mapped to the word vectors. If a Multilayer Perceptron model is used, then the word vectors are concatenated before being fed as input to the model. If a recurrent neural network is used, then each word may be taken as one input in a sequence.
This approach of learning an embedding layer requires a lot of training data and can be slow, but will learn an embedding both targeted to the specific text data and the NLP task.
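The mapping from words to vectors, and the two ways of feeding those vectors to a model, can be sketched as follows. This is a simplified illustration in plain Python rather than any particular library's embedding layer: the tiny vocabulary, the `<unk>` token, the dimension of 4, and the helper names are assumptions, and no training is performed.

```python
import random

random.seed(1)

# Hypothetical task vocabulary; index 0 is reserved for unknown words.
vocab = {"<unk>": 0, "good": 1, "bad": 2, "movie": 3, "very": 4}
embedding_dim = 4

# The vectors start as small random numbers; training would adjust them.
embeddings = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]

def encode(text):
    """Turn text into integer indices (equivalent to positions of the 1 in one-hot vectors)."""
    return [vocab.get(token, vocab["<unk>"]) for token in text.lower().split()]

def as_sequence(indices):
    """One vector per word, as a recurrent network would consume them step by step."""
    return [embeddings[i] for i in indices]

def as_concatenated(indices):
    """One flat vector, as a Multilayer Perceptron with fixed-length input would consume it."""
    return [value for i in indices for value in embeddings[i]]

indices = encode("very good movie")
sequence = as_sequence(indices)
flat = as_concatenated(indices)

print(indices)
print(len(sequence), len(sequence[0]))  # number of words, embedding_dim
print(len(flat))                        # number of words * embedding_dim
```

In a real model, the values in `embeddings` would be treated as parameters and updated by backpropagation together with the rest of the network.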
2. Word2Vec
Word2Vec is a statistical method for efficiently learning a standalone word embedding from a text corpus.
It was developed by Tomas Mikolov, et al. at Google in 2013 in an effort to make the neural-network-based training of the embedding more efficient, and since then it has become the de facto standard for developing pre-trained word embeddings.
Additionally, the work involved analysis of the learned vectors and the exploration of vector math on the representations of words. For example, subtracting the “man-ness” from “King” and adding “woman-ness” results in the word “Queen”, capturing the analogy “king is to queen as man is to woman”.
We find that these representations are surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset. This allows vector-oriented reasoning based on the offsets between words. For example, the male/female relationship is automatically learned, and with the induced vector representations, “King – Man + Woman” results in a vector very close to “Queen.”
— Linguistic Regularities in Continuous Space Word Representations, 2013.
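The vector arithmetic described above can be sketched with a handful of hand-made vectors. These are not learned embeddings: the two dimensions (roughly “royalty” and “gender”), the values, and the `analogy` helper are assumptions constructed so that the offsets line up exactly, which real embeddings only do approximately.

```python
import math

# Hand-made toy vectors, NOT learned:
# dimension 0 stands for "royalty", dimension 1 for "gender".
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], target))

print(analogy("king", "man", "woman"))  # queen
```

With learned embeddings the result of the arithmetic rarely lands exactly on a word vector, which is why the nearest neighbor by cosine similarity is used.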
Two different learning models were introduced that can be used as part of the word2vec approach to learn the word embedding; they are:
- Continuous Bag-of-Words, or CBOW model.
- Continuous Skip-Gram Model.
The CBOW model learns the embedding by predicting the current word based on its context. The continuous skip-gram model learns by predicting the surrounding words given a current word.
Both models are focused on learning about words given their local usage context, where the context is defined by a window of neighboring words. This window is a configurable parameter of the model.
The size of the sliding window has a strong effect on the resulting vector similarities. Large windows tend to produce more topical similarities […], while smaller windows tend to produce more functional and syntactic similarities.
— Page 128, Neural Network Methods in Natural Language Processing, 2017.
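To see how the two models frame the learning problem differently, the sketch below generates training examples from a single sentence. It only builds the (input, target) examples and does not train anything; the function names and the example sentence are assumptions for illustration.

```python
def context_window(tokens, position, window):
    """Words within `window` positions to the left and right of `position`."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW framing: the context words are the input, the current word is the target."""
    return [
        (context_window(tokens, i, window), tokens[i])
        for i in range(len(tokens))
    ]

def skipgram_examples(tokens, window=2):
    """Skip-gram framing: the current word is the input, each surrounding word is a target."""
    return [
        (tokens[i], context_word)
        for i in range(len(tokens))
        for context_word in context_window(tokens, i, window)
    ]

tokens = "the quick brown fox jumps".split()

for context, target in cbow_examples(tokens, window=1):
    print(context, "->", target)

for current, context_word in skipgram_examples(tokens, window=1):
    print(current, "->", context_word)
```

Increasing `window` adds more distant words to each context, which is the configurable parameter discussed above.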
The key benefit of the approach is that high-quality word embeddings can be learned efficiently (low space and time complexity), allowing larger embeddings to be learned (more dimensions) from much larger corpora of text (billions of words).
3. GloVe
The Global Vectors for Word Representation, or GloVe, algorithm is an extension to the word2vec method for efficiently learning word vectors, developed by Pennington, et al. at Stanford.
Classical vector space model representations of words were developed using matrix factorization techniques such as Latent Semantic Analysis (LSA). These do a good job of using global text statistics, but they are not as good as learned methods like word2vec at capturing meaning and demonstrating it on tasks like calculating analogies (e.g. the King and Queen example above).
GloVe is an approach that marries the global statistics of matrix factorization techniques like LSA with the local context-based learning in word2vec.
Rather than learning from one local context window at a time, GloVe constructs an explicit word-context or word co-occurrence matrix using statistics aggregated across the whole text corpus. The resulting model may produce generally better word embeddings.
GloVe, is a new global log-bilinear regression model for the unsupervised learning of word representations that outperforms other models on word analogy, word similarity, and named entity recognition tasks.
— GloVe: Global Vectors for Word Representation, 2014.
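A rough sketch of the two ingredients may help: accumulating co-occurrence counts over the whole corpus, and the weighted squared error that GloVe minimizes for each word pair. The toy corpus and function names are assumptions, the 1/distance weighting and the defaults `x_max=100` and `alpha=0.75` follow the choices reported in the GloVe paper, and the training loop itself is omitted.

```python
import math
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Accumulate distance-weighted co-occurrence counts over the whole corpus."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for distance in range(1, window + 1):
                j = i + distance
                if j < len(tokens):
                    # Symmetric context: count the pair in both directions.
                    counts[(word, tokens[j])] += 1.0 / distance
                    counts[(tokens[j], word)] += 1.0 / distance
    return counts

def weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare pairs and cap the influence of very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the model score and the log co-occurrence count."""
    score = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return weight(x_ij) * (score - math.log(x_ij)) ** 2

# Made-up toy corpus for illustration only.
corpus = ["ice is cold and solid", "steam is hot and gas", "ice is solid"]
counts = cooccurrence(corpus)

print(counts[("ice", "is")])
print(counts[("ice", "solid")])
```

Training would adjust the word vectors and biases to minimize the sum of `pair_loss` over all pairs with a non-zero count.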
Using Word Embeddings
You have some options when it comes time to use word embeddings on your natural language processing project.
This section outlines those options.
1. Learn an Embedding
You may choose to learn a word embedding for your problem.
This will require a large amount of text data to ensure that useful embeddings are learned, such as millions or billions of words.
You have two main options when training your word embedding:
- Learn it Standalone, where a model is trained to learn the embedding, which is saved and used as a part of another model for your task later. This is a good approach if you would like to use the same embedding in multiple models.
- Learn Jointly, where the embedding is learned as part of a large task-specific model. This is a good approach if you only intend to use the embedding on one task.
2. Reuse an Embedding
It is common for researchers to make pre-trained word embeddings available for free, often under a permissive license so that you can use them on your own academic or commercial projects.
For example, both word2vec and GloVe word embeddings are available for free download.
These can be used on your project instead of training your own embeddings from scratch.
You have two main options when it comes to using pre-trained embeddings:
- Static, where the embedding is kept static and is used as a component of your model. This is a suitable approach if the embedding is a good fit for your problem and gives good results.
- Updated, where the pre-trained embedding is used to seed the model, but the embedding is updated jointly during the training of the model. This may be a good option if you are looking to get the most out of the model and embedding on your task.
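The two options can be sketched in plain Python. The example assumes the pre-trained vectors come as plain text with one word per line followed by its values, which is how the downloadable GloVe files are laid out; the words, numbers, gradient, and helper names below are made up for illustration and do not correspond to any specific library.

```python
import io
import random

# Stand-in for a downloaded file: one word per line followed by its vector values.
# The words and numbers are made up for this sketch.
PRETRAINED_TEXT = """\
good 0.10 0.20 0.30
bad -0.10 -0.20 -0.30
movie 0.05 0.00 -0.05
"""

def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dictionary of word -> vector."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if parts:
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors

def build_embedding_table(task_vocab, pretrained, dim, seed=0):
    """Seed each row from the pre-trained vectors; randomly initialize missing words."""
    rng = random.Random(seed)
    return [
        list(pretrained[word]) if word in pretrained
        else [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        for word in task_vocab
    ]

def sgd_step(table, row, gradient, learning_rate, trainable):
    """Static: leave the table untouched. Updated: adjust the row that was used."""
    if not trainable:
        return
    table[row] = [v - learning_rate * g for v, g in zip(table[row], gradient)]

pretrained = load_vectors(io.StringIO(PRETRAINED_TEXT))
task_vocab = ["good", "movie", "plot"]  # "plot" has no pre-trained vector

static_table = build_embedding_table(task_vocab, pretrained, dim=3)
updated_table = build_embedding_table(task_vocab, pretrained, dim=3)

gradient = [1.0, 0.0, -1.0]  # made-up gradient for the row of "good"
sgd_step(static_table, 0, gradient, learning_rate=0.1, trainable=False)
sgd_step(updated_table, 0, gradient, learning_rate=0.1, trainable=True)

print(static_table[0])   # unchanged pre-trained vector
print(updated_table[0])  # pre-trained vector after one update
```

Deep learning libraries typically expose the same choice as a flag on the embedding layer that controls whether its weights receive updates during training.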
Which Option Should You Use?
Explore the different options, and if possible, test to see which gives the best results on your problem.
Perhaps start with fast methods, like using a pre-trained embedding, and only use a new embedding if it results in better performance on your problem.
In this post, you discovered Word Embeddings as a representation method for text in deep learning applications.
Specifically, you learned:
- What the word embedding approach for representing text is and how it differs from other feature extraction methods.
- That there are 3 main algorithms for learning a word embedding from text data.
- That you can either train a new embedding or use a pre-trained embedding on your natural language processing task.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.