How to Develop a Word-Level Neural Language Model and Use it to Generate Text

A language model can predict the probability of the next word in the sequence, based on the words already observed in the sequence.

Neural network models are a preferred method for developing statistical language models for two reasons: they can use a distributed representation in which different words with similar meanings have similar representations, and they can use a large context of recently observed words when making predictions.

In this tutorial, you will discover how to develop a statistical language model using deep learning in Python.

After completing this tutorial, you will know:

  • How to prepare text for developing a word-based language model.
  • How to design and fit a neural language model with a learned embedding and an LSTM hidden layer.
  • How to use the learned language model to generate new text with similar statistical properties as the source text.

Let’s get started.


Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. The Republic by Plato
  2. Data Preparation
  3. Train Language Model
  4. Use Language Model

The Republic by Plato

The Republic is the classical Greek philosopher Plato’s most famous work.

It is structured as a dialog (i.e. a conversation) on the topic of order and justice within a city state.

The entire text is available for free in the public domain. It is available on the Project Gutenberg website in a number of formats.

You can download the ASCII text version of the entire book (or books) here:

Download the book text and place it in your current working directory with the filename ‘republic.txt’.

Open the file in a text editor and delete the front and back matter. This includes details about the book at the beginning, a long analysis, and license information at the end.

The text should begin with:


I went down yesterday to the Piraeus with Glaucon the son of Ariston,

And end with

And it shall be well with us both in this life and in the pilgrimage of a thousand years which we have been describing.

Save the cleaned version as ‘republic_clean.txt’ in your current working directory. The file should be about 15,802 lines of text.

Now we can develop a language model from this text.

Data Preparation

We will start by preparing the data for modeling.

The first step is to look at the data.

Review the Text

Open the text in an editor and just look at the text data.

For example, here is the first piece of dialog:


I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what manner they would
celebrate the festival, which was a new thing. I was delighted with the
procession of the inhabitants; but that of the Thracians was equally,
if not more, beautiful. When we had finished our prayers and viewed the
spectacle, we turned in the direction of the city; and at that instant
Polemarchus the son of Cephalus chanced to catch sight of us from a
distance as we were starting on our way home, and told his servant to
run and bid us wait for him. The servant took hold of me by the cloak
behind, and said: Polemarchus desires you to wait.

I turned round, and asked him where his master was.

There he is, said the youth, coming after you, if you will only wait.

Certainly we will, said Glaucon; and in a few minutes Polemarchus
appeared, and with him Adeimantus, Glaucon’s brother, Niceratus the son
of Nicias, and several others who had been at the procession.

Polemarchus said to me: I perceive, Socrates, that you and your
companion are already on your way to the city.

You are not far wrong, I said.

What do you see that we will need to handle in preparing the data?

Here’s what I see from a quick look:

  • Book/Chapter headings (e.g. “BOOK I.”).
  • British English spelling (e.g. “honoured”)
  • Lots of punctuation (e.g. “–“, “;–“, “?–“, and more)
  • Strange names (e.g. “Polemarchus”).
  • Some long monologues that go on for hundreds of lines.
  • Some quoted dialog (e.g. ‘…’)

These observations, and more, suggest ways in which we may wish to prepare the text data.

The specific way we prepare the data really depends on how we intend to model it, which in turn depends on how we intend to use it.

Language Model Design

In this tutorial, we will develop a model of the text that we can then use to generate new sequences of text.

The language model will be statistical and will predict the probability of each word given an input sequence of text. The predicted word will be fed in as input to in turn generate the next word.

A key design decision is how long the input sequences should be. They need to be long enough to allow the model to learn the context for the words to predict. This input length will also define the length of seed text used to generate new sequences when we use the model.

There is no correct answer. With enough time and resources, we could explore the ability of the model to learn with differently sized input sequences.

Instead, we will pick a length of 50 words for the length of the input sequences, somewhat arbitrarily.

We could process the data so that the model only ever deals with self-contained sentences and pad or truncate the text to meet this requirement for each input sequence. You could explore this as an extension to this tutorial.

Instead, to keep the example brief, we will let all of the text flow together and train the model to predict the next word across sentences, paragraphs, and even books or chapters in the text.

Now that we have a model design, we can look at transforming the raw text into sequences of 50 input words to 1 output word, ready to fit a model.

Load Text

The first step is to load the text into memory.

We can develop a small function to load the entire text file into memory and return it. The function is called load_doc() and is listed below. Given a filename, it returns a sequence of loaded text.
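
The listing itself is not reproduced here, so the following is a minimal sketch of what such a function might look like. The UTF-8 encoding is an assumption; it reads plain ASCII files unchanged.

```python
def load_doc(filename):
    """Read an entire text file into memory and return it as one string."""
    # Assumption: UTF-8 is used here because it also decodes pure ASCII text.
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read()
```

With the cleaned file in place, calling load_doc('republic_clean.txt') and printing the first 200 characters of the result gives a quick sanity check.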

Using this function, we can load the cleaner version of the document in the file ‘republic_clean.txt‘ as follows:

Running this snippet loads the document and prints the first 200 characters as a sanity check.


I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what

So far, so good. Next, let’s clean the text.

Clean Text

We need to transform the raw text into a sequence of tokens or words that we can use as a source to train the model.

Based on reviewing the raw text (above), below are some specific operations we will perform to clean the text. You may want to explore more cleaning operations yourself as an extension.

  • Replace ‘–‘ with a white space so we can split words better.
  • Split words based on white space.
  • Remove all punctuation from words to reduce the vocabulary size (e.g. ‘What?’ becomes ‘What’).
  • Remove all words that are not alphabetic to remove standalone punctuation tokens.
  • Normalize all words to lowercase to reduce the vocabulary size.

Vocabulary size is a big deal with language modeling. A smaller vocabulary results in a smaller model that trains faster.

We can implement each of these cleaning operations in this order in a function. Below is the function clean_doc() that takes a loaded document as an argument and returns an array of clean tokens.
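
As a rough sketch of those five steps in order, a version of clean_doc() could look like the one below. It assumes the dash in the raw ASCII text is written as a double hyphen (an en dash is handled too, in case the file was converted), and it only strips ASCII punctuation.

```python
import string


def clean_doc(doc):
    """Turn raw text into a list of lowercase, alphabetic tokens."""
    # Assumption: dashes appear as '--' in the ASCII file; also cover an en dash.
    doc = doc.replace('--', ' ').replace('\u2013', ' ')
    # Split into tokens on white space.
    tokens = doc.split()
    # Remove ASCII punctuation from each token.
    table = str.maketrans('', '', string.punctuation)
    tokens = [word.translate(table) for word in tokens]
    # Drop tokens that are not purely alphabetic (numbers, leftover symbols, empties).
    tokens = [word for word in tokens if word.isalpha()]
    # Normalize to lowercase.
    return [word.lower() for word in tokens]
```

The total number of tokens is then len(tokens) and the vocabulary size is len(set(tokens)).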

We can run this cleaning operation on our loaded document and print out some of the tokens and statistics as a sanity check.

First, we can see a nice list of tokens that look cleaner than the raw text. We could remove the ‘Book I‘ chapter markers and more, but this is a good start.

We also get some statistics about the clean document.

We can see that there are just under 120,000 words in the clean text and a vocabulary of just under 7,500 words. This is smallish and models fit on this data should be manageable on modest hardware.

Next, we can look at shaping the tokens into sequences and saving them to file.

Save Clean Text

We can organize the long list of tokens into sequences of 50 input words and 1 output word.

That is, sequences of 51 words.

We can do this by iterating over the list of tokens from token 51 onwards and taking the prior 50 tokens as a sequence, then repeating this process to the end of the list of tokens.

We will transform the tokens into space-separated strings for later storage in a file.

The code to split the list of clean tokens into sequences with a length of 51 tokens is listed below.
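
One way to sketch this step is a sliding window over the token list. The function name make_sequences() and its input_length parameter are hypothetical, and this version keeps every window of 51 consecutive tokens, so the exact number of sequences depends on the loop bounds chosen.

```python
def make_sequences(tokens, input_length=50):
    """Return space-separated lines of input_length + 1 consecutive tokens."""
    length = input_length + 1  # 50 input words plus 1 output word
    sequences = []
    # i is the exclusive end of each window, so tokens[i - length:i] has `length` items.
    for i in range(length, len(tokens) + 1):
        sequences.append(' '.join(tokens[i - length:i]))
    return sequences
```

Each line starts one token later than the previous one, and the last word on a line is the word to be predicted from the 50 words before it.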

Running this piece creates a long list of lines.

Printing statistics on the list, we can see that we will have exactly 118,633 training patterns to fit our model.

Next, we can save the sequences to a new file for later loading.

We can define a new function for saving lines of text to a file. This new function is called save_doc() and is listed below. It takes as input a list of lines and a filename. The lines are written, one per line, in ASCII format.
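
A minimal sketch of such a function is shown below. It writes UTF-8, which is an assumption made so that any stray non-ASCII letter does not cause an error; for pure ASCII tokens the bytes written are the same.

```python
def save_doc(lines, filename):
    """Write a list of strings to a file, one per line."""
    data = '\n'.join(lines)
    # Assumption: UTF-8 output; identical to ASCII when all tokens are ASCII.
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(data)
```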

We can call this function and save our training sequences to the file ‘republic_sequences.txt‘.

Take a look at the file with your text editor.

You will see that each line is shifted along one word, with a new word at the end to be predicted; for example, here are the first 3 lines in truncated form:

book i i … catch sight of
i i went … sight of us
i went down … of us from

Complete Example

Tying all of this together, the complete code listing is provided below.

You should now have training data stored in the file ‘republic_sequences.txt‘ in your current working directory.

Next, let’s look at how to fit a language model to this data.

Train Language Model

We can now train a statistical language model from the prepared data.

The model we will train is a neural language model. It has a few unique characteristics:

  • It uses a distributed representation for words so that different words with similar meanings will have a similar representation.
  • It learns the representation at the same time as learning the model.
  • It learns to predict the probability for the next word using the context of the last 50 words.

Specifically, we will use an Embedding Layer to learn the representation of words, and a Long Short-Term Memory (LSTM) recurrent neural network to learn to predict words based on their context.

Let’s start by loading our training data.

Load Sequences

We can load our training data using the load_doc() function we developed in the previous section.

Once loaded, we can split the data into separate training sequences by splitting based on new lines.

The snippet below will load the ‘republic_sequences.txt‘ data file from the current working directory.

Next, we can encode the training data.

Encode Sequences

The word embedding layer expects input sequences to be composed of integers.

We can map each word in our vocabulary to a unique integer and encode our input sequences. Later, when we make predictions, we can convert the prediction to numbers and look up their associated words in the same mapping.

To do this encoding, we will use the Tokenizer class in the Keras API.

First, the Tokenizer must be trained on the entire training dataset, which means it finds all of the unique words in the data and assigns each a unique integer.

We can then use the fit Tokenizer to encode all of the training sequences, converting each sequence from a list of words to a list of integers.

We can access the mapping of words to integers as a dictionary attribute called word_index on the Tokenizer object.

We need to know the size of the vocabulary for defining the embedding layer later. We can determine the vocabulary by calculating the size of the mapping dictionary.

Words are assigned values from 1 to the total number of words (e.g. 7,409). Because array indexing is zero-offset, the Embedding layer needs one entry for every index from 0 up to the largest word index. If the word at the end of the vocabulary has the index 7,409, the array must be 7,409 + 1 in length.

Therefore, when specifying the vocabulary size to the Embedding layer, we specify it as 1 larger than the actual vocabulary.
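
To make the mapping concrete, here is a pure-Python stand-in for the encoding step. It is not the Keras Tokenizer class (there the equivalent calls are fit_on_texts() and texts_to_sequences()); the helper names fit_word_index() and encode() are hypothetical. Like the Tokenizer, it numbers words from 1, most frequent first.

```python
from collections import Counter


def fit_word_index(lines):
    """Map each unique word to an integer, starting at 1, most frequent first."""
    counts = Counter(word for line in lines for word in line.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(lines, word_index):
    """Convert each line of words to a list of integers."""
    return [[word_index[word] for word in line.split()] for line in lines]


lines = ['the cat sat', 'the cat ran']
word_index = fit_word_index(lines)
sequences = encode(lines, word_index)
# Index 0 is never assigned to a word, so the embedding needs one extra row.
vocab_size = len(word_index) + 1
```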

Sequence Inputs and Output

Now that we have encoded the input sequences, we need to separate them into input (X) and output (y) elements.

We can do this with array slicing.

After separating, we need to one hot encode the output word. This means converting it from an integer to a vector of 0 values, one for each word in the vocabulary, with a 1 at the index of the word’s integer value.

This is so that the model learns to predict the probability distribution for the next word; the ground truth from which it learns is 0 for all words except the actual word that comes next.

Keras provides the to_categorical() function, which can be used to one hot encode the output words for each input-output sequence pair.

Finally, we need to specify to the Embedding layer how long input sequences are. We know that there are 50 words because we designed the model, but a good generic way to specify that is to use the second dimension (number of columns) of the input data’s shape. That way, if you change the length of sequences when preparing data, you do not need to change this data loading code; it is generic.
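
The slicing and one hot encoding can be illustrated with plain Python lists. This is only a sketch with hypothetical helper names; with a NumPy array of sequences the same split is sequences[:, :-1] and sequences[:, -1], and to_categorical(y, num_classes=vocab_size) does the encoding.

```python
def split_xy(sequences):
    """Separate each encoded sequence into its input words and its output word."""
    X = [seq[:-1] for seq in sequences]
    y = [seq[-1] for seq in sequences]
    return X, y


def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1 at the given word index."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector


X, y = split_xy([[1, 2, 3], [2, 3, 4]])
y_encoded = [one_hot(index, 5) for index in y]
# The input length is read from the data rather than hard-coded.
seq_length = len(X[0])
```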

Fit Model

We can now define and fit our language model on the training data.

The learned embedding needs to know the size of the vocabulary and the length of input sequences as previously discussed. It also has a parameter to specify how many dimensions will be used to represent each word. That is, the size of the embedding vector space.

Common values are 50, 100, and 300. We will use 50 here, but consider testing smaller or larger values.

We will use two LSTM hidden layers with 100 memory cells each. More memory cells and a deeper network may achieve better results.

A dense fully connected layer with 100 neurons connects to the LSTM hidden layers to interpret the features extracted from the sequence. The output layer predicts the next word as a single vector the size of the vocabulary with a probability for each word in the vocabulary. A softmax activation function is used to ensure the outputs have the characteristics of normalized probabilities.

A summary of the defined network is printed as a sanity check to ensure we have constructed what we intended.

Next, the model is compiled specifying the categorical cross entropy loss needed to fit the model. Technically, the model is learning a multi-class classification and this is the suitable loss function for this type of problem. The efficient Adam implementation of mini-batch gradient descent is used and the accuracy of the model is evaluated.

Finally, the model is fit on the data for 100 training epochs with a modest batch size of 128 to speed things up.
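
The network described above might be defined along the following lines. This is a sketch, assuming the Keras API bundled with TensorFlow; the function name define_model() is hypothetical, the relu activation on the dense layer is an assumption since the text does not name one, and the input length is declared with an Input object (older Keras versions instead pass input_length to the Embedding layer).

```python
def define_model(vocab_size, seq_length):
    """Build and compile the embedding + two-LSTM language model."""
    # Imported here so that defining the function does not itself require TensorFlow.
    from tensorflow.keras import Input
    from tensorflow.keras.layers import LSTM, Dense, Embedding
    from tensorflow.keras.models import Sequential

    model = Sequential([
        Input(shape=(seq_length,)),               # one integer per input word
        Embedding(vocab_size, 50),                # 50-dimensional learned word vectors
        LSTM(100, return_sequences=True),         # first LSTM passes the full sequence on
        LSTM(100),                                # second LSTM returns its final output only
        Dense(100, activation='relu'),            # assumption: relu is not stated in the text
        Dense(vocab_size, activation='softmax'),  # probability for each word in the vocabulary
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
```

After model = define_model(vocab_size, seq_length), calling model.summary() prints the structure, and model.fit(X, y, batch_size=128, epochs=100) trains it with the settings described, where X and y are NumPy arrays.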

Training may take a few hours on modern hardware without GPUs. You can speed it up with a larger batch size and/or fewer training epochs.

During training, you will see a summary of performance, including the loss and accuracy evaluated from the training data at the end of each batch update.

You will get different results, but perhaps an accuracy of just over 50% of predicting the next word in the sequence, which is not bad. We are not aiming for 100% accuracy (e.g. a model that memorized the text), but rather a model that captures the essence of the text.

Save Model

At the end of the run, the trained model is saved to file.

Here, we use the Keras model API to save the model to the file ‘model.h5‘ in the current working directory.

Later, when we load the model to make predictions, we will also need the mapping of words to integers. This is in the Tokenizer object, and we can save that too using Pickle.
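
Saving the model is a single call, model.save('model.h5'). For the tokenizer, a small pair of pickle helpers is enough. The sketch below uses hypothetical names (save_object(), load_object(), and the filename 'tokenizer.pkl' in the usage note) and works for any picklable Python object.

```python
import pickle


def save_object(obj, filename):
    """Serialize a Python object (e.g. the fitted tokenizer) to a file."""
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)


def load_object(filename):
    """Load a previously pickled object back into memory."""
    with open(filename, 'rb') as f:
        return pickle.load(f)
```

For example, save_object(tokenizer, 'tokenizer.pkl') after training and tokenizer = load_object('tokenizer.pkl') before generating text. Only unpickle files you created yourself, since pickle can execute code on load.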

Complete Example

We can put all of this together; the complete example for fitting the language model is listed below.

Use Language Model

Now that we have a trained language model, we can use it.

In this case, we can use it to generate new sequences of text that have the same statistical properties as the source text.

This is not practical, at least not for this example, but it gives a concrete example of what the language model has learned.

We will start by loading the training sequences again.

Load Data

We can use the same code from the previous section to load the training data sequences of text.

Specifically, the load_doc() function.

We need the text so that we can choose a source sequence as input to the model for generating a new sequence of text.

The model will require 50 words as input.

Later, we will need to specify the expected length of input. We can determine this from the input sequences by calculating the length of one line of the loaded data and subtracting 1 for the expected output word that is also on the same line.
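
As a small sketch of this step, the loaded text can be split into lines and the input length derived from the first one. The helper names split_lines() and input_length() are hypothetical.

```python
def split_lines(doc):
    """Split the loaded file contents into one training sequence per line."""
    return [line for line in doc.split('\n') if line]


def input_length(lines):
    """Number of input words per line: all words minus the 1 output word."""
    return len(lines[0].split()) - 1
```

With the sequences prepared earlier, each line holds 51 words, so input_length(lines) gives 50.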

Load Model

We can now load the model from file.

Keras provides the load_model() function for loading the model, ready for use.

We can also load the tokenizer from file using the Pickle API.

We are ready to use the loaded model.

Generate Text

The first step in generating text is preparing a seed input.

We will select a random line of text from the input text for this purpose. Once selected, we will print it so that we have some idea of what was used.

Next, we can generate new words, one at a time.

First, the seed text must be encoded to integers using the same tokenizer that we used when training the model.

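The model can predict the next word directly by calling model.predict_classes(), which returns the index of the word with the highest probability. Note that predict_classes() has been removed from recent Keras releases; there, taking the argmax of the output of model.predict() gives the same index.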

We can then look up the index in the Tokenizer’s mapping to get the associated word.

We can then append this word to the seed text and repeat the process.

Importantly, as words are appended, the input sequence grows beyond the length the model expects. We can truncate it to the desired length after the input sequence has been encoded to integers. Keras provides the pad_sequences() function that we can use to perform this truncation.

We can wrap all of this into a function called generate_seq() that takes as input the model, the tokenizer, input sequence length, the seed text, and the number of words to generate. It then returns a sequence of words generated by the model.
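
The loop can be sketched without Keras to show the mechanics. In this version the model and tokenizer are replaced by two stand-ins, both assumptions for illustration: predict_next is any callable that maps a list of seq_length integers to a word index, and word_index is the word-to-integer dictionary. The manual truncation and zero padding at the front mirror the default behaviour of pad_sequences().

```python
def generate_seq(predict_next, word_index, seq_length, seed_text, n_words):
    """Generate n_words new words, feeding each prediction back in as input."""
    index_word = {index: word for word, index in word_index.items()}
    in_text = seed_text
    result = []
    for _ in range(n_words):
        # Encode the text so far as integers, skipping words outside the vocabulary.
        encoded = [word_index[w] for w in in_text.split() if w in word_index]
        # Keep only the most recent seq_length words; pad with zeros at the front if short.
        encoded = encoded[-seq_length:]
        encoded = [0] * (seq_length - len(encoded)) + encoded
        # Ask the model for the index of the most likely next word.
        yhat = predict_next(encoded)
        out_word = index_word.get(yhat, '')
        # Append the predicted word so it becomes part of the next input.
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)
```

With a trained Keras model, predict_next could wrap model.predict() on a batch of one sequence and return the argmax of the resulting probabilities.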

We are now ready to generate a sequence of new words given some seed text.

Putting this all together, the complete code listing for generating text from the learned language model is listed below.

Running the example first prints the seed text.

when he said that a man when he grows old may learn many things for he can no more learn much than he can run much youth is the time for any extraordinary toil of course and therefore calculation and geometry and all the other elements of instruction which are a

Then 50 words of generated text are printed.

preparation for dialectic should be presented to the name of idle spendthrifts of whom the other is the manifold and the unjust and is the best and the other which delighted to be the opening of the soul of the soul and the embroiderer will have to be said at

You will get different results. Try running the generation piece a few times.

You can see that the text seems reasonable. In fact, the addition of concatenation would help in interpreting the seed and the generated text. Nevertheless, the generated text gets the right kind of words in the right kind of order.

Try running the example a few times to see other examples of generated text. Let me know in the comments below if you see anything interesting.


Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Sentence-Wise Model. Split the raw data based on sentences and pad each sentence to a fixed length (e.g. the longest sentence length).
  • Simplify Vocabulary. Explore a simpler vocabulary, perhaps with stemmed words or stop words removed.
  • Tune Model. Tune the model, such as the size of the embedding or number of memory cells in the hidden layer, to see if you can develop a better model.
  • Deeper Model. Extend the model to have multiple LSTM hidden layers, perhaps with dropout to see if you can develop a better model.
  • Pre-Trained Word Embedding. Extend the model to use pre-trained word2vec or GloVe vectors to see if it results in a better model.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.


Summary

In this tutorial, you discovered how to develop a word-based language model using a word embedding and a recurrent neural network.

Specifically, you learned:

  • How to prepare text for developing a word-based language model.
  • How to design and fit a neural language model with a learned embedding and an LSTM hidden layer.
  • How to use the learned language model to generate new text with similar statistical properties as the source text.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.