Teacher forcing is a method for quickly and efficiently training recurrent neural network models that use the output from a prior time step as input.
It is a network training method critical to the development of deep learning language models used in machine translation, text summarization, and image captioning, among many other applications.
In this post, you will discover teacher forcing as a method for training recurrent neural networks.
After reading this post, you will know:
 The problem with training recurrent neural networks that use output from prior time steps as input.
 The teacher forcing method for addressing slow convergence and instability when training these types of recurrent networks.
 Extensions to teacher forcing that allow trained models to better handle open loop applications of this type of network.
Let’s get started.
Using Output as Input in Sequence Prediction
There are sequence prediction models that use the output from the last time step y(t-1) as input for the model at the current time step X(t).
This type of model is common in language models that output one word at a time and use the output word as input for generating the next word in the sequence.
For example, this type of language model is used in an Encoder-Decoder recurrent neural network architecture for sequence-to-sequence generation problems such as:
 Machine Translation
 Caption Generation
 Text Summarization
After the model is trained, a “start-of-sequence” token can be used to start the process and the generated word in the output sequence is used as input on the subsequent time step, perhaps along with other input like an image or a source text.
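As a rough sketch of this generation loop, the Python below uses a hypothetical lookup table, `NEXT_WORD`, to stand in for a trained model. The table, the function name `generate`, and the `max_len` cap are all assumptions made for illustration and are not part of any library.

```python
# Toy stand-in for a trained model: maps the previous word to the next word.
# This lookup table is hypothetical; a real model would output probabilities.
NEXT_WORD = {
    "[START]": "Mary",
    "Mary": "had",
    "had": "a",
    "a": "little",
    "little": "lamb",
    "lamb": "[END]",
}


def generate(next_word, start="[START]", end="[END]", max_len=20):
    """Generate a sequence by feeding each output back in as the next input."""
    sequence = []
    current = start
    for _ in range(max_len):
        current = next_word[current]  # model output at this time step
        if current == end:
            break
        sequence.append(current)  # this output is the input on the next step
    return sequence


print(" ".join(generate(NEXT_WORD)))
```

The `max_len` cap is there because a real model is not guaranteed to ever emit the end token.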
This same recursive output-as-input process can be used when training the model, but it can result in problems such as:
 Slow convergence.
 Model instability.
 Poor skill.
Teacher forcing is an approach to improve model skill and stability when training these types of models.
What is Teacher Forcing?
Teacher forcing is a strategy for training recurrent neural networks that uses the ground truth from a prior time step as input, instead of the model output from that time step.
Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing.
— Page 372, Deep Learning, 2016.
The approach was originally described and developed as an alternative technique to backpropagation through time for training a recurrent neural network.
An interesting technique that is frequently used in dynamical supervised learning tasks is to replace the actual output y(t) of a unit by the teacher signal d(t) in subsequent computation of the behavior of the network, whenever such a value exists. We call this technique teacher forcing.
— A Learning Algorithm for Continually Running Fully Recurrent Neural Networks, 1989.
Teacher forcing works by using the actual or expected output from the training dataset at the current time step y(t) as input in the next time step X(t+1), rather than the output generated by the network.
Teacher forcing is a procedure […] in which during training the model receives the ground truth output y(t) as input at time t + 1.
— Page 372, Deep Learning, 2016.
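One way to picture this is as a shift by one position: the input at each step is the ground-truth output of the step before. The sketch below assumes an already tokenized target sequence and uses a hypothetical helper name, `teacher_forcing_pairs`. It is an illustration, not the API of any framework.

```python
def teacher_forcing_pairs(tokens):
    """Pair each ground-truth token y(t) with the token expected at t + 1."""
    inputs = tokens[:-1]   # fed to the model: y(t) taken from the training data
    targets = tokens[1:]   # expected output at the next step: y(t + 1)
    return list(zip(inputs, targets))


tokens = ["<start>", "the", "cat", "sat", "<end>"]
for x, y in teacher_forcing_pairs(tokens):
    print(f"input: {x:8} target: {y}")
```

Note that the model's own predictions never appear in the inputs; they are only compared against the targets to compute the error.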
Worked Example
Let’s make teacher forcing concrete with a short worked example.
Given the following input sequence:

Mary had a little lamb whose fleece was white as snow 
Imagine we want to train a model to generate the next word in the sequence given the previous sequence of words.
First, we must add a token to signal the start of the sequence and another to signal the end of the sequence. We will use “[START]” and “[END]” respectively.

[START] Mary had a little lamb whose fleece was white as snow [END] 
Next, we feed the model “[START]” and let the model generate the next word.
Imagine the model generates the word “a”, but of course, we expected “Mary”.
Naively, we could feed in “a” as part of the input to generate the subsequent word in the sequence.
You can see that the model is off track and is going to get punished for every subsequent word it generates. This makes learning slower and the model unstable.
Instead, we can use teacher forcing.
In the first step, when the model generated “a” as output, we can discard this output after calculating the error and feed in “Mary” as part of the input on the subsequent time step.
We can then repeat this process for each input-output pair of words.

X, yhat
[START], ?
[START], Mary, ?
[START], Mary, had, ?
[START], Mary, had, a, ?
…
The model will learn the correct sequence, or correct statistical properties for the sequence, quickly.
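To compare the two feeding strategies side by side, here is a small sketch in plain Python. The `untrained_model` function is a made-up stand-in that always predicts “a”, mimicking the mistake in the example above. The `unroll` function and its `teacher_forcing` flag are likewise illustrative names, and no learning actually takes place here.

```python
def untrained_model(prefix):
    """Hypothetical untrained model: ignores its input and always predicts 'a'."""
    return "a"


def unroll(model, target, teacher_forcing):
    """Run the model over one training sequence, recording each input prefix."""
    prefix = ["[START]"]
    history = []
    for expected in target:
        predicted = model(prefix)
        history.append((list(prefix), predicted, expected))
        # Teacher forcing feeds the ground truth; otherwise feed the model's own output.
        prefix.append(expected if teacher_forcing else predicted)
    return history


target = "Mary had a little".split() + ["[END]"]
for forced in (False, True):
    print("teacher forcing" if forced else "free running")
    for prefix, predicted, expected in unroll(untrained_model, target, forced):
        print(f"  X={', '.join(prefix)} | yhat={predicted} | y={expected}")
```

With teacher forcing, the recorded prefixes match the table above. Without it, the prefixes fill up with the model's own errors, so every later prediction is conditioned on a sequence that never occurs in the training data.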
Extensions to Teacher Forcing
Teacher forcing is a fast and effective way to train a recurrent neural network that uses output from prior time steps as input to the model.
However, the approach can produce models that are fragile or limited in practice when the generated sequences differ from what the model saw during training.
This is common in most applications of this type of model as the outputs are probabilistic in nature. This type of application of the model is often called open loop.
Unfortunately, this procedure can result in problems in generation as small prediction error compound in the conditioning context. This can lead to poor prediction performance as the RNN’s conditioning context (the sequence of previously generated samples) diverge from sequences seen during training.
– Professor Forcing: A New Algorithm for Training Recurrent Networks, 2016.
There are a number of approaches to address this limitation, for example:
Search Candidate Output Sequences
One approach commonly used for models that predict a discrete value output, such as a word, is to perform a search across the predicted probabilities for each word to generate a number of likely candidate output sequences.
This approach is used on problems like machine translation to refine the translated output sequence.
A common search procedure for this post-hoc operation is the beam search.
This discrepancy can be mitigated by the use of a beam search heuristic maintaining several generated target sequences
— Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, 2015.
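As a minimal sketch of the idea, the beam search below keeps a fixed number of the most probable partial sequences at each step. The `toy_probs` function is a hypothetical next-word distribution invented for this example, and the names `beam_search`, `beam_width`, and `max_len` are assumptions rather than any library's API.

```python
import math


def beam_search(next_probs, beam_width, max_len, end="[END]"):
    """Keep the beam_width highest-scoring partial sequences at each step."""
    beams = [([], 0.0)]  # each beam is (sequence, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == end:
                candidates.append((seq, score))  # finished sequences carry over
                continue
            for token, p in next_probs(seq).items():
                candidates.append((seq + [token], score + math.log(p)))
        # Rank all expansions and keep only the best beam_width of them.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams


def toy_probs(seq):
    """Hypothetical next-word distribution conditioned only on the last word."""
    table = {
        None: {"a": 0.6, "the": 0.4},
        "a": {"cat": 0.45, "dog": 0.55},
        "the": {"cat": 0.9, "dog": 0.1},
    }
    last = seq[-1] if seq else None
    return table.get(last, {"[END]": 1.0})


for seq, score in beam_search(toy_probs, beam_width=2, max_len=3):
    print(" ".join(seq), round(math.exp(score), 2))
```

In this toy distribution, always taking the single most likely word (a beam width of 1) starts with “a” and ends up with a less probable sequence overall than the one starting with “the”, which a beam width of 2 is able to find.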
Curriculum Learning
The beam search approach is only suitable for prediction problems with discrete output values and cannot be used for real-valued outputs.
A variation of forced learning is to introduce outputs generated from prior time steps during training to encourage the model to learn how to correct its own mistakes.
We propose to change the training process in order to gradually force the model to deal with its own mistakes, as it would have to during inference.
— Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, 2015.
The approach is called curriculum learning and involves randomly choosing to use the ground truth output or the generated output from the previous time step as input for the current time step.
The curriculum changes over time in what is called scheduled sampling, where the procedure starts with forced learning and slowly decreases the probability of a forced input over the training epochs.
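The sketch below illustrates the random choice and the decaying probability in plain Python. The linear decay in `forcing_probability` is an assumption chosen for simplicity (other decay schedules are possible), and the function names and the fixed `guess` outputs are hypothetical stand-ins for a real model and training loop.

```python
import random


def forcing_probability(epoch, total_epochs):
    """Linearly decay the chance of a forced (ground-truth) input from 1.0 to 0.0."""
    return max(0.0, 1.0 - epoch / max(1, total_epochs - 1))


def choose_inputs(ground_truth, generated, p_forced, rng):
    """For each time step, pick the ground-truth token with probability p_forced."""
    return [
        truth if rng.random() < p_forced else guess
        for truth, guess in zip(ground_truth, generated)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
truth = ["Mary", "had", "a", "little", "lamb"]
guess = ["a", "a", "a", "a", "a"]  # hypothetical outputs from a weak model
for epoch in range(5):
    p = forcing_probability(epoch, total_epochs=5)
    print(epoch, round(p, 2), choose_inputs(truth, guess, p, rng))
```

In the first epoch every input is forced, which is plain teacher forcing. By the last epoch every input comes from the model itself, which matches how the model is used at inference time.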
There are also other extensions and variations of teacher forcing and I encourage you to explore them if you are interested.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
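Papers
 A Learning Algorithm for Continually Running Fully Recurrent Neural Networks, 1989.
 Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, 2015.
 Professor Forcing: A New Algorithm for Training Recurrent Networks, 2016.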
Books
 Section 10.2.1, Teacher Forcing and Networks with Output Recurrence, Deep Learning, 2016.
Summary
In this post, you discovered teacher forcing as a method for training recurrent neural networks that use output from a previous time step as input.
Specifically, you learned:
 The problem with training recurrent neural networks that use output from prior time steps as input.
 The teacher forcing method for addressing slow convergence and instability when training these types of recurrent networks.
 Extensions to teacher forcing that allow trained models to better handle open loop applications of this type of network.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.