How to Learn to Add Numbers with seq2seq Recurrent Neural Networks

Long Short-Term Memory (LSTM) networks are a type of Recurrent Neural Network (RNN) that are capable of learning the relationships between elements in an input sequence.

A good demonstration of LSTMs is to learn how to combine multiple terms together using a mathematical operation like a sum and outputting the result of the calculation.

A common mistake made by beginners is to simply learn the mapping function from input term to the output term. A good demonstration of LSTMs on such a problem involves learning the sequenced input of characters (“50+11”) and predicting the sequence output in characters (“61”). This hard problem can be learned with LSTMs using the sequence-to-sequence, or seq2seq (encoder-decoder), stacked LSTM configuration.

In this tutorial, you will discover how to address the problem of adding sequences of randomly generated integers using LSTMs.

After completing this tutorial, you will know:

  • How to learn the naive mapping function of input terms to output terms for addition.
  • How to frame the addition problem (and similar problems) and suitably encode inputs and outputs.
  • How to address the true sequence-prediction addition problem using the seq2seq paradigm.

Let’s get started.

How to Learn to Add Numbers with seq2seq Recurrent Neural Networks
Photo by Lima Pix, some rights reserved.

Tutorial Overview

This tutorial is divided into 3 parts; they are:

  1. Adding Numbers
  2. Addition as a Mapping Problem (the beginner’s mistake)
  3. Addition as a seq2seq Problem


This tutorial assumes a Python 2 or Python 3 development environment with SciPy, NumPy, Pandas installed.

The tutorial also assumes scikit-learn and Keras v2.0+ are installed with either the Theano or TensorFlow backend.

If you need help with your environment, see the post:

Adding Numbers

The task is that, given a sequence of randomly selected integers, to return the sum of those integers.

For example, given 10 + 5, the model should output 15.

The model is to be both trained and tested on randomly generated examples so that the general problem of adding numbers is learned, rather than memorization of specific cases.

Addition as a Mapping Problem
(the beginner’s mistake)

In this section, we will work through the problem and solve it using an LSTM and show how easy it is to make the beginner’s mistake and not harness the power of recurrent neural networks.

Data Generation

Let’s start off by defining a function to generate sequences of random integers and their sum as training and test data.

We can use the randint() function to generate random integers between a min and max value, such as between 1 and 100. We can then sum the sequence. This process can be repeated for a fixed number of times to create pairs of input sequences of numbers and matching output summed values.

For example, this snippet will create 100 examples of adding 2 numbers between 1 and 100:

Running the example will print each input-output pair.

Once we have the patterns, we can convert the lists to NumPy Arrays and rescale the values. We must rescale the values to fit within the bounds of the activation used by the LSTM.

For example:

Putting this all together, we can define the function random_sum_pairs() that takes a specified number of examples, a number of integers in each sequence, and the largest integer to generate and return X, y pairs of data for modeling.

We may want to invert the rescaling of numbers later. This will be useful to compare predictions to expected values and get an idea of an error score in the same units as the original data.

The invert() function below inverts the normalization of predicted and expected values passed in.

Configure LSTM

We can now define an LSTM to model this problem.

It’s a relatively simple problem, so the model does not need to be very large. The input layer will expect 1 input feature and 2 time steps (in the case of adding two numbers).

Two hidden LSTM layers are defined, the first with 6 units and the second with 2 units, followed by a fully connected output layer that returns a single sum value.

The efficient ADAM optimization algorithm is used to fit the model along with the mean squared error loss function given the real valued output of the network.

The network is fit for 50 epochs, new examples are generated each epoch and weight updates are performed after every 2 examples.

LSTM Evaluation

We evaluate the network on 100 new patterns.

These are generated and a sum value is predicted for each. Both the actual and predicted sum values are rescaled to the original range and a Root Mean Squared Error (RMSE) score is calculated that has the same scale as the original values. Finally, some 20 examples of expected and predicted values are listed as examples.

Finally, 20 examples of expected and predicted values are listed as examples.

Complete Example

We can tie this all together. The complete code example is listed below.

Running the example prints some loss information each epoch and finishes by printing the RMSE for the run and some example outputs.

The results are not perfect, but many examples are predicted correctly.

Your specific outputs may differ given the stochastic nature of neural networks.

The Beginner’s Mistake

All done, right?


The problem we have solved had multiple inputs but was technically not a sequence prediction problem.

In fact, you can just as easily solve it using a multilayer Perceptron (MLP). For example:

Running the example solves the problem perfectly, and in fewer epochs.

The issue is that we encoded so much of the domain into the problem that it turned the problem from a sequence prediction problem into a function mapping problem.

That is, the order of the input no longer matters. We could shuffle it up any way we want and still learn the problem.

MLPs are designed to learn mapping functions and can easily nail the problem of learning how to add numbers.

On one hand, this is a better way to approach the specific problem of adding numbers because the model is simpler and the results are better. On the other, it is a terrible use of recurrent neural networks.

This is a beginner’s mistake I see replicated in many “introduction to LSTMs” around the web.

Addition as a Sequence Prediction Problem

There is another way to frame addition that makes it an unambiguous sequence prediction problem, and in turn makes it much harder to solve.

We can frame addition as an input and output string of characters and let the model figure out the meaning of the characters. The entire addition problem can be framed as a string of characters, such as “12+50” with the output “62”, or more specifically:

  • Input: [‘1’, ‘2’, ‘+’, ‘5’, ‘0’]
  • Output: [‘6’, ‘2’]

The model must learn not only the integer nature of the characters, but also the nature of the mathematical operation to perform.

Notice how sequence is now important, and that randomly shuffling the input will create a nonsense sequence that could not be related to the output sequence.

Also notice how the problem has transformed to have both an input and an output sequence. This is called a sequence-to-sequence prediction problem, or a seq2seq problem.

We can keep things simple with addition of two numbers, but we can see how this may be scaled to a variable number of terms and mathematical operations that could be given as input for the model to learn and generalize.

Note that this formation and the rest of this example is inspired by the addition seq2seq example in the Keras project, although I re-developed it from the ground up.

Data Generation

Data generation for the seq2seq definition of the problem is a lot more involved.

We will develop each piece as a standalone function so you can play with them and understand how they work. Hang in there.

The first step is to generate sequences of random integers and their sum, as before, but with no normalization. We can put this in a function named random_sum_pairs(), as follows.

Running just this function prints a single example of adding two random integers between 1 and 10.

The next step is to convert the integers to strings. The input string will have the format ’99+99′ and the output string will have the format ’99’.

Key to this function is the padding of numbers to ensure that each input and output sequence has the same number of characters. A padding character should be different from the data so the model can learn to ignore them. In this case, we use the space character for padding(‘ ‘) and pad the string on the left, keeping the information on the far right.

There are other ways to pad, such as padding each term individually. Try it and see if it results in better performance. Report your results in the comments below.

Padding requires we know how long the longest sequence may be. We can calculate this easily by taking the log10() of the largest integer we can generate and the ceiling of that number to get an idea of how many chars are needed for each number. We add 1 to the largest number to ensure we expect 3 chars instead of 2 chars for the case of a round largest number, like 200. We then need to add the right number of plus symbols.

A similar process is repeated on the output sequence, without the plus symbols of course.

The example below adds the to_string() function and demonstrates its usage with a single input/output pair.

Running this example first prints the integer sequence and the padded string representation of the same sequence.

Next, we need to encode each character in the string as an integer value. We have to work with numbers in neural networks after all, not characters.

Integer encoding transforms the problem into a classification problem, where the output sequence may be considered class outputs with 11 possible values each. This just so happens to be integers with some ordinal relationship (first 10 class values).

To perform this encoding, we must define the full alphabet of symbols that may appear in the string encoding, as follows:

Integer encoding then becomes a simple process of building a lookup table of character to integer offset and converting each char of each string, one by one.

The example below provides the integer_encode() function for integer encoding and demonstrates how to use it.

Running the example prints the integer encoded version of each string encoded pattern.

We can see that the space character (‘ ‘) was encoded with 11 and the three character (‘3’) was encoded as 3, and so on.

The next step is to binary encode the integer encoding sequences.

This involves converting each integer to a binary vector with the same length as the alphabet and marking the specific integer with a 1.

For example, a 0 integer represents the ‘0’ character and would be encoded as a binary vector with a 1 in the 0th position of an 11 element vector: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].

The example below defines the one_hot_encode() function for binary encoding and demonstrates how to use it.

Running the example prints the binary encoded sequence for each integer encoding.

I’ve added some new lines to make the input and output binary encodings clearer.

You can see that a single sum pattern becomes a sequence of 5 binary encoded vectors, each with 11 elements. The output or sum becomes a sequence of 2 binary encoded vectors, again each with 11 elements.

We can tie all of these steps together into a function called generate_data(), listed below.

Finally, we need to invert the encoding to convert the output vectors back into numbers so we can compare expected output integers to predicted integers.

The invert() function below performs this operation. Key is first converting the binary encoding back into an integer using the argmax() function, then converting the integer back into a character using a reverse mapping of the integers to chars from the alphabet.

We now have everything we need to prepare data for this example.

Note, these functions were written for this post and I did not write any unit tests nor battle test them with all kinds of inputs. If you see or find an obvious bug, please let me know in the comments below.

Configure and Fit a seq2seq LSTM Model

We can now fit an LSTM model to this problem.

We can think of the model as being comprised of two key parts: the encoder and the decoder.

First, the input sequence is shown to the network one encoded character at a time. We need an encoding level to learn the relationship between the steps in the input sequence and develop an internal representation of these relationships.

The input to the network (for the two number case) is a series of 5 encoded characters (2 for each integer and one for the ‘+’) where each vector contains 11 features for the 11 possible characters that each item in the sequence may be.

The encoder will use a single LSTM hidden layer with 100 units.

The decoder must transform the learned internal representation of the input sequence into the correct output sequence. For this, we will use a hidden layer LSTM with 50 units, followed by an output layer.

The problem is defined as requiring two binary output vectors for the two output characters. We will use the same fully connected layer (Dense) to output each binary vector. To use the same layer twice, we will wrap it in a TimeDistributed() wrapper layer.

The output fully connected layer will use a softmax activation function to output values in the range [0,1].

There’s a problem though.

We must connect the encoder to the decoder and they do not fit.

That is, the encoder will produce a 2-dimensional matrix of 100 outputs for each input in the sequence of 5 vectors. The decoder is an LSTM layer that expects a 3D input of [samples, timesteps, features] in order to produce a decoded sequence of 1 sample with 2 timesteps each with 11 features.

If you try to force these pieces together, you get an error like:

Exactly as we would expect.

We can solve this using a RepeatVector layer. This layer simply repeats the provided 2D input n-times to create a 3D output.

The RepeatVector layer can be used like an adapter to fit the encoder and decoder parts of the network together. We can configure the RepeatVector to repeat the input 2 times. This creates a 3D output comprised of two copies of the sequence output from the encoder, that we can decode two times using the same fully connected layer for each of the two desired output vectors.

The problem is framed as a classification problem with 11 classes, therefore we can optimize the log loss (categorical_crossentropy) function and even track accuracy as well as loss on each training epoch.

Putting this together, we have:

Why Use a RepeatVector Layer?

Why not return the sequence output from the encoder as input for the decoder?

That is, one output for each LSTM at each input sequence time step rather than one output for each LSTM for the whole input sequence.

An output for each step of the input sequence gives the decoder access to the intermediate representation of the input sequence each step. This may or may not be useful. Providing the final LSTM output at the end of the input sequence may be more logical as it captures information about the entire input sequence, ready to map to or calculate an output.

Also, this leaves nothing in the network to specify the size of the decoder other than the input, giving one output value for each timestep of the input sequence (5 instead of 2).

You could reframe the output to be a sequence of 5 characters padded with whitespace. The network would be doing more work than is required and may lose some of the compression type capability provided by the encoder-decoder paradigm. Try it and see.

The issue titled “is the Sequence to Sequence learning right?” on the Keras GitHub project provides some good discussions of alternate representations you could play with.

Evaluate LSTM Model

As before, we can generate a new batch of examples and evaluate the algorithm after it has been fit.

We could calculate an RMSE score on the prediction, although I have left it out for simplicity here.

Complete Example

Putting it all together, the complete example is listed below.

Running the example nearly perfectly fits the problem. In fact, running for more epochs or increasing weight updates to every epoch (batch_size=1) will get you there, but will take 10 times longer to train.

We can see that the predicted outcome matches the expected outcome on the first 20 examples we look at.


This section lists some natural extensions to this tutorial that you may wish to explore.

  • Integer Encoding. Explore whether the problem can learn the problem better using an integer encoding alone. The ordinal relationship between most of the inputs may prove very useful.
  • Variable Numbers. Change the example to support a variable number of terms on each input sequence. This should be straightforward as long as you perform sufficient padding.
  • Variable Mathematical Operations. Change the example to vary the mathematical operation to allow the network to generalize even further.
  • Brackets. Allow the use of brackets along with other mathematical operations.

Did you try any of these extensions?
Share your findings in the comments; I’d love to see what you found.

Further Reading

This section lists some resources for further reading and other related examples you may find useful.


Code and Posts


In this tutorial, you discovered how to develop an LSTM network to learn how to add random integers together using the seq2seq stacked LSTM paradigm.

Specifically, you learned:

  • How to learn the naive mapping function of input terms to output terms for addition.
  • How to frame the addition problem (and similar problems) and suitably encode inputs and outputs.
  • How to address the true sequence-prediction addition problem using the seq2seq paradigm.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.