Datasets for Natural Language Processing

You need datasets to practice on when getting started with deep learning for natural language processing tasks.

It is better to use small datasets that you can download quickly and that do not take too long to fit models on. It also helps to use standard datasets that are well understood and widely used, so that you can compare your results and see whether you are making progress.

In this post, you will discover a suite of standard datasets for natural language processing tasks that you can use when getting started with deep learning.

Overview

This post is divided into 7 parts; they are:

  1. Text Classification
  2. Language Modeling
  3. Image Captioning
  4. Machine Translation
  5. Question Answering
  6. Speech Recognition
  7. Document Summarization

I have tried to provide a mixture of datasets that are popular in academic papers and modest in size.

Almost all datasets are freely available for download today.

If your favorite dataset is not listed or you think you know of a better dataset that should be listed, please let me know in the comments below.

Let’s get started.

1. Text Classification

Text classification refers to labeling sentences or documents, such as email spam classification and sentiment analysis.
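
To make the idea of labeling documents concrete, here is a minimal sketch of a spam classifier. It is not taken from any of the datasets in this post: the four training messages are made up for illustration, and the function names `train_naive_bayes` and `predict` are my own. The technique is a simple Naive Bayes model with add-one smoothing, written with the Python standard library only.

```python
import math
from collections import Counter

def train_naive_bayes(examples):
    """Collect per-label document and word counts from (text, label) pairs."""
    label_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        # Start from the log prior of the label.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for this example only.
training_data = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch with the team tomorrow", "ham"),
]
model = train_naive_bayes(training_data)
print(predict(model, "claim your free prize"))
print(predict(model, "team meeting tomorrow"))
```

A real dataset would replace `training_data` with thousands of labeled examples, and you would hold some of them back to measure accuracy.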

Below are some good beginner text classification datasets.

For more, see the post:

2. Language Modeling

Language modeling involves developing a statistical model for predicting the next word in a sentence or the next letter in a word, given whatever has come before. It is a precursor task for tasks like speech recognition and machine translation.
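
As a rough illustration of predicting the next word from what came before, the sketch below builds a bigram model: it only looks at the single previous word and estimates probabilities from raw counts. The toy corpus and the function names `train_bigram_model` and `next_word_probabilities` are assumptions made for this example; a practical model would use far more text and a longer context.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count how often each word follows each other word."""
    words = text.lower().split()
    follow_counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1
    return follow_counts

def next_word_probabilities(model, word):
    """Estimate P(next word | word) by relative frequency."""
    counts = model.get(word.lower())
    if not counts:
        return {}  # the word was never seen with a following word
    total = sum(counts.values())
    return {nxt: count / total for nxt, count in counts.items()}

# Toy corpus, invented for this example only.
corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigram_model(corpus)
probs = next_word_probabilities(model, "the")
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(word, p)
```

In this corpus "the" is followed by "cat" twice and by "mat" and "fish" once each, so the model assigns "cat" half of the probability mass.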

Below are some good beginner language modeling datasets.

  • Project Gutenberg, a large collection of free books that can be retrieved in plain text for a variety of languages.

There are more formal corpora that are well studied; for example:

3. Image Captioning

Image captioning is the task of generating a textual description for a given image.

Below are some good beginner image captioning datasets.

  • Common Objects in Context (COCO). A collection of more than 120 thousand images with descriptions.
  • Flickr 8K. A collection of 8 thousand described images taken from flickr.com.
  • Flickr 30K. A collection of 30 thousand described images taken from flickr.com.

For more, see the post:

4. Machine Translation

Machine translation is the task of translating text from one language to another.

Below are some good beginner machine translation datasets.

Many standard datasets are used for the annual machine translation challenges; see:

5. Question Answering

Question answering is a task in which a sentence or passage of text is provided, and questions about it must be answered.

Below are some good beginner question answering datasets.

For more, see the post:

6. Speech Recognition

Speech recognition is the task of transforming audio of a spoken language into human-readable text.

Below are some good beginner speech recognition datasets.

Do you know of some more good automatic speech recognition datasets?
Let me know in the comments.

7. Document Summarization

Document summarization is the task of creating a short meaningful description of a larger document.
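
To give a feel for the task, here is a very naive extractive baseline: it scores each sentence by the average frequency of its words in the document and keeps the top-scoring sentences in their original order. This is a sketch under my own assumptions (the `summarize` function, the regular-expression sentence splitting and the sample text are all invented here), not how modern summarization models work. It also favors common words, so a real version would at least filter stop words.

```python
import re
from collections import Counter

def summarize(document, num_sentences=1):
    """Pick the sentences whose words are most frequent in the document."""
    # Split on whitespace that follows sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+", document)
    sentences = [s.strip() for s in parts if s.strip()]
    word_freq = Counter(re.findall(r"[a-z']+", document.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        if not words:
            return 0.0
        # Average frequency, so long sentences are not favored automatically.
        return sum(word_freq[w] for w in words) / len(words)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)

# Sample text, invented for this example only.
document = ("Deep learning models need data. "
            "Small datasets download quickly. "
            "Standard datasets make results easy to compare.")
summary = summarize(document, num_sentences=1)
print(summary)
```

The datasets for this task pair each document with a human-written summary, which lets you compare a baseline like this against a reference.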

Below are some good beginner document summarization datasets.

For more, see:

Further Reading

This section provides additional lists of datasets if you are looking to go deeper.

Do you know of any other good lists of natural language processing datasets?
Let me know in the comments below.

Summary

In this post, you discovered a suite of standard datasets that you can use for natural language processing tasks when getting started with deep learning.

Did you pick a dataset? Are you using one of the above datasets?
Let me know in the comments below.

About Jason Brownlee

Dr. Jason Brownlee is a husband, proud father, academic researcher, author, professional developer and a machine learning practitioner. He is dedicated to helping developers get started and get good at applied machine learning.