The Wiki QA Corpus (Link)
Created to support open-domain question answering research, the Wiki QA Corpus is one of the most extensive publicly available datasets of its kind. Compiled from Bing search engine query logs, it consists of question-and-answer pairs: more than 3,000 questions and 1,500 labeled answer sentences.
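If you work in Python, the corpus can be pulled from the Hugging Face Hub with the datasets library. A minimal sketch, assuming the dataset id wiki_qa and the standard question/answer/label fields:

```python
from datasets import load_dataset  # pip install datasets

# The dataset id "wiki_qa" is an assumption; check the Hub if the name changed.
wiki_qa = load_dataset("wiki_qa", split="train")

# Each record pairs a question with one candidate answer sentence
# and a 0/1 label marking whether the sentence answers the question.
print(wiki_qa[0])
```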
Legal Case Reports Dataset (Link)
The Legal Case Reports dataset is a collection of 4,000 legal cases and can be used to train models for automatic text summarization and citation analysis. Each document includes catchphrases, citation classes, citation catchphrases, and more.
Jeopardy (Link)
The Jeopardy dataset is a collection of more than 200,000 questions featured on the popular quiz TV show, compiled by a Reddit user. Each data point is labeled with its air date, episode number, value, round, and question/answer.
20 Newsgroups (Link)
This collection of roughly 20,000 documents spans 20 newsgroups, covering topics from religion to popular sports.
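20 Newsgroups ships with scikit-learn, so loading and vectorizing it takes only a few lines; a minimal sketch:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Download the training split (cached locally after the first call);
# stripping headers/footers/quotes avoids overfitting to metadata.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
print(len(train.data), "documents across", len(train.target_names), "newsgroups")

# Turn the raw text into TF-IDF features for a downstream classifier.
X = TfidfVectorizer(max_features=20_000).fit_transform(train.data)
print(X.shape)
```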
Reuters News Dataset (Link)
First appearing on the Reuters newswire in 1987, the documents in this dataset have been labeled, indexed, and compiled for machine learning purposes.
ArXiv (Link)
This substantial 270 GB dataset includes the complete text of all arXiv research papers.
European Parliament Proceedings Parallel Corpus (Link)
This parallel corpus contains sentence pairs drawn from European Parliament proceedings in 21 European languages, including several that are underrepresented in machine learning corpora.
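The corpus is distributed as aligned plain-text files, one sentence per line, so a language pair can be read with a simple generator. A sketch, with the europarl-v7 file names taken from the release convention (adjust for the pair you download):

```python
# Read an aligned Europarl language pair, one sentence per line.
# File names follow the europarl-v7.<pair>.<lang> convention of the
# official release; the de-en pair here is just an example.
def load_parallel(src_path: str, tgt_path: str):
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip(src, tgt):
            yield s.strip(), t.strip()

pairs = load_parallel("europarl-v7.de-en.en", "europarl-v7.de-en.de")
print(next(pairs))  # (English sentence, its German counterpart)
```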
Billion Word Benchmark (Link)
Derived from the WMT 2011 News Crawl, this language modeling dataset comprises nearly one billion words for testing innovative language modeling techniques.
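To give a feel for how the benchmark is used, here is a toy baseline: train a unigram model with add-one smoothing on a training shard and report perplexity on a held-out shard. The shard paths below are assumptions based on the release layout; substitute whatever you downloaded:

```python
import math
from collections import Counter

def unigram_perplexity(train_path: str, heldout_path: str) -> float:
    # Count word frequencies on the training text.
    counts, total = Counter(), 0
    with open(train_path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            counts.update(tokens)
            total += len(tokens)

    # Score held-out text with add-one (Laplace) smoothing.
    vocab = len(counts) + 1  # +1 bucket for unseen words
    log_prob, n = 0.0, 0
    with open(heldout_path, encoding="utf-8") as f:
        for line in f:
            for token in line.split():
                log_prob += math.log((counts[token] + 1) / (total + vocab))
                n += 1
    return math.exp(-log_prob / n)

# Shard names follow the benchmark's release layout (assumption).
print(unigram_perplexity(
    "training-monolingual.tokenized.shuffled/news.en-00001-of-00100",
    "heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050",
))
```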
Spoken Wikipedia Corpora (Link)
This dataset is ideal for anyone looking to go beyond the English language: it is a collection of Wikipedia articles read aloud in English, German, and Dutch, with a diverse range of topics and speakers and hundreds of hours of audio.
2000 HUB5 English (Link)
The 2000 HUB5 English dataset contains transcripts of 40 English telephone conversations. The data is provided by the National Institute of Standards and Technology, and its main focus is on recognizing conversational speech and converting speech to text.
LibriSpeech (Link)
The LibriSpeech dataset is a collection of almost 1,000 hours of English speech derived from audiobooks and segmented by chapter, making it a useful resource for natural language processing and speech recognition.
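torchaudio bundles a LibriSpeech loader, so a subset can be fetched and inspected in a few lines; a minimal sketch using the 100-hour "clean" training subset:

```python
import torchaudio  # pip install torchaudio

# Download the 100-hour clean training subset into ./data
# (a multi-gigabyte download on first run).
dataset = torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100", download=True)

# Each item pairs a waveform with its transcript and speaker metadata.
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(sample_rate, transcript)
```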
Free Spoken Digit Dataset (Link)
This NLP dataset includes more than 1,500 recordings of spoken digits in English.
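The recordings are small 8 kHz WAV files whose names encode the label, so features can be extracted with librosa. A sketch, assuming the repository's recordings/ layout and its <digit>_<speaker>_<index>.wav naming:

```python
import librosa  # pip install librosa

# Path and file naming follow the FSDD repository layout (assumption):
# recordings/<digit>_<speaker>_<index>.wav, sampled at 8 kHz.
path = "free-spoken-digit-dataset/recordings/7_jackson_0.wav"

y, sr = librosa.load(path, sr=8000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # features for a digit classifier

label = int(path.rsplit("/", 1)[-1].split("_")[0])  # digit label from the file name
print(label, mfcc.shape)
```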
M-AI Labs Speech Dataset (Link)
The dataset offers nearly 1,000 hours of audio with transcriptions, encompassing multiple languages and categorized by male, female, and mixed voices.
Noisy Speech Database (Link)
This dataset features parallel noisy and clean speech recordings, intended for speech enhancement software development but also beneficial for training on speech in challenging conditions.
Yelp Reviews (Link)
The Yelp dataset is a vast collection of about 8.5 million reviews of more than 160,000 businesses, along with the associated user data. The reviews can be used to train your models for sentiment analysis. The dataset also includes more than 200,000 pictures covering eight metropolitan areas.
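As a sentiment-analysis starting point, the review file (one JSON object per line) can be streamed into a simple scikit-learn pipeline. A sketch, assuming the yelp_academic_dataset_review.json file name from the open-dataset release and a crude stars-to-label rule:

```python
import json
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stream a 100k-review sample rather than loading all ~8.5M into memory.
rows = []
with open("yelp_academic_dataset_review.json", encoding="utf-8") as f:
    for _, line in zip(range(100_000), f):
        review = json.loads(line)
        rows.append((review["text"], 1 if review["stars"] >= 4 else 0))  # crude label

df = pd.DataFrame(rows, columns=["text", "positive"])
X_train, X_test, y_train, y_test = train_test_split(df["text"], df["positive"], random_state=0)

model = make_pipeline(TfidfVectorizer(max_features=50_000), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```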
IMDB Reviews (Link)
IMDB Reviews is among the most popular datasets, containing cast information, ratings, descriptions, and genres for more than 50,000 movies. It can be used to train and test your machine learning models.
Amazon Reviews and Ratings Dataset (Link)
The Amazon reviews and ratings dataset contains a valuable collection of metadata and reviews of different products on Amazon, collected from 1996 to 2014 – about 142.8 million records. The metadata includes the price, product description, brand, category, and more, while the reviews include ratings, review text, helpfulness votes, and more.
Before we go, we will leave you with a pro tip.
Make sure to thoroughly go through the README file before picking an NLP dataset for your needs. It will contain all the information you might require, such as the dataset’s content, the parameters on which the data has been categorized, and the dataset’s probable use cases.
Regardless of the models you build, there is an exciting prospect of integrating machines more closely and intrinsically with our lives. With NLP, the possibilities for business, movies, speech recognition, finance, and more increase manifold.