22 Best OCR Datasets for Machine Learning

Many open-source datasets are available for developing text recognition applications. Twenty-two of the best are listed below.

NIST Database

The National Institute of Standards and Technology (NIST) offers a free-to-use collection of more than 3,600 handwriting samples with over 810,000 character images.

MNIST Database

Derived from NIST’s Special Database 1 and Special Database 3, the MNIST database is a compiled collection of handwritten digits, with 60,000 examples in the training set and 10,000 examples in the test set. This open-source database helps train models to recognize patterns while spending less time on pre-processing.
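As a rough illustration of how little pre-processing is involved, the sketch below parses the IDX binary layout used by the original MNIST distribution: a four-byte header (two zero bytes, a type code, and the number of dimensions), followed by big-endian 32-bit dimension sizes and the raw bytes. The function name `parse_idx` is hypothetical, and the sketch assumes unsigned-byte data, which is what the MNIST image and label files use.

```python
import struct


def parse_idx(data: bytes):
    """Parse an IDX-format byte string into (shape, payload bytes).

    Hypothetical helper; handles only unsigned-byte data (type code 0x08).
    """
    # Header: two zero bytes, a one-byte type code, a one-byte dimension count.
    zero, dtype, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or dtype != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # Each dimension size is a big-endian 32-bit unsigned integer.
    shape = struct.unpack(f">{ndim}I", data[4:4 + 4 * ndim])
    expected = 1
    for dim in shape:
        expected *= dim
    payload = data[4 + 4 * ndim:]
    if len(payload) != expected:
        raise ValueError("payload size does not match header")
    return shape, payload


# Usage (assumes a locally downloaded, gzip-compressed MNIST file):
# import gzip
# with gzip.open("train-images-idx3-ubyte.gz", "rb") as f:
#     shape, pixels = parse_idx(f.read())
```

The payload is returned as raw bytes so that callers can reshape it with whichever array library they prefer.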

Text Detection

An open-source database, the Text Detection dataset contains about 500 indoor and outdoor images of signboards, door plates, caution plates, and more.

Stanford OCR

Published by Stanford, this free-to-use dataset is a collection of handwritten words gathered by the MIT Spoken Language Systems Group.

Street View Text

Gathered from Google Street View images, this dataset contains images for text detection, mainly of boards and street-level signs.

Document Database

The Document Database is a collection of 941 handwritten documents, including tables, formulas, drawings, diagrams, lists, and more, from 189 writers.

Mathematics Expressions

Mathematics Expressions is a database that contains 101 mathematical symbols and 10,000 expressions.

Street View House Numbers

Harvested from Google Street View, the Street View House Numbers database contains 73,257 house-number digits.

Natural Environment OCR

Natural Environment OCR is a dataset of nearly 660 images from around the world with 5,238 text annotations.

Mathematics Expressions

Over 10,000 expressions with 101+ math symbols.

Handwritten Chinese Characters

A dataset of 909,818 handwritten Chinese character images, equivalent to about 10 news articles.

Arabic Printed Text

A lexicon of 113,284 words using 10 Arabic fonts.

Handwritten English text

Handwritten English text on a whiteboard with over 1700 entries.

3000 environments Images

3000 images from various environments, including outdoor and indoor scenes under different lighting.

Chars74K Data

About 74,000 images of English and Kannada characters, including digits.

IAM (IAM Handwriting)

The IAM database has 13,353 images of handwritten text lines produced by 657 writers, who transcribed text from the Lancaster-Oslo/Bergen Corpus of British English.

FUNSD (Form Understanding in Noisy Scanned Documents)

FUNSD includes 199 annotated, scanned forms whose varied and noisy appearance makes form understanding challenging.

Text OCR

TextOCR benchmarks text recognition on arbitrarily shaped scene text in natural images.

Twitter 100k

Twitter100k is a large dataset for weakly supervised cross-media retrieval.

SSIG-SegPlate – License Plate Character Segmentation (LPCS)

This dataset evaluates License Plate Character Segmentation (LPCS) with 101 daytime vehicle images.

105,941 Images Natural Scenes OCR Data of 12 Languages

The data includes 12 languages (6 Asian, 6 European) and various natural scenes and angles. It features line-level bounding boxes and text transcriptions. It is useful for multi-language OCR tasks.

Indian Signboard Image Dataset

The dataset has Indian traffic sign images for classification and detection, taken in various weather conditions during day, evening, and night.

These are some of the top open-source datasets for training ML models for text detection applications. Selecting the one that aligns with your business and application needs can take time and effort, so experiment with several of these datasets before deciding on the appropriate one.
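One way to make such experiments comparable is to score each trained model on held-out samples with a common metric. The sketch below computes character error rate (CER), a widely used OCR metric, as total Levenshtein edit distance divided by total reference length. The function names and the sample strings are illustrative assumptions, not part of any dataset listed here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(predictions, references):
    """Total edit distance divided by total number of reference characters."""
    total_edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


# Illustrative usage with made-up model outputs and ground-truth labels.
cer = character_error_rate(["STOP", "EX1T"], ["STOP", "EXIT"])
print(f"CER: {cer:.3f}")
```

A lower CER on samples that resemble your production images is a reasonable signal that a dataset suits the application.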

Shaip, a technology solutions provider, can help you progress toward a reliable and efficient text detection application. We leverage our tech experience to create customizable, optimized, and efficient OCR training datasets for various client projects. To fully understand our capabilities, get in touch with us today.
