Many open-source datasets are available for developing text recognition applications. Here are 22 of the best:
NIST Database
The National Institute of Standards and Technology (NIST) offers a free-to-use collection of over 3,600 handwriting samples with more than 810,000 character images.
MNIST Database
Derived from NIST’s Special Database 1 and Special Database 3, the MNIST database is a compiled collection of 60,000 handwritten digits for the training set and 10,000 for the test set. This open-source database helps train models to recognize patterns while spending less time on pre-processing.
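MNIST is commonly distributed as binary IDX files. The sketch below is one way to read such an image file, assuming the usual layout: a 16-byte big-endian header (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel. The function names `parse_idx_images` and `make_fake_idx` are hypothetical helpers introduced for illustration, and the sample data is synthetic rather than real MNIST.

```python
import struct

IDX_IMAGE_MAGIC = 2051  # 0x00000803, the magic number assumed for IDX image files


def parse_idx_images(buf: bytes):
    """Parse an IDX image buffer into (images, rows, cols).

    Each image is returned as a flat, row-major list of pixel values (0-255).
    """
    if len(buf) < 16:
        raise ValueError("buffer too short to hold an IDX image header")
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", buf[:16])
    if magic != IDX_IMAGE_MAGIC:
        raise ValueError(f"unexpected magic number {magic}")
    size = rows * cols
    if len(buf) != 16 + count * size:
        raise ValueError("buffer length does not match the header")
    images = []
    for i in range(count):
        start = 16 + i * size
        images.append(list(buf[start:start + size]))
    return images, rows, cols


def make_fake_idx(count: int, rows: int, cols: int) -> bytes:
    """Build a small synthetic IDX image buffer for testing (not real MNIST data)."""
    header = struct.pack(">IIII", IDX_IMAGE_MAGIC, count, rows, cols)
    pixels = bytes(i % 256 for i in range(count * rows * cols))
    return header + pixels


if __name__ == "__main__":
    images, rows, cols = parse_idx_images(make_fake_idx(3, 2, 2))
    print(len(images), rows, cols)
```

For the real training file, the same header would be expected to report 60,000 images; reading it from disk (and decompressing it first, if it is gzipped) is left out of this sketch.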
Text Detection
An open-source database, the Text Detection dataset contains about 500 indoor and outdoor images of signboards, door plates, caution plates, and more.
Stanford OCR
Published by Stanford, this free-to-use dataset is a collection of handwritten words gathered by the MIT Spoken Language Systems Group.
Street View Text
Gathered from Google Street View images, this dataset contains images for text detection, mainly of signboards and street-level signs.
Document Database
The Document Database is a collection of 941 handwritten documents, including tables, formulas, drawings, diagrams, lists, and more, from 189 writers.
Mathematics Expressions
The Mathematics Expressions database contains 101 mathematical symbols and 10,000 expressions.
Street View House Numbers
Harvested from Google Street View, the Street View House Numbers database contains 73,257 house-number digits.
Natural Environment OCR
The Natural Environment OCR dataset contains nearly 660 images from around the world and 5,238 text annotations.
Mathematics Expressions
Over 10,000 expressions with 101+ math symbols.
Handwritten Chinese Characters
A dataset of 909,818 handwritten Chinese character images, equivalent to about 10 news articles.
Arabic Printed Text
A lexicon of 113,284 words using 10 Arabic fonts.
Handwritten English Text
Handwritten English text on a whiteboard, with over 1,700 entries.
3000 environments Images
3000 images from various environments, including outdoor and indoor scenes under different lighting.
Chars74K Data
About 74,000 images of English and Kannada characters, including digits.
IAM (IAM Handwriting)
The IAM database has 13,353 handwritten text images produced by 657 writers, who transcribed text from the Lancaster-Oslo/Bergen Corpus of British English.
FUNSD (Form Understanding in Noisy Scanned Documents)
FUNSD includes 199 annotated, scanned forms with varied and noisy appearances, challenging for form understanding.
TextOCR
TextOCR benchmarks text recognition on arbitrarily shaped scene text in natural images.
Twitter100k
Twitter100k is a large dataset for weakly supervised cross-media retrieval.
SSIG-SegPlate – License Plate Character Segmentation (LPCS)
This dataset is intended for evaluating License Plate Character Segmentation (LPCS) and contains 101 daytime vehicle images.
105,941 Images Natural Scenes OCR Data of 12 Languages
The data includes 12 languages (6 Asian, 6 European) and various natural scenes and angles. It features line-level bounding boxes and text transcriptions. It is useful for multi-language OCR tasks.
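To make "line-level bounding boxes and text transcriptions" concrete, here is a minimal sketch of how one such annotation might be represented and sanity-checked before training. The record layout (`image_id`, `language`, `box`, `text`), the class name `LineAnnotation`, and the sample values are all assumptions for illustration; the dataset's actual schema may differ.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class LineAnnotation:
    """Hypothetical line-level record: one text line in one image."""
    image_id: str
    language: str                    # e.g. an ISO 639-1 code such as "en"
    box: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels
    text: str                        # transcription of the line

    def is_valid(self) -> bool:
        """A usable record has a non-degenerate box and a non-empty transcription."""
        x_min, y_min, x_max, y_max = self.box
        return x_max > x_min and y_max > y_min and bool(self.text.strip())


def lines_per_language(annotations: Iterable[LineAnnotation]) -> Counter:
    """Count valid text lines per language, to check how balanced the data is."""
    return Counter(a.language for a in annotations if a.is_valid())


# Made-up sample records, not taken from the dataset.
samples = [
    LineAnnotation("img_001.jpg", "ja", (10, 20, 210, 60), "出口"),
    LineAnnotation("img_001.jpg", "en", (10, 70, 180, 100), "EXIT"),
    LineAnnotation("img_002.jpg", "en", (50, 50, 40, 90), "bad box"),  # x_max < x_min
]

if __name__ == "__main__":
    print(lines_per_language(samples))
```

A per-language count like this is a quick way to spot under-represented languages before committing to a multi-language OCR training run.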
Indian Signboard Image Dataset
The dataset has Indian traffic sign images for classification and detection, taken in various weather conditions during day, evening, and night.
These are some of the top open-source datasets for training ML models for text detection applications. Selecting the one that aligns with your business and application needs can take time and effort, so it is worth experimenting with several of these datasets before settling on one.
Shaip, a technology solutions provider, can help you progress toward a reliable and efficient text detection application. We leverage our tech experience to create customizable, optimized, and efficient OCR training datasets for various client projects. To fully understand our capabilities, get in touch with us today.