8 Popular Free Public Data Sets for Machine Learning and Deep Learning

The key to mastering machine learning is practice. Only through hands-on experience will you come to understand the true capabilities of machine learning and its limitations. I have curated this list of publicly downloadable, high-quality data sets so that you can learn and hone your skills.

The 8 data sets below fall into 4 main categories: image, audio, signal, and natural language processing. Each of these real-world data sets has its own nuances and calls for its own approach to artificial intelligence and machine learning experimentation.

Image

MNIST is a popular handwritten digit data set that contains a training set of 60,000 images and a test set of 10,000 images. It’s a great database for testing your machine learning and pattern recognition algorithms.
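To make the file layout concrete, here is a minimal sketch of reading MNIST data using only the Python standard library. It assumes you have the files in the original IDX binary format (a 4-byte header of two zero bytes, a type code, and a dimension count, followed by big-endian dimension sizes and raw pixel bytes). The function names parse_idx, load_images, and load_labels are hypothetical helpers made up for this illustration, not part of any MNIST library.

```python
import struct


def parse_idx(data: bytes):
    """Parse an IDX-format byte string into (dims, payload)."""
    # Header: two zero bytes, a data type code, and the number of dimensions.
    zero1, zero2, type_code, ndim = struct.unpack(">BBBB", data[:4])
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX file: bad magic number")
    if type_code != 0x08:
        raise ValueError("this sketch only handles unsigned-byte data")
    # Each dimension size is a 4-byte big-endian unsigned integer.
    dims = struct.unpack(">" + "I" * ndim, data[4:4 + 4 * ndim])
    payload = data[4 + 4 * ndim:]
    expected = 1
    for d in dims:
        expected *= d
    if len(payload) != expected:
        raise ValueError("payload size does not match header")
    return dims, payload


def load_images(data: bytes):
    """Return a list of images, each a list of rows of pixel values (0-255)."""
    (count, rows, cols), payload = parse_idx(data)
    size = rows * cols
    images = []
    for i in range(count):
        flat = payload[i * size:(i + 1) * size]
        images.append([list(flat[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images


def load_labels(data: bytes):
    """Return the labels as a list of integers."""
    (count,), payload = parse_idx(data)
    return list(payload)
```

If your copy of the files is gzip-compressed, you can read the raw bytes with gzip.open(path, "rb").read() and pass them to load_images or load_labels.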

Fashion MNIST contains a training set of 60,000 product images and a test set of 10,000 product images. It is an alternative to the MNIST database, as some believe the MNIST handwritten digit data has been overused. The images are in grayscale and cover 10 product categories, e.g. Shirt, Dress, Sneaker, Bag, and Ankle boot.

SVHN (Street View House Numbers) is a real-world image data set for developing digit recognition algorithms. The data was obtained from house numbers in Google Street View images. It is a significantly harder, real-world problem than MNIST, whose digits are clean grayscale images on a plain background.

Audio and Signal

FSDD (Free Spoken Digit Dataset) is another MNIST-inspired data set. It is an English speech data set that consists of recordings of spoken digits stored as WAV files sampled at 8 kHz. It was created for the task of identifying spoken digits in audio samples. Currently it has 2,000 recordings from 4 speakers.
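As a rough illustration of working with these recordings, the sketch below reads one WAV file with Python's built-in wave module and reports its label, sample rate, and duration. It assumes the files follow a {digit}_{speaker}_{index}.wav naming pattern, which you should check against your copy of the data. The function name describe_recording is a hypothetical helper made up for this example.

```python
import wave
from pathlib import Path


def describe_recording(path):
    """Return label, speaker, sample rate, and duration for one recording."""
    path = Path(path)
    # Assumed naming pattern: {digit}_{speaker}_{index}.wav
    digit, speaker, index = path.stem.split("_")
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()   # samples per second, 8000 is expected
        frames = wav.getnframes()   # total number of audio frames
    return {
        "digit": int(digit),
        "speaker": speaker,
        "index": int(index),
        "sample_rate": rate,
        "duration_s": frames / rate,
    }
```

Checking the sample rate before training is a cheap way to catch files that were resampled or converted along the way.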

ECG Heartbeat Categorization Dataset is a large collection of preprocessed clinical records derived from two well-known datasets in heartbeat classification, the MIT-BIH Arrhythmia Dataset and the PTB Diagnostic ECG Database. The Arrhythmia Dataset has 109,446 samples, while the PTB Diagnostic ECG Database has 14,552 samples. Both are sampled at 125 Hz.
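The sketch below shows one way to split such records into signals and labels and to convert a sample count into seconds using the 125 Hz rate. It assumes the data is stored as headerless CSV text in which the last column of each row is the class label and the preceding columns are the signal samples. Please verify that layout against the files you download. The names read_beats and duration_seconds are hypothetical helpers made up for this example.

```python
import csv
import io

SAMPLE_RATE_HZ = 125  # sampling rate stated in the dataset description


def read_beats(csv_text):
    """Yield (signal, label) pairs from headerless CSV text.

    Assumption: the last column is the class label and every other
    column is one signal sample.
    """
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        values = [float(v) for v in row]
        yield values[:-1], int(values[-1])


def duration_seconds(signal, sample_rate_hz=SAMPLE_RATE_HZ):
    """Length of a signal in seconds at the given sampling rate."""
    return len(signal) / sample_rate_hz
```

For example, a signal of 125 samples covers exactly one second at this rate.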

Natural Language Processing

Sentiment140 is a dataset for sentiment analysis. It is great for NLP and emotion analysis projects. The dataset has the following 6 features: the polarity of the tweet (negative, neutral, or positive), the ID of the tweet, the date, the query, the user, and the text of the tweet.
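The six features map naturally onto a CSV reader. The sketch below assumes the file has no header row, that the columns appear in the order listed above, and that polarity is encoded as 0 (negative), 2 (neutral), and 4 (positive). Check these assumptions against your copy of the data. The names read_tweets and count_sentiments are hypothetical helpers made up for this example.

```python
import csv
from collections import Counter

# Assumed column order, taken from the feature list above.
COLUMNS = ["polarity", "id", "date", "query", "user", "text"]
# Assumed numeric encoding of the polarity column.
POLARITY_NAMES = {"0": "negative", "2": "neutral", "4": "positive"}


def read_tweets(lines):
    """Yield one dict per well-formed row from an iterable of CSV lines."""
    for row in csv.reader(lines):
        if len(row) != len(COLUMNS):
            continue  # skip malformed rows
        record = dict(zip(COLUMNS, row))
        record["sentiment"] = POLARITY_NAMES.get(record["polarity"], "unknown")
        yield record


def count_sentiments(lines):
    """Count how many tweets fall into each sentiment class."""
    return Counter(record["sentiment"] for record in read_tweets(lines))
```

To use it on a real file, pass an open file object, for example open(path, newline="", encoding="latin-1"). The encoding is an assumption here and you may need to adjust it.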

Tagged and Cleaned Wikipedia Corpus is a large data set for NLP researchers to work with. In total it contains about 1.9 billion words from more than 4 million articles.

Machine Translation is a dataset used for developing machine translation systems.
The following language pairs are available for translation:
English-Chinese
English-Czech
English-Estonian
English-Finnish
English-German
English-Kazakh
English-Russian
English-Turkish

If you have any other open datasets to recommend for AI purposes, please feel free to suggest them in the comments section and let us know what features and annotations they include.
