
8 popular free public data sets for machine learning and deep learning

The key to mastering machine learning is practice. Only through hands-on experience will you come to understand the true capabilities of machine learning and its limitations. I have curated this list of publicly downloadable, high-quality data sets so that you can learn and hone your skills.

The 8 data sets below fall into 4 main categories: image, audio, signal and natural language processing. Each of these real-world data sets has its own nuances and calls for its own approach to artificial intelligence and machine learning experimentation.

Image

MNIST is a popular handwritten digit data set with a training set of 60,000 handwritten digits and a test set of 10,000 handwritten digits. It’s a great database for testing your machine learning and pattern recognition algorithms.
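If you download the original MNIST files, they usually come as gzipped IDX files rather than ordinary pictures. Here is a minimal sketch of reading the image file with only the Python standard library. It assumes the IDX3 layout: a 16-byte big-endian header (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel. The name `parse_idx_images` is my own illustrative helper, not part of any library.

```python
import struct


def parse_idx_images(raw: bytes):
    """Parse uncompressed IDX3 image bytes into a list of images.

    Each image is returned as a list of rows, and each row is a list of
    pixel intensities (0-255). Assumes the IDX3 layout described above.
    """
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError(f"not an IDX3 image file (magic number {magic})")

    pixels = raw[16:]
    size = rows * cols
    if len(pixels) != count * size:
        raise ValueError("pixel data length does not match the header")

    images = []
    for i in range(count):
        flat = pixels[i * size:(i + 1) * size]
        # Split the flat pixel run into rows of `cols` values each.
        images.append([list(flat[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images
```

To try it on a real download, you would pass in the decompressed bytes, for example the result of `gzip.open(path, "rb").read()`, where `path` is wherever you saved the file.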

Fashion MNIST contains a training set of 60,000 product images and a test set of 10,000 product images. It was created as an alternative to the MNIST database, as some believe the handwritten digit data has been overused. The images are in grayscale and cover 10 product categories, e.g. Shirt, Dress, Sneaker, Bag and Ankle boot.

SVHN (Street View House Numbers) is a real-world image data set for developing digit recognition algorithms. The data was obtained from house numbers in Google Street View images. It is a significantly harder, real-world problem compared to the clean, centered grayscale digit images in the MNIST data.

Audio and Signal

FSDD (Free Spoken Digit Dataset) is another MNIST-inspired data set. It is an English speech data set consisting of recordings of spoken digits, stored as wav files sampled at 8 kHz. It was created for the task of identifying spoken digits in audio samples. Currently it has 2,000 recordings from 4 speakers.
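Here is a small sketch of how you might inspect an FSDD recording using only Python's built-in `wave` module. It assumes the files follow the naming pattern `{digit}_{speaker}_{index}.wav` (for example `7_jackson_32.wav`), so the label can be read straight from the file name. `parse_fsdd_name` and `wav_info` are hypothetical helper names I made up for illustration.

```python
import io
import wave
from typing import NamedTuple


class SpokenDigit(NamedTuple):
    digit: int
    speaker: str
    index: int


def parse_fsdd_name(filename: str) -> SpokenDigit:
    """Extract label, speaker and take number from an FSDD-style file name.

    Assumes the pattern {digit}_{speaker}_{index}.wav.
    """
    stem = filename.rsplit("/", 1)[-1]
    if stem.lower().endswith(".wav"):
        stem = stem[:-4]
    digit, rest = stem.split("_", 1)
    speaker, index = rest.rsplit("_", 1)
    return SpokenDigit(int(digit), speaker, int(index))


def wav_info(source):
    """Return (sample_rate_hz, duration_seconds) for a wav path or file object."""
    with wave.open(source, "rb") as w:
        rate = w.getframerate()
        return rate, w.getnframes() / rate
```

For a genuine FSDD file, `wav_info` should report a sample rate of 8000, matching the 8 kHz mentioned above.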

ECG Heartbeat Categorization Dataset is a large collection of preprocessed clinical records derived from two well-known datasets in heartbeat classification, the MIT-BIH Arrhythmia Dataset and the PTB Diagnostic ECG Database. The MIT-BIH Arrhythmia Dataset has 109,446 samples while the PTB Diagnostic ECG Database has 14,552 samples. Both are sampled at 125 Hz.
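To give a feel for working with this kind of signal data, here is a rough sketch of reading heartbeat rows from a CSV file. It assumes each row holds one heartbeat as a series of numeric samples with the class label in the last column, and no header row. That layout is my assumption, so check it against the files you download. `read_heartbeats` and `duration_seconds` are illustrative names only.

```python
import csv
import io

SAMPLING_RATE_HZ = 125  # sampling rate stated for both source datasets


def read_heartbeats(fileobj):
    """Read (signal, label) pairs from CSV rows.

    Assumes every column except the last is a signal sample and the
    last column is the class label, with no header row.
    """
    beats = []
    for row in csv.reader(fileobj):
        if not row:
            continue  # skip blank lines
        values = [float(v) for v in row]
        beats.append((values[:-1], int(values[-1])))
    return beats


def duration_seconds(signal):
    """Length of a signal in seconds at the 125 Hz sampling rate."""
    return len(signal) / SAMPLING_RATE_HZ
```

With a real file you would call something like `read_heartbeats(open(path, newline=""))`, where `path` is wherever you saved the CSV.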

Natural Language Processing

Sentiment140 is a dataset for sentiment analysis, great for NLP and emotion analysis projects. It has the following 6 features: the polarity of the tweet (negative, neutral, positive), the id of the tweet, date, query, user and the text of the tweet.
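The 6 features map nicely onto a CSV reader. Below is a minimal sketch that assumes the download is a header-less CSV with the columns in the order listed above, and that polarity is coded as 0 for negative, 2 for neutral and 4 for positive. The column names and the `read_tweets` helper are my own labels for illustration, not official ones.

```python
import csv
import io

# Assumed column order, following the feature list above.
COLUMNS = ("polarity", "id", "date", "query", "user", "text")

# Assumed numeric coding of the polarity column.
POLARITY_NAMES = {0: "negative", 2: "neutral", 4: "positive"}


def read_tweets(fileobj):
    """Read tweets from a header-less CSV into a list of dicts.

    Rows that do not have exactly 6 fields are skipped. The polarity
    code is replaced by its name ("unknown" for unexpected codes).
    """
    tweets = []
    for record in csv.reader(fileobj):
        if len(record) != len(COLUMNS):
            continue
        row = dict(zip(COLUMNS, record))
        row["polarity"] = POLARITY_NAMES.get(int(row["polarity"]), "unknown")
        tweets.append(row)
    return tweets
```

When opening the real file you may need to pass an explicit encoding such as `encoding="latin-1"` to `open`, since tweets often contain characters that trip up the default decoder.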

Tagged and Cleaned Wikipedia Corpus is a large corpus for NLP researchers to work with. In total it contains about 1.9 billion words across more than 4 million articles.

Machine Translation is a dataset used for developing machine translation systems.
Below are the language pairs available for translation:

If you have any other open datasets to recommend for AI purposes, please feel free to suggest them in the comments section and let us know what features and annotations they include.


