
8 popular free public data sets for machine learning and deep learning

The key to mastering machine learning is practice. It is only through experience that you will come to understand the true capabilities of machine learning and its limitations. I have curated this list of publicly downloadable, high-quality data sets so that you can learn and hone your skills.

The 8 data sets below fall into 4 main categories: image, audio, signal, and natural language processing. Each of these real-world data sets has its own nuances and calls for its own approach in artificial intelligence and machine learning experimentation.


Image


MNIST is a popular handwritten digit data set that contains a training set of 60,000 images and a test set of 10,000 images. It's a great database for testing your machine learning and pattern recognition algorithms.
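
If you download the raw MNIST files rather than using a library loader, the images come in a simple binary layout called IDX: a 16-byte big-endian header (magic number, image count, rows, columns) followed by one unsigned byte per pixel. Here is a minimal sketch of a parser for that layout. The function name `parse_idx_images` and the tiny 2x2 sample are my own illustration, not part of MNIST, and the real files are usually gzip-compressed, so you would likely read them with `gzip.open` first.

```python
import struct


def parse_idx_images(data: bytes):
    """Parse IDX3 image bytes into a list of images (each a list of pixel rows)."""
    # Header: four big-endian unsigned 32-bit ints.
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:  # 0x00000803: unsigned bytes, 3 dimensions
        raise ValueError(f"unexpected magic number {magic}")
    pixels = data[16:]
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel data does not match the header")
    size = rows * cols
    images = []
    for i in range(count):
        flat = pixels[i * size:(i + 1) * size]
        # Cut the flat pixel run into rows of `cols` values each.
        images.append([list(flat[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images


# Synthetic stand-in: two 2x2 "images" with made-up pixel values (not real MNIST data).
sample = struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 255, 128, 64, 1, 2, 3, 4])
images = parse_idx_images(sample)
print(len(images), images[0])  # 2 [[0, 255], [128, 64]]
```

With a real MNIST image file the same function should return 28x28 grids, with pixel values from 0 (background) to 255.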

Fashion MNIST contains a training set of 60,000 product images and a test set of 10,000 product images. It is an alternative to the MNIST database, as some believe the MNIST handwritten digit data has been overused. The images are in grayscale and fall into 10 product categories, e.g. Shirt, Dress, Sneaker, Bag, and Ankle boot.
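
The labels in Fashion MNIST are stored as integers from 0 to 9, so you will usually want a lookup table to turn predictions back into category names. The sketch below uses the class order as I understand it from the data set's own documentation. The helper name `label_name` is just for illustration.

```python
# Class order assumed from the Fashion MNIST documentation (index = label id).
FASHION_MNIST_LABELS = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def label_name(label_id: int) -> str:
    """Return the human-readable category for an integer label."""
    if not 0 <= label_id < len(FASHION_MNIST_LABELS):
        raise ValueError(f"label must be between 0 and 9, got {label_id}")
    return FASHION_MNIST_LABELS[label_id]


print(label_name(9))  # Ankle boot
```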

SVHN (Street View House Numbers) is a real-world image data set for developing digit recognition algorithms. The data was obtained from house numbers in Google Street View images. It is a significantly harder, real-world problem than the clean, centered grayscale digits in the MNIST data.

Audio and Signal


FSDD (Free Spoken Digit Dataset) is another MNIST-inspired data set. It is an English speech data set that consists of recordings of spoken digits in WAV files sampled at 8 kHz. It was created for the task of identifying spoken digits in audio samples. At the time of writing it has 2,000 recordings from 4 speakers.
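
To get a feel for the FSDD files, here is a minimal sketch that reads a WAV file with Python's standard `wave` module and recovers the label from the file name. It assumes the recordings are mono 16-bit PCM and that files are named `<digit>_<speaker>_<index>.wav`, so treat both as assumptions to check against your download. The helper names and the synthetic tone are my own stand-ins, not real FSDD audio.

```python
import io
import math
import struct
import wave


def parse_fsdd_name(filename: str):
    """Split '<digit>_<speaker>_<index>.wav' into (digit, speaker, index)."""
    stem = filename.rsplit(".", 1)[0]
    digit, speaker, index = stem.split("_")
    return int(digit), speaker, int(index)


def read_wav(fileobj):
    """Return (sample_rate, samples) for a mono 16-bit PCM WAV file."""
    with wave.open(fileobj, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        rate = wav.getframerate()
        n_frames = wav.getnframes()
        raw = wav.readframes(n_frames)
    # 16-bit PCM samples are little-endian signed shorts.
    return rate, list(struct.unpack(f"<{n_frames}h", raw))


# Build a synthetic 0.1 s, 440 Hz tone at 8 kHz in memory (stand-in for a real recording).
tone = [int(10000 * math.sin(2 * math.pi * 440 * t / 8000)) for t in range(800)]
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(8000)
    wav.writeframes(struct.pack("<800h", *tone))
buffer.seek(0)

rate, samples = read_wav(buffer)
print(rate, len(samples), len(samples) / rate)  # 8000 800 0.1
print(parse_fsdd_name("7_jackson_32.wav"))      # (7, 'jackson', 32)
```

For a real recording, open the file in binary mode and pass it to `read_wav` in place of the in-memory buffer.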

ECG Heartbeat Categorization Dataset is a large library of preprocessed clinical records derived from two famous data sets in heartbeat classification, the MIT-BIH Arrhythmia Dataset and the PTB Diagnostic ECG Database. The MIT-BIH Arrhythmia Dataset has 109,446 samples while the PTB Diagnostic ECG Database has 14,552 samples. Both are sampled at 125 Hz.
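
Because the signals are sampled at 125 Hz, you can convert the length of a heartbeat segment into seconds by dividing the number of samples by 125. The sketch below assumes the data comes as CSV rows where every column but the last is a signal value and the last column is the class label. That layout is my assumption, so check it against the files you download. The rows here are made up for illustration.

```python
import csv
import io

SAMPLE_RATE_HZ = 125  # sampling frequency quoted for both source databases


def parse_ecg_row(row):
    """Split one CSV row into (signal, label); the label is assumed to be the last column."""
    values = [float(v) for v in row]
    return values[:-1], int(values[-1])


def duration_seconds(signal, sample_rate_hz=SAMPLE_RATE_HZ):
    """Length of a segment in seconds: number of samples / sampling rate."""
    return len(signal) / sample_rate_hz


# Two made-up rows: five signal values followed by a label (not real ECG data).
fake_csv = "0.0,0.5,1.0,0.5,0.0,0.0\n0.1,0.9,0.2,0.0,0.0,1.0\n"
rows = [parse_ecg_row(r) for r in csv.reader(io.StringIO(fake_csv))]

for signal, label in rows:
    print(label, len(signal), duration_seconds(signal))
```

By the same arithmetic, a segment of 187 samples would span 187 / 125, or roughly 1.5 seconds.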

Natural Language Processing


Sentiment140 is a data set for sentiment analysis, great for NLP and emotion analysis projects. It has the following 6 features: the polarity of the tweet (negative, neutral, or positive), the ID of the tweet, the date, the query, the user, and the text of the tweet.
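
Here is a minimal sketch of reading those 6 fields with Python's `csv` module. It assumes the file has no header row, that the columns appear in the order listed above, and that polarity is coded as 0 (negative), 2 (neutral), and 4 (positive). Verify all three against your copy. The two rows are made up for illustration.

```python
import csv
import io

# Column order assumed to match the feature list above; the file is assumed to have no header.
FIELDS = ["polarity", "id", "date", "query", "user", "text"]
# Assumed numeric coding of the polarity column.
POLARITY_NAMES = {0: "negative", 2: "neutral", 4: "positive"}


def read_tweets(fileobj):
    """Yield one dict per tweet, with the polarity code replaced by its name."""
    for row in csv.reader(fileobj):
        record = dict(zip(FIELDS, row))
        record["polarity"] = POLARITY_NAMES[int(record["polarity"])]
        yield record


# Made-up sample rows (not taken from the real data set).
fake_csv = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","so tired today"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_b","loving this, really"\n'
)
tweets = list(read_tweets(io.StringIO(fake_csv)))
for tweet in tweets:
    print(tweet["polarity"], "|", tweet["text"])
```

The `csv` module takes care of quoted fields, so commas inside the tweet text do not split the row. The real file is reportedly not UTF-8 encoded, so you may need to open it with `encoding="latin-1"`.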

Tagged and Cleaned Wikipedia Corpus is a large data set for NLP researchers to work with. In total it contains about 1.9 billion words across more than 4 million articles.

Machine Translation is a data set used for developing machine translation systems.
Below are the language pairs available for translation:
English-Chinese
English-Czech
English-Estonian
English-Finnish
English-German
English-Kazakh
English-Russian
English-Turkish

If you have any other open data sets to recommend for AI purposes, please feel free to suggest them in the comments section and let us know what features and annotations they include.

