Data cleaning practice
Data cleaning is a hugely important part of data science, but it can be hard to find "good" messy datasets to practice your cleaning skills. This site includes datasets that need cleaning, organizing, or reformatting to be most useful, along with a brief overview of what needs to be fixed in each dataset.
UCI Machine Learning Repository
A collection of datasets and data generators used by the machine learning community. Currently has 452 datasets, searchable by data type, task of interest, domain area, and other attributes.
Search by size (GBs), file type, license type, and topic/domain tags.
Database of handwritten digits, with a training set of 60,000 and test set of 10,000 examples. Good dataset for pattern recognition.
Contains nearly 6 million Yelp reviews, plus photos and business attributes. Check out the suggested challenges for photo classification, natural language processing, and sentiment analysis.
AWS public datasets
Includes a large amount of satellite imagery (LANDSAT, Sentinel-2).
Microsoft Azure public datasets
Choose from US government data, statistical/scientific data, and online service data.
Recommender systems datasets
This compilation of datasets from UCSD's Julian McAuley includes datasets from Amazon, Goodreads, RentTheRunway, Facebook, Twitter, Reddit, and more.
YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. The YouTube-8M Segments dataset is an extension of the YouTube-8M dataset with human-verified segment annotations.
Data Is Plural archive
Data Is Plural is a weekly newsletter of useful/curious datasets, published by Jeremy Singer-Vine. This is a great resource for finding datasets that aren't necessarily focused on data science but could have a variety of data science applications (recommendation algorithms, predictive analytics, etc.).