Data cleaning practice
Data cleaning is a hugely important part of data science, but it can be hard to find "good" messy datasets to practice your cleaning skills. This site includes datasets that need cleaning, organizing, or reformatting to be most useful, along with a brief overview of what needs to be fixed in each dataset.
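For a sense of what cleaning, organizing, and reformatting can involve, here is a minimal sketch using only Python's standard library. The sample data, the column names, the set of "missing" markers, and the `clean_rows` helper are all made up for illustration; none of them come from the datasets listed on this page, and a real dataset will likely need different rules.

```python
import csv
import io

# Hypothetical messy data, invented for illustration only.
RAW = """Name , Age,City
 alice ,30, new york
BOB,,Boston
 alice ,30, new york
carol,n/a,  CHICAGO 
"""

# Assumed markers for missing values; adjust for a real dataset.
MISSING = {"", "n/a", "na", "null"}


def clean_rows(text):
    """Normalize headers, trim whitespace, unify casing,
    convert missing ages to None, and drop exact duplicates."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip().lower() for h in next(reader)]
    seen = set()
    cleaned = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = [v.strip() for v in row]
        record = dict(zip(header, values))
        record["name"] = record["name"].title()
        record["city"] = record["city"].title()
        age = record["age"]
        record["age"] = None if age.lower() in MISSING else int(age)
        key = tuple(record.values())
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        cleaned.append(record)
    return cleaned


rows = clean_rows(RAW)
for r in rows:
    print(r)
```

Duplicates are checked after normalization, so rows that differ only in stray whitespace or letter case are treated as the same record.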
UCI Machine Learning Repository
A collection of datasets and data generators used by the machine learning community. Currently has 452 datasets, searchable by data type, task of interest, domain area, and other attributes.
Search by size (GBs), file type, license type, and topic/domain tags.
Database of handwritten digits, with a training set of 60,000 examples and a test set of 10,000 examples. A good dataset for pattern recognition.
Contains nearly 6 million Yelp reviews, plus photos and business attributes. Check out the suggested challenges for photo classification, natural language processing, and sentiment analysis.
AWS public datasets
Includes a large amount of satellite imagery (LANDSAT, Sentinel-2).
Microsoft Azure public datasets
Choose from US government data, statistical/scientific data, and online service data.
Recommender systems datasets
This compilation of datasets from UCSD's Julian McAuley includes datasets from Amazon, Goodreads, RentTheRunway, Facebook, Twitter, Reddit, and more.
YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. The YouTube-8M Segments dataset is an extension of the YouTube-8M dataset with human-verified segment annotations.