Finding Data & Statistics: Data Science

General data science

UCI Machine Learning Repository
A collection of datasets and data generators used by the machine learning community. Currently has >600 datasets, searchable by data type, task of interest, domain area, and other attributes.
CORGIS: The Collection of Really Great, Interesting, Situated Datasets
Includes a wide variety of data formatted and ready to use for various data science methods. Data is also available in JSON format and as importable modules in Python. Read more about CORGIS.
DASL – The Data And Story Library
Archive of hundreds of datafiles for use by students and teachers of statistics and data science. DASL provides background information about the data and a source reference whenever that information is available.
HDSI DataPlanet
The HDSI Data Planet Initiative creates a collective resource on campus to enable and spur easily reproducible, shareable, and searchable data-intensive research at UC San Diego. Contains data derived from, or otherwise related to, data science focused research on campus.
Datasets from Papers with Code
As of 2022, contains nearly 7,000 machine learning datasets. Filter by data type (image, text, video, audio, etc.).
Recommender systems datasets
This compilation of datasets from UCSD's Julian McAuley includes datasets from Amazon, Goodreads, RentTheRunway, Facebook, Twitter, Reddit, and more.
Hugging Face datasets
Filter by modality (audio, text, tabular, etc.), language, task, and more.
Face image databases
This guide from Princeton University Library has links to a variety of face image databases. Note: "Please read the rights, permissions, licensing information on the database's webpage before proceeding with use. Make sure to obtain the permissions required and credit/cite as requested by the creators."
COVID-19 social media datasets
Located within this guide under Find Data by Topic/COVID-19; scroll down to "Social media data."
Kaggle datasets
Search by size (GBs), file type, license type, and topic/domain tags.
MNIST
Database of handwritten digits, with a training set of 60,000 and test set of 10,000 examples. Good dataset for pattern recognition.
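For readers who download the raw MNIST files, the following is a minimal sketch of how they could be read with only the Python standard library. It assumes the files have already been decompressed and follow the IDX binary layout commonly documented for MNIST (a big-endian header with magic number 2051 for images or 2049 for labels, followed by unsigned bytes). The function names are hypothetical, and the demo uses tiny synthetic byte strings rather than the real files.

```python
import struct


def parse_idx_images(raw: bytes) -> list[list[int]]:
    """Parse IDX image bytes into one flat list of pixel values per image."""
    # Header: magic number, image count, rows, columns (all big-endian uint32).
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number for images: {magic}")
    size = rows * cols
    if len(raw) != 16 + count * size:
        raise ValueError("byte length does not match the header")
    return [list(raw[16 + i * size: 16 + (i + 1) * size]) for i in range(count)]


def parse_idx_labels(raw: bytes) -> list[int]:
    """Parse IDX label bytes into a list of integer labels."""
    # Header: magic number and label count (big-endian uint32).
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:
        raise ValueError(f"unexpected magic number for labels: {magic}")
    if len(raw) != 8 + count:
        raise ValueError("byte length does not match the header")
    return list(raw[8:])


# Synthetic stand-in for real data: two 2x2 "images" and two labels.
fake_images = struct.pack(">IIII", 2051, 2, 2, 2) + bytes(range(8))
fake_labels = struct.pack(">II", 2049, 2) + bytes([7, 3])

images = parse_idx_images(fake_images)
labels = parse_idx_labels(fake_labels)
```

With the real files, the training images would parse to 60,000 entries and the test images to 10,000, matching the counts given above.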
Yelp dataset
Contains nearly 7 million Yelp reviews, plus photos and business attributes.
YouTube-8M
YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. The YouTube-8M Segments dataset is an extension of the YouTube-8M dataset with human-verified segment annotations.
Kickstarter Data, Global, 2009-2020
The study includes two datasets: 1) the Kickstarter Project dataset and 2) the Backer Location dataset. The Kickstarter Project dataset contains detailed information about all Kickstarter projects from 2009 through 2020, including the project title, project category and subcategory, project location (city, country, and, for U.S.-based projects, state), funding goal in original and U.S. currencies, pledged amount (in dollars), and number of backers for each project. It includes data on 506,199 projects, both successfully and unsuccessfully funded. The Backer Location dataset includes information about backers' country and state, and the total amount pledged for each geographic location.
Common Crawl
The Common Crawl corpus contains petabytes of data collected over 12 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets.
AWS public datasets
Includes a large amount of satellite imagery (LANDSAT, Sentinel-2).
Microsoft Azure public datasets
Choose from US government data, statistical/scientific data, and online service data.

ProQuest TDM Studio
TDM Studio has a Jupyter Notebook interface from which users can analyze decades of newspaper data from major newspapers like the New York Times, Washington Post, Los Angeles Times, and others.

To create an account:
1. Go to https://tdmstudio.proquest.com
2. Click the “Create an account” button.
3. Use your UCSD email address to create your account.

Data Is Plural archive
Data Is Plural is a weekly newsletter of useful/curious datasets, published by Jeremy Singer-Vine. This is a great resource for finding datasets that aren't necessarily data science focused per se, but could have a variety of data science applications (recommendation algorithms, predictive analytics, etc.).
Data cleaning practice
Data cleaning is a hugely important part of data science, but it can be hard to find "good" messy datasets to practice your cleaning skills. This site includes datasets that need cleaning/organizing/reformatting to be most useful, along with a brief overview of what needs to be fixed in each dataset.
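As a rough illustration of the kind of cleaning such datasets call for, here is a minimal sketch using only the Python standard library. It is not tied to any dataset on that site: the sample text, the list of missing-value markers, and the function name `clean_rows` are all assumptions made for the example. It normalizes column names, trims whitespace, converts common missing-value markers to `None`, and drops exact duplicate rows.

```python
import csv
import io

# Assumed set of strings that should count as "missing"; adjust per dataset.
MISSING = {"", "na", "n/a", "null", "none", "-"}


def clean_rows(text: str) -> list[dict]:
    """Return cleaned records from CSV text, skipping exact duplicates."""
    cleaned, seen = [], set()
    for row in csv.DictReader(io.StringIO(text)):
        record = {}
        for key, value in row.items():
            if key is None:  # extra cells beyond the header; ignored here
                continue
            name = key.strip().lower().replace(" ", "_")
            value = (value or "").strip()
            record[name] = None if value.lower() in MISSING else value
        fingerprint = tuple(record.items())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(record)
    return cleaned


# Hypothetical messy input: padded cells, a missing marker, a duplicate row.
messy = " Name ,Age\n Ada , 36 \nBob,N/A\n Ada , 36 \n"
records = clean_rows(messy)
```

Real messy datasets usually need more than this (type conversion, inconsistent date formats, merged columns), so treat the sketch as a starting point only.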

Large language models

LLMDataHub
Collection of datasets for use in large language models (LLMs).
Common Corpus (PleIAs / Hugging Face)
Common Corpus is the largest public domain dataset for training large language models (LLMs). The dataset is multilingual and includes 500 billion words. Read more about Common Corpus on Hugging Face.
Awesome LLM Datasets
Summarizes representative LLM text datasets across five dimensions: Pre-training Corpora, Fine-tuning Instruction Datasets, Preference Datasets, Evaluation Datasets, and Traditional NLP Datasets. Two newer sections cover Multi-modal Large Language Models (MLLMs) Datasets and Retrieval Augmented Generation (RAG) Datasets.
WildChat
The WildChat Dataset from Allen Institute for AI is a corpus of 1 million real-world user-ChatGPT interactions, characterized by a wide range of languages and a diversity of user prompts. It was constructed by offering free access to ChatGPT and GPT-4 in exchange for consensual chat history collection.

Network data

Stanford Large Network Dataset Collection
Includes a variety of network data types, including social networks, networks with ground-truth communities, temporal networks, web graphs, collaboration networks, and more.
"Awesome Network Analysis" datasets
An awesome list of resources to construct, analyze and visualize network data. Inspired by Awesome Deep Learning, Awesome Math and others.
Network Data Repository
This large, comprehensive collection of network graph data includes benchmark network datasets for a wide variety of applications and domains (e.g., network science, bioinformatics, machine learning, data mining, physics, and social science). It covers relational, attributed, heterogeneous, streaming, spatial, and time series network data, as well as non-relational machine learning data. All graph datasets can be downloaded in a standard, consistent format.
UCI Network Data Repository
This project is an effort on the part of UCI Datalab to promote the study of networks, whether it be for the study of social networks, web science or systems biology. The site also has a list of additional public network datasets.
Sample Social Network Datasets For Gephi
This repository contains sample social network datasets specifically collected and formatted for teaching with Gephi. Each folder contains a nodes csv, an edges csv, and a GraphML file that can be imported into Gephi, as well as background information about the original source of the data, the methodologies used to compile it, and the context/significance of the social network.
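Outside of Gephi, a nodes csv and an edges csv like the ones described above can also be combined in a few lines of Python. The sketch below assumes the column headers `Id` and `Label` for nodes and `Source` and `Target` for edges, which is the convention Gephi's spreadsheet import expects; the files in this repository may use different headers. The sample data and the function name `build_adjacency` are hypothetical, and edges are treated as undirected.

```python
import csv
import io


def build_adjacency(nodes_csv: str, edges_csv: str) -> dict[str, set[str]]:
    """Build an undirected adjacency map from node and edge CSV text."""
    # Start with every listed node so isolated nodes are not lost.
    adjacency = {
        row["Id"]: set() for row in csv.DictReader(io.StringIO(nodes_csv))
    }
    for row in csv.DictReader(io.StringIO(edges_csv)):
        source, target = row["Source"], row["Target"]
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


# Hypothetical three-person network.
nodes_text = "Id,Label\n1,Ann\n2,Ben\n3,Cy\n"
edges_text = "Source,Target\n1,2\n2,3\n"

graph = build_adjacency(nodes_text, edges_text)
# Node with the most connections (a simple degree count).
most_connected = max(graph, key=lambda node: len(graph[node]))
```

To use real files, read them with `open(path, newline="").read()` and pass the text in; for directed networks, drop the line that adds the reverse edge.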