Finding Data & Statistics: Text Corpora

UCSD-licensed databases

Most of our licensed databases:

limit the number of citations or articles that can be downloaded at once
prohibit systematic downloading (downloading of substantial collections)
prohibit automated downloading (use of scripts)
prohibit datamining directly on the vendor's servers
prohibit the redistribution of content (including cleaned data)

We can provide bulk downloads of the following ProQuest databases:

Congressional Record (part A)
History Vault: Vietnam War collection
History Vault: Immigration collection (part 1)
Chicago Tribune, 1849-1933
Los Angeles Times, 1881-1933
New York Times, 1851-1937
Wall Street Journal, 1889-1935
Washington Post, 1877-1935
San Francisco Chronicle, 1865-1922
American Periodicals Series
Periodicals Archive Online (series 1-5)

Most of our other licensed ProQuest content, including news databases, is available for analysis with R or Python through the ProQuest TDM Studio. Contact Data Science Librarian Stephanie Labou with any questions.

Other databases that support text analysis in some way:

ProQuest TDM Studio
TDM Studio is the text and data mining interface for more than 200 licensed ProQuest content products, including government, archival, dissertation, and news databases. Content is available for analysis with R or Python in the workbench dashboard, or use the visualization dashboard to interact with and visualize content without any coding needed. Contact Stephanie Labou (slabou@ucsd.edu) for additional assistance.

To create an account:
1. Go to https://tdmstudio.proquest.com
2. Click the “Create an account” button
3. Use your UCSD email address to create your account.
Gale Digital Scholar Lab
Explore UCSD holdings from Gale Primary Sources using digital humanities text and data mining tools. No coding required! Rediscover and interpret the past through analysis and visualization of historical texts, including newspapers, books, archival collections, and more. (Create your personal DSL account online to begin selecting and analyzing materials. If you are off campus, connect to the VPN so that your account works properly.) Learn more via tutorials and recorded webinars.
Linguistic Data Consortium (LDC)
The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and government research laboratories that hosts a repository devoted to acquiring, archiving, preserving and distributing linguistic corpora. These corpora are searchable via the LDC catalog. UC San Diego's membership allows UC San Diego students, faculty, and employees to register for a login that, once approved, provides free access to the datasets included with our membership years and a 50% discount on other datasets. Note that these datasets can only be used for educational, non-commercial text and data mining projects.
VoxGov
Includes a broad range of official and ephemeral information resources issued by federal agencies, individual officials and candidates, and other organizations from all branches of the U.S. Federal Government, and links that content to publicly accessible government documentation. Includes social media, official media releases, legislation, regulations, and a variety of government documents from Congress and the Executive branches. Textual data can be visualized in word clouds, tree maps, bubble graphs, and terms view graphs. Users who sign up for an account and agree to additional terms of service can download a small number of full documents; researchers and students with non-commercial, academic projects can apply with VoxGov for additional bulk data download credentials.
HathiTrust Research Center Analytics
HathiTrust Research Center (HTRC) enables computational analysis (text and data mining) of works in the HathiTrust Digital Library (HTDL) to facilitate non-profit research and educational uses of the collection. HTRC creates and maintains a suite of tools and services for text-based, data-driven research, such as HTRC Algorithms and Data Capsule, and engages in cutting-edge research on large-scale data analysis. HTRC operates under a non-consumptive research paradigm: HTRC makes available the collection for computational analysis, while remaining clearly within the bounds of the fair use rights courts have recognized as applying to text analysis. The Center is committed to breaking new ground in the areas of non-consumptive text mining, allowing scholars to fully utilize content of the HathiTrust Digital Library.
Note: HathiTrust will cease funding the HathiTrust Research Center (HTRC) at the end of 2026.
Adam Matthew Explorer
Adam Matthew publishes unique primary source collections from archives around the world. The categorized collections cover many contemporary subjects such as culture, war, lifestyles, business technology and sociopolitical development in the Americas, Asia and Europe.
Search across all of UC San Diego's Adam Matthew archival collections. We also have text and data mining access to our licensed Adam Matthew databases.

Collections include:

African American Communities
Age of Exploration
America in World War Two: Oral Histories and Personal Accounts
American History, 1493-1945
American Indian Histories and Cultures
American Indian Newspapers
American West
Apartheid South Africa, 1948-1980
China, America and the Pacific
China: Culture and Society
China: Trade, Politics and Culture, 1793-1980
Church Missionary Society Periodicals
Colonial America
Colonial Caribbean
Confidential Print: Africa, 1834-1966
Confidential Print: Latin America, 1833-1969
Confidential Print: Middle East
Confidential Print: North America
Defining Gender
East India Company
Eighteenth Century Drama
Eighteenth Century Journals
Empire Online
Ethnomusicology: Global Field Recordings
Everyday Life and Women in America
First World War Portal
Food and Drink in History
Foreign Office Files China 1919-1980
Foreign Office Files India, Pakistan and Afghanistan, 1947-1980
Foreign Office Files Japan, 1919-1952
Foreign Office Files Middle East, 1971-1981
Foreign Office Files South East Asia, 1963-1980
Frontier Life
Gender: Identity and Social Change
Global Commodities
India, Raj and Empire
Interwar Culture
J. Walter Thompson: Advertising America
Jewish Life in America
Leisure Travel and Mass Culture
Life at Sea: Seafaring in the Anglo-American Maritime World, 1600-1900
Literary Manuscripts Berg
Literary Manuscripts Leeds
Literary Print Culture
London Low Life
Macmillan Cabinet Papers, 1957-1963
Market Research and American Business, 1935-1965
Mass Observation Online
Medical Services and Warfare
Medieval Family Life
Medieval Travel Writing
Meiji Japan
Migration to New Worlds
Perdita Manuscripts, 1500-1700
Popular Culture in Britain and America, 1950-1975
Popular Medicine in America, 1800-1900
Race Relations in America
Romanticism: Life, Literature and Landscape
Service Newspapers of World War Two
Sex and Sexuality
Shakespeare in Performance
Shakespeare's Globe Archive
Slavery, Abolition and Social Justice
Socialism on Film
The Grand Tour
The Nixon Years, 1969-1974
Trade Catalogues and the American Home
Travel Writing, Spectacle and World History
Victorian Popular Culture
Victorians on Film: Entertainment, Innovation & Everyday Life
Virginia Company Archives
Women in the National Archives (UK)
World's Fairs

JSTOR for Research
Data for Research is a free service for researchers wishing to analyze content on JSTOR through a variety of lenses and perspectives. If you require more than 1,000 documents or a type of data not available through the interactive portion of the site, please contact us at: support@ithaka.org

Constellate
JSTOR and Portico are building a text and data mining (TDM) platform aimed at teaching and enabling a generation of researchers to text mine. The platform includes a user interface to allow researchers, students, and instructors to curate, visualize, and save custom datasets. Researchers may download the extracted features of their curated datasets. Extracted features are a non-consumptive “bag-of-words” where each article or book chapter in the custom dataset is represented with bibliographic metadata, the unique set of words on each page, and the number of times the word occurs on the page. The dataset includes journals, books, and newspapers from JSTOR, Portico, and Chronicling America.
Note: ITHAKA has decided to sunset Constellate on July 1, 2025.
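
The “bag-of-words” extracted features described above can be illustrated with a short Python sketch. This is not Constellate's actual file format or code: the function names, the dictionary layout, and the simple tokenizer are all assumptions made for illustration. It only shows the general idea that each page is reduced to its unique words and their counts, so word order (and therefore the readable text) is not recoverable.

```python
import re
from collections import Counter


def page_features(page_text: str) -> dict:
    """Return a bag-of-words for one page: each unique word and its count."""
    # Lowercase and keep runs of letters/apostrophes. This tokenizer is a
    # simplification chosen for the example, not the platform's own method.
    tokens = re.findall(r"[a-z']+", page_text.lower())
    return dict(Counter(tokens))


def document_features(metadata: dict, pages: list) -> dict:
    """Combine bibliographic metadata with per-page word counts."""
    return {
        "metadata": metadata,
        "pages": [page_features(text) for text in pages],
    }


# Hypothetical two-page article, made up for this example.
example = document_features(
    {"title": "Sample Article", "year": 1999},
    ["The cat sat on the mat.", "The dog sat."],
)
print(example["pages"][0])
```

Because only counts are kept, features like these suit word-frequency and topic-modeling work but not methods that depend on sentence structure.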

Please email The Library with questions about any specific resource or database.

Freely available corpora & bulk data

the @unitedstates project
@unitedstates is a shared commons of data and tools for the United States. Made by the public, used by the public. Featuring work from people with the Sunlight Foundation, GovTrack.us, the New York Times, the Electronic Frontier Foundation, and the Internet
American Presidency Project
One of the most comprehensive collections of web resources on the American presidency, including documents, public papers, executive orders, addresses, press conferences, debates, election data, approval ratings, and much more. Data topics include Relations with Congress / Popularity / Public Appearances / Growth of the Executive Branch / Presidential Selection / State of the Union and Inaugural Address Charts / Presidential Disability
arXiv Bulk Data Access
Awesome Public Datasets
An awesome list of high-quality open datasets (HQOD) in public domains (on-going).
Caselaw Access Project
All published U.S. court decisions freely available to the public online, digitized from the collection of the Harvard Law School Library. Includes bulk download, API access, Historical Trends visualizer, and a variety of apps created using the data.
Chronicling America
OCR bulk downloads from the Library of Congress of America's historic newspaper pages from 1836-1922. See also Alex Leslie's workshop and code for text analysis of Chronicling America newspapers in R.
CMU Movie Summary Corpus
This page provides links to a dataset of movie plot summaries and associated metadata. This data was collected by David Bamman, Brendan O'Connor, and Noah Smith at the Language Technologies Institute and Machine Learning Department at Carnegie Mellon University.
Common Crawl
Open repository of web crawl data that can be accessed and analyzed by anyone.
Congressional and Federal Government Web Harvests
Since 2006, the National Archives and Records Administration (NARA) has harvested Congressional web sites at the end of each Congress. They also did a wider harvest of federal websites for the 2004 Presidential transition.
Consumer Complaint Database
From the Consumer Financial Protection Bureau
Corpus of Contemporary American English
The largest freely available corpus of English, and the only large and balanced corpus of American English.
Court Listener - Bulk Data
Includes bulk download of federal and state court opinions, oral arguments, dockets, and judges; a citation database; and the judicial database of biographical information about judges. Site also includes online network visualization tool.
David D. Lewis - Test Collections
Reuters-21578 (and Reuters-22173): The most widely used text categorization test collection.

RCV1 (Reuters Corpus Volume 1): A large, high quality, recently released collection of news stories. Likely to become the new standard benchmark in text categorization research.

TREC-AP : A text categorization task based on the Associated Press articles used in the NIST TREC evaluations.
DocNow Catalog (Documenting the Now)
The DocNow Catalog is a collectively curated listing of Twitter datasets. Public datasets are shared as Tweet IDs, which can be hydrated back into full datasets using our Hydrator desktop application.
English-Corpora.org
Includes: News on the Web (NOW); iWeb: The Intelligent Web-based Corpus; Global Web-Based English (GloWbE); Wikipedia Corpus; Coronavirus Corpus; Corpus of Contemporary American English (COCA); Corpus of Historical American English (COHA); The TV Corpus; The Movie Corpus; Corpus of American Soap Operas; Hansard Corpus; Early English Books Online; Corpus of US Supreme Court Opinions; TIME Magazine Corpus; British National Corpus (BNC); Strathy Corpus (Canada); CORE Corpus; From Google Books n-grams (compare): American English; British English
GovInfo: Bulk Data
Congressional Bills; Bill Status; Bill Summaries; Commerce Business Daily; Code of Federal Regulations (Annual Edition); Electronic Code of Federal Regulations; Federal Register; United States Government Manual; House Rules and Manual; Privacy Act Issuances; Public Papers of the Presidents of the United States; Supreme Court Decisions 1937-1975 (FLITE)
The GDELT Project
Monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.

Data visualizations for events from 2017-present and up to 250 articles at a time can be accessed by changing the structure of the API url (manually or through web forms) to reflect the search query.
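
As a rough illustration of "changing the structure of the API url," the sketch below assembles a query URL in Python without sending any request. The endpoint and the parameter names (`query`, `mode`, `maxrecords`, `format`) reflect our understanding of the GDELT DOC 2.0 API and should be treated as assumptions to verify against the GDELT documentation. The helper name `build_doc_api_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base endpoint for the GDELT DOC 2.0 API; confirm before relying on it.
DOC_API_BASE = "https://api.gdeltproject.org/api/v2/doc/doc"


def build_doc_api_url(query: str, mode: str = "artlist",
                      max_records: int = 250, fmt: str = "json") -> str:
    """Build a search URL; only the query string changes between searches."""
    # The guide notes a ceiling of 250 articles per request.
    if not 1 <= max_records <= 250:
        raise ValueError("max_records must be between 1 and 250")
    params = {
        "query": query,
        "mode": mode,
        "maxrecords": max_records,
        "format": fmt,
    }
    # urlencode percent-encodes quotes and turns spaces into '+'.
    return f"{DOC_API_BASE}?{urlencode(params)}"


url = build_doc_api_url('"climate change"')
print(url)
```

Pasting the printed URL into a browser, or editing its parameters by hand, is equivalent to filling in the web forms.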

Raw data can be analyzed using SQL queries on Google BigQuery. Researchers with the capacity to work with 2.5TB for each year of data can download the raw data directly from GDELT.

Most documentation is shared on various dates throughout the GDELT blog.

A Python client to fetch data from the GDELT 2.0 Doc API.

The GDELT 1.0 Event Database contains over a quarter-billion records organized into a set of tab-delimited files by date. Through March 31, 2013 records are stored in monthly and yearly files by the date the event took place. Beginning with April 1, 2013, files are created daily and records are stored by the date the event was found in the world's news media rather than the date it occurred (97%+ of events are reported within 24 hours of happening, but a small number of events each day are past events being mentioned for the first time - if an event has been seen before it will not be included again). Files are ZIP compressed in tab delimited format, but named with a ".CSV" extension.
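
Because the files carry a ".CSV" extension but are actually tab-delimited, tools that guess the delimiter from the file name may misread them. The following Python sketch shows one way to read such an archive by setting the delimiter explicitly. The file name and the three-column rows are made up so the example runs offline; real GDELT files have many more columns, and `read_gdelt_zip` is a hypothetical helper name.

```python
import csv
import io
import zipfile


def read_gdelt_zip(zip_bytes: bytes) -> list:
    """Read every tab-delimited member of a ZIP archive into rows of strings."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            with archive.open(name) as member:
                text = io.TextIOWrapper(member, encoding="utf-8", newline="")
                # Despite the ".CSV" name, fields are separated by tabs.
                reader = csv.reader(text, delimiter="\t",
                                    quoting=csv.QUOTE_NONE)
                rows.extend(reader)
    return rows


# Build a tiny in-memory archive with invented rows (not real GDELT data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("20130401.export.CSV",
                     "1\t20130401\tUSA\n2\t20130401\tFRA\n")

sample_rows = read_gdelt_zip(buffer.getvalue())
print(sample_rows)
```

For a downloaded file, the same function could be called with the bytes read from disk, for example `read_gdelt_zip(open(path, "rb").read())`, where `path` points to the ZIP file.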

Ken Blake's "A short intro to GDELT" tutorial

GDELT Summary + GDELT APIs
In addition to its datasets, GDELT also offers a number of live realtime JSON APIs offering fulltext search and other capabilities, including DOC, GEO and TV. Explore them using the GDELT Summary web form that offers a non-technical, human-friendly website wrapper around the APIs, showcasing their capabilities.

The UK Foreign & Commonwealth Office's Open Source Unit has also created a GDELT query interface for working with the API that offers searching by theme as well as by keyword.

Themes taxonomy

GDELT datasets and specialized datasets (some for download, some only available on Google BigQuery)

Google BigQuery documentation
GloWbE: Corpus of Global Web-Based English
Download 440 million words of full-text data for COCA, or 1.8 billion words for GloWbE. With this data, you will have the corpora on your computer, rather than having to use the web interface. The data comes in three formats: tables for relational databases, word/lemma/PoS (vertical format), or text (linear format).
HathiTrust
Non-Google digitized collection: Approximately 550,000 public domain volumes as of March 2015, primarily, though not exclusively, English language materials published prior to 1923. Google-digitized volumes (requires institutional signature): Approximately 4.8 million public domain volumes as of March 2015, representing a wide variety of languages, subjects, and dates. See the visualizations of HathiTrust public domain volumes.
Internet Archive
How to download in bulk using wget
JSTOR for Research
Data for Research is a free service for researchers wishing to analyze content on JSTOR through a variety of lenses and perspectives. If you require more than 1,000 documents or a type of data not available through the interactive portion of the site, please contact us at: support@ithaka.org
Million Song Dataset
A freely-available collection of audio features and metadata for a million contemporary popular music tracks. Sub-datasets include:
- SecondHandSongs dataset -> cover songs
- musiXmatch dataset -> lyrics
- Last.fm dataset -> song-level tags and similarity
- Taste Profile subset -> user data
- thisismyjam-to-MSD mapping -> more user data
- tagtraum genre annotations -> genre labels
- Top MAGD dataset -> more genre labels
Open American National Corpus
OpenLibrary
Open Library is an open, editable library catalog, building towards a web page for every book ever published.
OpenSubtitles.org
Movie subtitles
US Patents & Trademarks
Bulk downloads of patent and trademark data from US Patent and Trademark Office
Industry Documents Library
Includes documents received via FOIA and other means; collections related to tobacco, drugs, chemicals, food, fossil fuels. Documents available for download in bulk, by source.
Project Gutenberg: Information About Robot Access to our Pages
Project Gutenberg offers over 50,000 free ebooks. Note from Terms of Use: This website is intended for human users only. Any perceived use of automated tools to access this website will result in a temporary or permanent block of your IP address with a few exceptions.
Public.Resource.Org
Bulk downloadable content harvested from government web sites and other sources
PubMed Central Open Access subset
Qualitative Data Repository
The Qualitative Data Repository (QDR) is a dedicated archive for storing and sharing digital data (and accompanying documentation) generated or collected through qualitative and multi-method research in the social sciences and related disciplines.
Robots Reading Vogue
Project from Yale based on Vogue Archive (ProQuest)
SNAP: Stanford Network Analysis Project
Stanford Network Analysis Platform (SNAP) is a general purpose network analysis and graph mining library.
Text Creation Partnership
The Text Creation Partnership creates standardized, accurate XML/SGML encoded electronic text editions of early print books. We transcribe and mark up the text from the millions of page images in ProQuest's Early English Books Online, Gale Cengage's Eighteenth Century Collections Online, and Readex's Evans Early American Imprints.
Observatory on Social Media (OSoMe)
OSoMe (awe•some) is an API and set of tools created as part of a broad research project that studies information diffusion in social media.
United Nations Parallel Corpus
The United Nations Parallel Corpus is composed of official records and other parliamentary documents of the United Nations that are in the public domain. These documents are mostly available in the six official languages of the United Nations. The corpus was created as part of the United Nations commitment to multilingualism and as a reaction to the growing importance of statistical machine translation (SMT) within the Department for General Assembly and Conference Management (DGACM) translation services and the United Nations SMT system, Tapta4UN. The purpose of the corpus is to allow access to multilingual language resources and facilitate research and progress in various natural language processing tasks, including machine translation.
University of Oxford Text Archive
The University of Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. The OTA also gives advice on the creation and use of these resources, and is involved in the development of standards and infrastructure for electronic language resources.
Wikipedia - Wikimedia Downloads
WordHoard
tagged literary texts
Yahoo Webscope Datasets
A reference library of interesting and scientifically useful datasets for non-commercial use by academics and other scientists.