Finding Data & Statistics: Text Corpora

UCSD-licensed databases

Most of our licensed databases:

  • limit the number of citations or articles that can be downloaded at once
  • prohibit systematic downloading (downloading of substantial collections)
  • prohibit automated downloading (use of scripts)
  • prohibit data mining directly on the vendor's servers
  • prohibit the redistribution of content (including cleaned data)

We can provide bulk downloads of the following ProQuest databases:

  • Congressional Record (part A)
  • History Vault: Vietnam War collection
  • History Vault: Immigration collection (part 1)
  • Chicago Tribune, 1849-1933
  • Los Angeles Times, 1881-1933
  • New York Times, 1851-1937
  • Wall Street Journal, 1889-1935
  • Washington Post, 1877-1935
  • San Francisco Chronicle, 1865-1922
  • American Periodicals Series
  • Periodicals Archive Online (series 1-5)

Most of our other licensed ProQuest content, including news databases, is available for analysis with R or Python through the ProQuest TDM Studio. Contact Data Science Librarian Stephanie Labou with any questions.

Other databases that support text analysis in some way:

Please email The Library with questions about any specific resource or database.

Freely available corpora & bulk data