Finding Data & Statistics: Text Corpora

UCSD-licensed databases

Most of our licensed databases:

  • limit the number of citations or articles that can be downloaded at once
  • prohibit systematic downloading (downloading of substantial collections)
  • prohibit automated downloading (use of scripts)
  • prohibit data mining directly on the vendor's servers
  • prohibit the redistribution of content (including cleaned data)

We can provide bulk downloads of the following ProQuest databases:

  • Congressional Record (part A)
  • History Vault: Vietnam War collection
  • History Vault: Immigration collection (part 1)
  • Chicago Tribune, 1849-1933
  • Los Angeles Times, 1881-1933
  • New York Times, 1851-1937
  • Wall Street Journal, 1889-1935
  • Washington Post, 1877-1935
  • San Francisco Chronicle, 1865-1922
  • American Periodicals Series
  • Periodicals Archive Online (series 1-5) 

Most of our other licensed ProQuest content, including news databases, can be analyzed with R or Python in ProQuest TDM Studio. Contact Data Science Librarian Stephanie Labou with any questions.

Other databases that support text analysis in some way:

Please email The Library with questions about any specific resource or database.

Freely available corpora & bulk data