Sources with Ngrams
Search for trends in the public domain texts from HathiTrust Digital Library
JSTOR for Research
Data for Research is a free service for researchers wishing to analyze content on JSTOR through a variety of lenses and perspectives. If you require more than 1,000 documents or a type of data not available through the interactive portion of the site, please contact us at: email@example.com
Robots Reading Vogue
Project from Yale based on Vogue Archive (ProQuest)
Corpus of Contemporary American English
The largest freely available corpus of English, and the only large and balanced corpus of American English.
GloWbE: Corpus of Global Web-Based English
Download 440 million words of full-text data for COCA, or 1.8 billion words for GloWbE. With this data, you will have the corpora on your computer, rather than having to use the web interface. The data comes in three formats: tables for relational databases, word/lemma/PoS (vertical format), or text (linear format).
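As a rough illustration of working with the word/lemma/PoS (vertical) format once downloaded, the sketch below counts lemma frequencies. It assumes one token per line with exactly three tab-separated columns (word, lemma, PoS); the actual column layout of the downloaded files may differ, so check the documentation shipped with the data. The names `Token`, `parse_vertical`, and `lemma_counts` are hypothetical helpers, and the sample lines are made up for the example.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    pos: str


def parse_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield one Token per line, assuming three tab-separated columns.

    Lines that do not have exactly three fields (blank lines, headers,
    or a different layout than assumed here) are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        yield Token(*fields)


def lemma_counts(tokens: Iterable[Token]) -> Counter:
    """Count how often each lemma occurs, ignoring case."""
    return Counter(token.lemma.lower() for token in tokens)


# Made-up sample in the assumed layout; a real file would be opened with
# open(path, encoding="utf-8") and passed to parse_vertical instead.
sample = [
    "The\tthe\tat\n",
    "corpora\tcorpus\tnn2\n",
    "are\tbe\tvbr\n",
    "large\tlarge\tjj\n",
    "\n",
    "corpora\tcorpus\tnn2\n",
]

tokens = list(parse_vertical(sample))
counts = lemma_counts(tokens)
```

Because `parse_vertical` is a generator, the same approach can stream a large file line by line rather than loading it into memory at once.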
- ICWSM 2011 Spinn3r Dataset
This dataset, provided by Spinn3r.com, is a continuation of the 2009 Spinn3r Dataset. It consists of over 386 million blog posts, news articles, classifieds, forum posts, and social media items published between January 13th and February 14th, 2011.
- ICWSM 2009 Spinn3r Blog Dataset
The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008.
- JDPA Sentiment Corpus
The JDPA Corpus consists of user-generated content (blog posts) containing opinions about automobiles and digital cameras. The documents have been manually annotated for named, nominal, and pronominal mentions of entities. Each entity is marked with the aggregate sentiment expressed toward it in the document.
Note that these datasets are free, but researchers must contact the ICWSM and sign a usage agreement to be granted access.