Finding Data & Statistics: Ngrams

Sources with Ngrams

Bookworm
A powerful tool, maintained by Ben Schmidt and Erez Aiden, for visualizing trends in repositories of digitized texts. Interactive examples:
- OpenLibrary open domain books
- State of the Union addresses
- State of the Union in context
- RateMyProfessor teacher reviews by gender
- Baby Names US birth records
- The Simpsons
- Usenet
Google Books Ngram Viewer
HathiTrust Bookworm
Search for trends in public domain texts from the HathiTrust Digital Library.
JSTOR Text analysis support
Request access to the metadata and full-text of available JSTOR journals, books, research reports, and pamphlets for text analysis and digital humanities research.
Robots Reading Vogue
Project from Yale based on the Vogue Archive (ProQuest).
Corpus of Contemporary American English
The largest freely available corpus of English, and the only large and balanced corpus of American English.
GloWbE: Corpus of Global Web-Based English
Download 440 million words of full-text data for COCA, or 1.8 billion words for GloWbE. With this data, you will have the corpora on your computer, rather than having to use the web interface. The data comes in three formats: tables for relational databases, word/lemma/PoS (vertical format), or text (linear format).
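For readers who download the word/lemma/PoS (vertical) format, the following Python sketch shows one way n-gram counts might be built from it. It assumes one token per line with three tab-separated columns (word, lemma, part of speech); that layout, the sample rows, and the helper names read_tokens and count_ngrams are illustrative assumptions, so check the documentation shipped with the data before relying on it.

```python
from collections import Counter
from typing import Iterable, Iterator


def read_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield the lowercased word from each tab-separated word/lemma/PoS line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or malformed rows (assumed layout: word, lemma, PoS).
        if len(fields) < 3 or not fields[0]:
            continue
        yield fields[0].lower()


def count_ngrams(tokens: Iterable[str], n: int = 2) -> Counter:
    """Count every run of n consecutive tokens."""
    tokens = list(tokens)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sample rows in the assumed layout; real files may differ.
sample = [
    "The\tthe\tat\n",
    "cat\tcat\tnn1\n",
    "saw\tsee\tvvd\n",
    "the\tthe\tat\n",
    "cat\tcat\tnn1\n",
]
bigrams = count_ngrams(read_tokens(sample), n=2)
print(bigrams.most_common(1))
```

To run this on a downloaded file, pass an open file handle to read_tokens in place of the sample list. Note that count_ngrams holds all tokens in memory, so a very large file would need to be processed in chunks.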
ICWSM Datasets
- ICWSM 2011 Spinn3r Dataset
This dataset, provided by Spinn3r.com, is a continuation of the 2009 Spinn3r Dataset. It consists of over 386 million blog posts, news articles, classifieds, forum posts, and social media items collected between January 13th and February 14th.
- ICWSM 2009 Spinn3r Blog Dataset
The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008.
- JDPA Sentiment Corpus
The JDPA Corpus consists of user-generated content (blog posts) containing opinions about automobiles and digital cameras. The posts have been manually annotated for named, nominal, and pronominal mentions of entities, and each entity is marked with the aggregate sentiment expressed toward it in the document.
Note that these datasets are free, but researchers must contact ICWSM and sign a usage agreement to be granted access.