NOTE: We now have a membership to the Linguistic Data Consortium (LDC). If you'd like a login, sign up for an account at the link below and you will be added as a user with access to our paid-for datasets. Contact Tamara Rhodes at email@example.com if you have any questions.
This preliminary one-year license only includes ONLINE USE of the corpora. If you’d like to download the corpora, you’ll need to purchase them on your own. For each type of corpus, there is some free data available.
To access our license, use the VPN and create an account on English-corpora.org. Once you create an account, choose University of California, San Diego as your institution.
The website is a little circuitous, so if you have any questions or access issues, reach out.
The aim of the organization is to collect and distribute information on English language material available for computer processing and on linguistic research completed or in progress on the material, to compile an archive of English text corpora in machine-readable form, and to make material available to research institutions.
Here, you’ll find links to any- & everything to do with the use of language corpora. The links are categorised and annotated to facilitate browsing/searching. Just click on a category in the left frame to see a list of links in this main window.
This catalog, developed by the Open Language Archives Community (OLAC), provides access to a wealth of information about thousands of languages, including details of text collections, audio recordings, dictionaries, and software, sourced from dozens of digital and traditional archives.
The Survey of English Usage carries out research in English language Corpus Linguistics, and was the first centre in Europe to undertake this type of research. From its inception in 1959, the Survey collected samples of naturally-occurring language for the purposes of description and analysis.
The TIME corpus is based on 100 million words of text in about 275,000 articles from TIME magazine from 1923 to 2006, and it serves as a great resource for examining changes in American English during this period.
The Tromsø Repository of Language and Linguistics (TROLLing) is designed as an archive of linguistic data and statistical code. The archive is open access, which means that all information is available to everyone.
Recordings of hundreds of languages from around the world, providing source materials for phonetic and phonological research, of value to scholars, speakers of the languages, and language learners alike.
NLTK is a platform for building Python programs to work with human language data. It provides interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. NLTK is available for Windows, Mac OS X, and Linux, and is a free, open source, community-driven project.
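As a rough illustration of the tokenization and stemming tools mentioned above, here is a minimal sketch that counts word stems in a short text. It assumes NLTK is installed (for example with `pip install nltk`); the function name `stem_frequencies` and the sample sentence are made up for this example. It uses only `RegexpTokenizer`, `PorterStemmer`, and `FreqDist`, which do not require downloading any additional NLTK data.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer


def stem_frequencies(text):
    """Return a frequency distribution of Porter stems in `text`.

    Hypothetical helper for illustration; not part of NLTK itself.
    """
    # Split on runs of word characters, so punctuation is discarded.
    tokenizer = RegexpTokenizer(r"\w+")
    stemmer = PorterStemmer()
    tokens = tokenizer.tokenize(text.lower())
    # Reduce each token to its stem so inflected forms are counted together.
    stems = [stemmer.stem(token) for token in tokens]
    return FreqDist(stems)


if __name__ == "__main__":
    sample = "She runs daily. He is running now. They run together."
    freq = stem_frequencies(sample)
    # Print the three most frequent stems with their counts.
    for stem, count in freq.most_common(3):
        print(stem, count)
```

A regular-expression tokenizer is used here only to keep the sketch free of extra downloads; NLTK's other tokenizers and its bundled corpora typically need a one-time `nltk.download(...)` step first.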
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
A large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of more than 40 authors. Includes 141 maps with accompanying texts on diverse features (such as vowel inventory size, noun-genitive order, passive constructions, and "hand"/"arm" polysemy). Each map shows between 120 and 1,370 languages, with each language represented by a symbol and different symbols showing different values of the feature. Altogether, 2,650 languages are shown on the maps.
Surveys the breadth of corpus-based linguistic research on English, including chapters on collocations, phraseology, grammatical variation, historical change, and the description of registers and dialects.