Linguistics: Data & Corpora

NOTE: We now have a membership to the Linguistic Data Consortium (LDC). If you'd like a login, sign up for an account at the link below, and you will be added as a user with access to our paid-for datasets. Contact us if you have any questions.

Linguistic Data Consortium (LDC)

Linguistics Data & Corpora

Bavarian Archive for Speech Signals Corpora
Corpus of Contemporary American English
The largest freely available corpus of English.
The Dataverse Project
Open source research data repository software.
Dictionary of Old English Web Corpus
English-corpora.org
This preliminary one-year license covers ONLINE USE of the corpora only. If you’d like to download the corpora, you’ll need to purchase them on your own. For each type of corpus, some free data is available.

To access our license, use the VPN and create an account on English-corpora.org. Once you create an account, choose University of California, San Diego as your institution.

The website is a little circuitous, so if you have any questions or access issues, reach out.
International Computer Archive for Modern and Medieval English
ICAME is an international organization of linguists and information scientists working with English machine-readable texts.

The aim of the organization is to collect and distribute information on English language material available for computer processing and on linguistic research completed or in progress on the material, to compile an archive of English text corpora in machine-readable form, and to make material available to research institutions.
Links for Corpus-based Linguistics
The annotated links on this site are mainly meant for linguists and language teachers who work with corpora.

Here, you’ll find links to any- & everything to do with the use of language corpora. The links are categorised and annotated to facilitate browsing/searching. Just click on a category in the left frame to see a list of links in this main window.
OLAC Language Resource Catalog
This catalog, developed by the Open Language Archives Community (OLAC), provides access to a wealth of information about thousands of languages, including details of text collections, audio recordings, dictionaries, and software, sourced from dozens of digital and traditional archives.
re3data.org
A global registry of research data repositories that covers research data repositories from different academic disciplines.

Linguistics: https://www.re3data.org/search?query=linguistics
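
To search re3data.org for other topics, the link above can be adapted by changing the search term. A minimal Python sketch follows; it assumes the `query` parameter seen in the link accepts other terms in the same way, and the function name `re3data_search_url` is hypothetical, used here only for illustration.

```python
from urllib.parse import urlencode

# Base search address, taken from the Linguistics link above.
RE3DATA_SEARCH = "https://www.re3data.org/search"


def re3data_search_url(term: str) -> str:
    """Build a re3data.org search URL for a term (hypothetical helper).

    Assumption: the site accepts any term in the `query` parameter,
    as it does for "linguistics" in the link above.
    """
    # urlencode escapes spaces and special characters in the term.
    return f"{RE3DATA_SEARCH}?{urlencode({'query': term})}"


if __name__ == "__main__":
    for term in ("linguistics", "speech corpus"):
        print(re3data_search_url(term))
```

Paste a printed URL into a browser to see the matching repositories.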
Santa Barbara Corpus of Spoken American English
The Santa Barbara Corpus includes transcriptions, audio, and timestamps which correlate transcription and audio at the level of individual intonation units.
SIL International Language & Culture Archives
Search and browse over 40,000 resources dating from 1935 to the present that describe, document, and/or communicate in the languages and cultures SIL serves.

About: SIL is a faith-based nonprofit organization committed to serving language communities worldwide as they build capacity for sustainable language development.

Instructions: search for your topic, then use the filters to select "work type" to get data sets.

Statistical Natural Language Processing - Annotated List of Resources
Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources
Survey of English Usage
The Survey of English Usage carries out research in English language Corpus Linguistics, and was the first centre in Europe to undertake this type of research. From its inception in 1959, the Survey collected samples of naturally-occurring language for the purposes of description and analysis.
TIME Magazine Corpus
The TIME corpus is based on 100 million words of text in about 275,000 articles from TIME magazine from 1923 to 2006, and it serves as a great resource for examining changes in American English over this period.
TROLLing Dataverse
The Tromsø Repository of Language and Linguistics (TROLLing) is designed as an archive of linguistic data and statistical code. The archive is open access, which means that all information is available to everyone.
UCLA Phonetics Lab Archive
Recordings of hundreds of languages from around the world, providing source materials for phonetic and phonological research, of value to scholars, speakers of the languages, and language learners alike.

The materials on this site comprise audio recordings illustrating phonetic structures from over 200 languages with phonetic transcriptions, plus scans of original field notes where relevant.

ProQuest TDM Studio
TDM Studio is the text and data mining interface for more than 200 licensed ProQuest content products, including government, archival, dissertation, and news databases. Content can be analyzed with R or Python in the workbench dashboard, or explored and visualized in the visualization dashboard without any coding. Contact Stephanie Labou (slabou@ucsd.edu) for additional assistance.

To create an account:
1. Go to https://tdmstudio.proquest.com
2. Click the “Create an account” button.
3. Use your UCSD email address to create your account.

Handbooks and Guides

Doing Corpus Linguistics

A practical step-by-step introduction to corpus linguistics.

Practical Corpus Linguistics

Provides a practical and student-friendly guide to corpus linguistics that explains the nature of electronic data and how it can be collected and analyzed.

The Cambridge Handbook of English Corpus Linguistics

Surveys the breadth of corpus-based linguistic research on English, including chapters on collocations, phraseology, grammatical variation, historical change, and the description of registers and dialects.

The World Atlas of Language Structures

The World Atlas of Language Structures is a book and CD combination displaying the structural properties of the world's languages.