Linguistics: Data & Statistics

Linguistic Corpora

Linguistic Corpora: A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language (corpus linguistics).

-David Crystal. A Dictionary of Linguistics and Phonetics, 2003

NOTE: We are working on obtaining access to the Linguistic Data Consortium (LDC). If you need specific data sets, please contact your librarian, Tamara Rhodes.

Linguistics Data & Corpora

About: a faith-based nonprofit organization committed to serving language communities worldwide as they build capacity for sustainable language development.

Instructions: Search for your topic, then use the "work type" filter to narrow the results to data sets.

Tools & Software

An open-source software library for advanced Natural Language Processing, written in the programming languages Python and Cython.



Doing Corpus Linguistics

A practical step-by-step introduction to corpus linguistics.

Practical Corpus Linguistics

Provides a practical and student-friendly guide to corpus linguistics that explains the nature of electronic data and how it can be collected and analyzed.

The Cambridge Handbook of English Corpus Linguistics

Surveys the breadth of corpus-based linguistic research on English, including chapters on collocations, phraseology, grammatical variation, historical change, and the description of registers and dialects.

The World Atlas of Language Structures

The World Atlas of Language Structures is a book and CD combination displaying the structural properties of the world's languages.

