Finding Data & Statistics: Data Terminology

What is Data?

Common Data Terms

Data vs. Statistics

Data are the raw ingredients from which statistics are created. Statistics are useful when you just need a few numbers to support an argument (e.g., in 2003, 98.2% of American households had a television set, according to the Statistical Abstract of the United States). Statistics are usually presented in tables. Statistical analysis can be performed on data to show relationships among the variables collected. Through secondary data analysis, many different researchers can reuse the same data set for different purposes.

Aggregate/Macro Data vs. Microdata

Aggregate or Macro Data are higher-level data that have been compiled from smaller units of data. For example, the Census data that you find on the Census website have been aggregated to preserve the confidentiality of individual respondents. Microdata contain individual cases, usually individual people, or in the case of Census data, individual households. The Integrated Public Use Microdata Series (IPUMS) for the Census provides access to the actual survey data from the Census, but removes information that would identify individuals.
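To make the distinction concrete, here is a minimal Python sketch that rolls microdata up into aggregate data. The household records and the field names `state` and `has_tv` are invented for illustration; they are not taken from any Census or IPUMS file.

```python
from collections import defaultdict

# Hypothetical microdata: one record per household (invented values).
households = [
    {"state": "TN", "has_tv": True},
    {"state": "TN", "has_tv": True},
    {"state": "TN", "has_tv": False},
    {"state": "KY", "has_tv": True},
    {"state": "KY", "has_tv": True},
]

def aggregate_by_state(records):
    """Collapse household-level records into one summary row per state."""
    counts = defaultdict(lambda: {"households": 0, "with_tv": 0})
    for rec in records:
        row = counts[rec["state"]]
        row["households"] += 1
        row["with_tv"] += int(rec["has_tv"])
    # Aggregate data: individual households are no longer visible.
    return {
        state: {
            **row,
            "pct_with_tv": round(100 * row["with_tv"] / row["households"], 1),
        }
        for state, row in counts.items()
    }

aggregate = aggregate_by_state(households)
```

Once the records are collapsed, only the per-state counts and percentages remain. The individual households cannot be recovered from the result, which is the property that lets aggregation protect confidentiality.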

Data Sets, Studies, and Series

In data archives like ICPSR, a data set or study is made up of the raw data file and any related files, usually the codebook and setup files. The codebook is your guide to making sense of the raw data. For survey data, the codebook usually contains the actual questionnaire and the values for the responses to each question. The setup files help you read the raw data into statistical software such as SAS, SPSS, or Stata; without them, the data will not display properly.
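As a rough illustration of what a codebook does, the Python sketch below decodes numeric response codes into readable labels. The variables, codes, and labels are hypothetical and do not come from any ICPSR study; real codebooks are documents, not Python dictionaries.

```python
# Hypothetical codebook: maps each variable's numeric codes to labels.
codebook = {
    "sex": {1: "Male", 2: "Female"},
    "health": {1: "Excellent", 2: "Good", 3: "Fair", 4: "Poor", 9: "Refused"},
}

# Hypothetical raw records as they might appear in a data file.
raw_rows = [
    {"sex": 2, "health": 1},
    {"sex": 1, "health": 9},
]

def decode(row, codebook):
    """Replace coded values with codebook labels; leave unknown codes as-is."""
    return {
        var: codebook.get(var, {}).get(code, code)
        for var, code in row.items()
    }

decoded = [decode(row, codebook) for row in raw_rows]
```

Without the codebook, a value such as 9 in the `health` column could easily be mistaken for a real measurement rather than a refusal.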

ICPSR uses the term series to describe collections of studies that have been repeated over time. For example, the National Health Interview Survey is conducted annually. In the ICPSR archive, you will find a description of the series that provides an overview. You will also find individual descriptions of each study (e.g., National Health Interview Survey, 2004). The study number in ICPSR refers to the individual survey.

Types of Data

Cross-Sectional describes data that are collected only once, at a single point in time.

Time Series data track the same variable over time. The National Health Interview Survey is an example of time series data because the questions generally remain the same over time, but the individual respondents vary.

Longitudinal Studies describe surveys that are conducted repeatedly, in which the same group of respondents is surveyed each time. This allows for examining changes over the life course. The Project on Human Development in Chicago Neighborhoods (PHDCN) Series contains a longitudinal component that tracks changes in the lives of individuals over time through interviews.
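The three types differ in how many times data are collected and whether the same respondents return. The Python sketch below encodes that distinction as a rough heuristic over (respondent, wave) pairs. The function name `classify_design` and the sample IDs are hypothetical, and the rule is simplified: real longitudinal studies usually lose some respondents between waves, which this check would not tolerate.

```python
def classify_design(observations):
    """Classify a study from (respondent_id, wave) pairs.

    Simplified rule, for illustration only:
    - one wave                          -> cross-sectional
    - several waves, same respondents   -> longitudinal
    - several waves, respondents differ -> time series
    """
    by_wave = {}
    for respondent_id, wave in observations:
        by_wave.setdefault(wave, set()).add(respondent_id)
    if len(by_wave) <= 1:
        return "cross-sectional"
    groups = list(by_wave.values())
    if all(group == groups[0] for group in groups):
        return "longitudinal"
    return "time series"

# Same two respondents interviewed in both waves.
panel = [("a", 1), ("b", 1), ("a", 2), ("b", 2)]
design = classify_design(panel)
```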

(Originally from Sue Erickson at Vanderbilt University)

Be a Data Detective

To find the "right data" - the data you need for your homework, project, or research - you'll want to think like a data detective, using the steps below.

1) Interview eyewitnesses

For publications in the same research area, what data are those researchers using? The data source should be cited in the references/bibliography or at least named in the methods section.

If most of the scholarly literature about the research topic refers to the same handful of sources, that's a hint that those sources are the best available. If researchers are collecting data themselves (surveys, etc.) that may mean there is no publicly available data source for that particular topic.

2) Gather the facts

Be able to specify these data facets:

What - What is your variable of interest? e.g., unemployment rate, educational attainment, voting record, electricity production, etc.
Who - Is your unit of observation an individual person? A household? A representative sample of a city or state or country?
When - What is the time scale of your data? Weekly, quarterly, yearly? Do you need multiple years, or a single point in time?
Where - What is the spatial area - city, state, country? Are you doing a multi-place comparison?
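
One way to keep track of the four facets above is to write them down as a small record and check which are still blank before you start searching. This is only a sketch; the class name `DataNeed` and the example values are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class DataNeed:
    what: str   # variable of interest
    who: str    # unit of observation
    when: str   # time scale and span
    where: str  # spatial area

    def missing_facets(self):
        """Return the names of facets that are still blank."""
        return [name for name, value in vars(self).items() if not value.strip()]

# The time scale has not been decided yet in this example.
need = DataNeed(what="unemployment rate", who="county", when="", where="Tennessee")
still_needed = need.missing_facets()
```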

3) Identify suspects

Could it have been collected by:

Government agency
- Demographics, large-scale socio-economic variables, anything a federal or state agency might be set up to monitor
- Equally true for the US and other countries (although countries vary in data openness/availability)
A nonprofit or nongovernmental organization
- Public opinion polls
- Topic of local/regional interest
Private business or industry group
- Financial/business data
- Also public opinion polls, consumer surveys
Academic researchers
- Very niche/specific topics where data are not publicly available elsewhere

4) Accept the occasional "cold case"

If it seems like the data should exist - especially if it is something a government agency would collect - it probably does exist! But that doesn't mean it is:

Easy to find
In a user-friendly format
At the spatial or temporal scale you want
Inclusive of your specific variables of interest
Accessible without restrictions or an application process

You may need to change your research question if your ideal data just isn't available. For instance, if you want quarterly data about your topic, would yearly data be sufficient? If you want to find data at zip code level, would city-level be close enough?
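
The trade-off runs in one direction: finer-grained data can be rolled up to a coarser scale, but coarse data cannot be split back into finer pieces. The Python sketch below shows the roll-up from quarterly to yearly using invented values. Summing is an assumption that suits counts or totals; a rate or an index would more likely need an average.

```python
from collections import defaultdict

# Hypothetical quarterly values, keyed by (year, quarter).
quarterly = {
    (2003, 1): 10.0, (2003, 2): 12.0, (2003, 3): 11.0, (2003, 4): 13.0,
    (2004, 1): 14.0, (2004, 2): 15.0,
}

def yearly_totals(quarterly_values):
    """Sum quarterly values by year, keeping only years with all four quarters."""
    by_year = defaultdict(dict)
    for (year, quarter), value in quarterly_values.items():
        by_year[year][quarter] = value
    return {
        year: sum(values.values())
        for year, values in by_year.items()
        if set(values) == {1, 2, 3, 4}
    }

yearly = yearly_totals(quarterly)
```

In this example 2004 is dropped because only two of its quarters are present, so a partial year is not reported as if it were a full one.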