Finding Secondary Datasets for Social Science Research

Primary vs. Secondary Data

Primary data is information collected firsthand by researchers specifically for their current study. This might involve conducting surveys, interviews, experiments, or direct observations. Primary data collection allows researchers to tailor their data gathering precisely to their research questions, but it can be time-consuming, expensive, and logistically challenging.

Secondary data, on the other hand, refers to data that has already been collected by other researchers or organizations for purposes that may or may not be related to the current research question. This can include government statistics, survey data from other studies, administrative records, or data from previous research projects.

Data Types in Social Science Research

Before diving into specific data sources, it’s crucial to understand the different types of data structures commonly used in social science research. Three primary types are cross-sectional, repeated cross-sectional, and longitudinal data. Each offers unique insights and has its own strengths and limitations.

1. Cross-sectional Data

Cross-sectional data provides a snapshot of a population at a single point in time.

  • Characteristics:
    • Data collected from many different individuals or units at the same time
    • Offers a broad view of a population’s characteristics at a specific moment
    • Useful for studying prevalence and associations between variables
  • Example: A one-time survey of voter preferences before an election
  • Limitations:
    • Cannot track changes over time
    • Difficult to establish causality

2. Repeated Cross-sectional Data

Repeated cross-sectional data involves collecting data from different samples of the same population at multiple points in time.

  • Characteristics:
    • Series of cross-sectional surveys conducted over time
    • Different individuals in each survey, but same population
    • Allows for analysis of trends and changes at the population level
  • Example: The General Social Survey (GSS), conducted regularly since 1972
  • Advantages:
    • Can track societal changes (trends) over time
    • Avoids issues of panel attrition
  • Limitations:
    • Cannot track individual-level changes
    • May be affected by cohort effects

3. Longitudinal Data

Longitudinal data follows the same individuals or units over an extended period, collecting data at multiple time points.

  • Characteristics:
    • Tracks the same sample over time
    • Allows for analysis of individual-level changes and trajectories
    • Can help establish causal relationships
  • Types:
    • Panel studies: Same individuals interviewed repeatedly
    • Cohort studies: Follow a group with a common characteristic (e.g., birth year)
  • Example: The Panel Study of Income Dynamics, following families since 1968
  • Advantages:
    • Can track individual changes over time
    • Stronger for inferring causality
    • Allows for studying life course trajectories
  • Limitations:
    • More expensive and time-consuming to conduct
    • Can suffer from attrition (participants dropping out over time)
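
To make the difference between these structures concrete, here is a minimal sketch using made-up records. The respondent ids, years, and income values are hypothetical and do not come from any dataset named in this guide. With repeated cross-sections, only population-level averages can be compared across years. With panel data, the same id appears in each wave, so within-person change can be computed.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (respondent_id, survey_year, income_in_thousands).
# Repeated cross-section: a fresh sample each year, so ids never repeat.
repeated_cross_section = [
    ("a1", 2010, 40), ("a2", 2010, 50),
    ("b1", 2012, 45), ("b2", 2012, 55),
]

# Panel (longitudinal): the same respondents are re-interviewed in each wave.
panel = [
    ("p1", 2010, 40), ("p2", 2010, 50),
    ("p1", 2012, 48), ("p2", 2012, 52),
]


def mean_by_year(records):
    """Population-level trend: average income per survey year."""
    by_year = defaultdict(list)
    for _rid, year, income in records:
        by_year[year].append(income)
    return {year: mean(values) for year, values in sorted(by_year.items())}


def change_by_person(records, start, end):
    """Individual-level change, defined only for ids observed in both years."""
    income = {(rid, year): value for rid, year, value in records}
    ids = {rid for rid, _year, _value in records}
    return {
        rid: income[(rid, end)] - income[(rid, start)]
        for rid in sorted(ids)
        if (rid, start) in income and (rid, end) in income
    }


trend = mean_by_year(repeated_cross_section)           # works for either structure
panel_change = change_by_person(panel, 2010, 2012)     # one entry per panel member
rcs_change = change_by_person(repeated_cross_section, 2010, 2012)  # empty: no repeated ids
```

The last line returns an empty result because no respondent appears in both years, which is the limitation noted above for repeated cross-sectional data.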

Sources

1. ICPSR (Inter-university Consortium for Political and Social Research)

ICPSR is one of the largest data archives for social science research:

  • Access to over 19,000 datasets
  • Multiple search approaches, including variable-specific searches and topic browsing
  • A searchable bibliography of research articles using their data files
  • Free registration required for full access

2. IPUMS (Integrated Public Use Microdata Series)

IPUMS provides census and survey data from around the world, integrated across time and space. IPUMS integration and documentation make it easy to study change, conduct comparative research, merge information across data types, and analyze individuals within family and community contexts. Data and services are available free of charge.

  • IPUMS USA: Population data (US Census and American Community Survey)
  • IPUMS CPS: Population data (Current Population Survey)
  • IPUMS NAPP:
  • IPUMS Global Health: Health survey data from around the world, including harmonized data
  • IPUMS IHGIS: Tabular and GIS data from population, housing, and agricultural censuses around the world
  • IPUMS Time Use: Historical and contemporary time use data from 1930 to the present
  • IPUMS Higher Ed: Education data (NSF STEM data – SESTAT)
  • IPUMS Health Surveys: Health data (MEPS and NHIS)
  • IPUMS CDOH: The Contextual Determinants of Health (CDOH)

3. The General Social Survey (GSS)

The General Social Survey (GSS) is a nationally representative survey of adults in the United States conducted since 1972. The GSS collects data on contemporary American society in order to monitor and explain trends in opinions, attitudes and behaviors.

  • Has been conducted since 1972
  • Gathers data on people’s opinions on social, economic, and political issues
  • Offers repeated cross-sectional data (not panel data)
  • The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.

4. California Health Interview Survey (CHIS)

CHIS is the largest state health survey in the U.S. CHIS interviews more than 20,000 households on a wide range of health matters, from use of and access to health care, to health conditions and behaviors, to a range of topics that influence health: public program participation, housing, income and employment, climate change, food, gun violence, adverse childhood experiences, and much more:

  • Data on various racial and ethnic groups in California
  • Surveys conducted in multiple languages: in addition to English, CHIS is conducted in Spanish, Chinese (Mandarin and Cantonese dialects), Korean, Vietnamese, and Tagalog.
  • Separate interviews are conducted for adults (age 18 and older), adolescents (ages 12–17) and children (birth through 11 years of age).

5. Youth Risk Behavior Surveillance System (YRBSS)

The Youth Risk Behavior Survey (YRBS) collects data from students in grades 9–12 on key health behaviors and experiences that contribute to the leading causes of death and illness during both adolescence and adulthood.

  • Provides repeated cross-sectional data (not panel data)
  • Offers both national and state/district level data

6. Add Health (National Longitudinal Study of Adolescent to Adult Health)

The National Longitudinal Study of Adolescent to Adult Health (Add Health) is a longitudinal study of a nationally representative sample of over 20,000 adolescents who were in grades 7-12 during the 1994-95 school year, and have been followed for five waves to date, most recently in 2016-18. Ancillary studies have added even more data over the years.

  • Over the years, Add Health has collected rich demographic, social, familial, socioeconomic, behavioral, psychosocial, cognitive, and health survey data from participants and their parents; a vast array of contextual data from participants’ schools, neighborhoods, and geographies of residence; and in-home physical and biological data from participants, including genetic markers, blood-based assays, anthropometric measures, and medications.
  • Panel data
  • Offers a nationally representative sample

7. Panel Study of Income Dynamics

The PSID began in 1968 with a nationally representative sample of over 18,000 individuals living in 5,000 families in the United States.

  • Information on these individuals and their descendants has been collected continuously, including data covering employment, income, wealth, expenditures, health, marriage, childbearing, child development, philanthropy, education, and numerous other topics.
  • Panel data
  • Provides continuous data on original participants and their descendants

8. Future of Families and Child Wellbeing Study

The Future of Families and Child Wellbeing Study (FFCWS) is based on a stratified, multistage sample of about 5,000 children born in large U.S. cities (population over 200,000) between 1998 and 2000, where births to unmarried mothers were oversampled by a ratio of 3 to 1.

  • This sampling strategy resulted in the inclusion of a large number of Black, Hispanic, and low-income families.
  • Mothers were interviewed shortly after birth and fathers were interviewed at the hospital or by phone.
  • Topics: Beginning with the baseline interviews in 1998-2000, the core study was originally designed to address four questions of great interest to researchers and policy makers: (1) What are the conditions and capabilities of unmarried parents, especially fathers? (2) What is the nature of the relationships between unmarried parents? (3) How do children born into these families fare? (4) How do policies and environmental conditions affect families and children?
  • Panel data

9. Health and Retirement Study (HRS)

The Health and Retirement Study (HRS) is a longitudinal household survey conducted by the Institute for Social Research at the University of Michigan. The multidisciplinary data provide researchers the opportunity to investigate many different aspects related to population aging in the United States.

  • Through its unique and in-depth interviews, the HRS provides an invaluable and growing body of multidisciplinary data that researchers can use to address important questions about the challenges and opportunities of aging.
  • Panel data

10. The Generations Study

The Generations study is the first long-term, five-year study to examine health and well-being across three generations of lesbians, gay men, and bisexuals (LGB). The study explores identity, stress, health outcomes, and health care and services utilization among LGB adults in three generations who came of age in different historical contexts.

  • A quantitative survey procedure identifies Black, Latino, and White LGB individuals in the United States. Respondents participate in the study over a 5-year period to detect changes in the social environment as people age.
  • Panel data

11. Pew Research Center Datasets

Pew Research Center releases data from its phone surveys as well as data from polls conducted on its online, nationally representative American Trends Panel (ATP), the main source of most of its U.S.-based survey research. The Center's website lists the available ATP datasets and the topics they cover. To locate datasets that are available for download, go to the “Tools & Resources” tab at the top of the website, which provides links to all available datasets, with surveys organized by primary research area and listed in reverse chronological order.

  • Cross-sectional data
  • American Trends Panel – Repeated Cross-sectional data

Steps to Explore Secondary Data

  1. Download the data in your preferred format (e.g., Stata, SPSS, CSV).
    • Choose the appropriate file format for your statistical software (e.g., .dta for Stata, .sav for SPSS, .csv for general use). If the data isn’t available in your preferred format, download a compatible format and convert it using your statistical software or a file conversion tool.
  2. Explore the codebook and variable list
    • The codebook is your roadmap to understanding the dataset.
    • Create a list of variables relevant to your research questions.
    • Pay attention to:
      • Sample size and any subsamples
      • Time period covered
      • Geographic areas included
      • Any changes in variable definitions across waves (for longitudinal or repeated cross-sectional data)
    • If a codebook is not provided, consider generating one with your statistical software (for example, Stata's codebook and describe commands).
  3. Read published papers using the dataset
    • Search for academic papers that have used the dataset you’re interested in.
    • This can provide insights into:
      • The strengths and limitations of the data
      • Common analytical approaches
      • Potential pitfalls or challenges in working with the data
      • Ideas for your own research questions or analytical strategies
  4. Examine sampling design and weights
    • Understand how the sample was selected and any stratification used.
    • Check if sampling weights are provided and when they should be applied.
  5. Assess data quality
    • Look for any notes on data quality issues, such as high non-response rates for certain items.
    • Check for outliers, missing data patterns, and any data inconsistencies.
    • Understand how missing data is coded in the dataset.
  6. Prepare data for analysis
    • Merge files if necessary
    • Recode variables as needed
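
Steps 4 through 6 can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not a recipe for any particular dataset: the column names, the -9 missing-value code, and the weight variable are all hypothetical, and the real values must come from the dataset's codebook.

```python
import csv
import io

# Hypothetical extracts standing in for two downloaded files.
# Column names, the -9 sentinel, and the weight variable are assumptions.
RESPONDENTS_CSV = """id,age,income,weight
1,34,52000,1.5
2,51,-9,0.5
3,29,38000,1.0
"""
HOUSEHOLDS_CSV = """id,region
1,West
2,South
3,West
"""
MISSING_CODES = {"-9"}  # assumed code for a refused or not-ascertained answer


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def recode_missing(rows, column):
    """Step 5: turn sentinel codes into None so they are not treated as values."""
    for row in rows:
        row[column] = None if row[column] in MISSING_CODES else float(row[column])
    return rows


def weighted_mean(rows, column, weight_column):
    """Step 4: weighted mean over rows with a non-missing value."""
    pairs = [(r[column], float(r[weight_column])) for r in rows if r[column] is not None]
    total_weight = sum(w for _v, w in pairs)
    return sum(v * w for v, w in pairs) / total_weight


def merge_on_id(left, right, key="id"):
    """Step 6: one-to-one merge of two files on a shared identifier."""
    lookup = {r[key]: r for r in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


respondents = recode_missing(read_rows(RESPONDENTS_CSV), "income")
merged = merge_on_id(respondents, read_rows(HOUSEHOLDS_CSV))
n_missing = sum(1 for r in merged if r["income"] is None)
avg_income = weighted_mean(merged, "income", "weight")
```

Leaving the -9 code in place would pull the average down, which is why the missing-data check comes before any estimate. The weighted mean here gives a point estimate only; standard errors for stratified or clustered samples generally require survey-aware procedures in your statistical software.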

September 5, 2024