Beginner’s Guide to Data Codebooks for Data Analysis

Understanding Variables, Values, and Labels

Before diving into codebook creation, you need to understand the fundamental terminology used in data documentation. These terms are often confused, but they have distinct meanings.

Variable Names vs. Variable Labels

Variable Name is the technical identifier used in your data file: what appears as the column header in your dataset. Variable names must follow certain rules:

  • No spaces (use underscores instead: income_annual not income annual)
  • No special characters except underscore
  • Often limited in length (for example, 32 characters in Stata and SAS, 64 in SPSS)
  • Case-sensitive in some software
  • Should be concise but meaningful

Variable Label is the human-readable description of what the variable measures. It can be longer, contain spaces, and use full sentences.

Variable Name | Variable Label
age | Age at baseline interview
inc_hh_yr | Total annual household income
phq9_tot | PHQ-9 depression symptom scale total score
dx_mh_any | Any mental health diagnosis (lifetime)
svc_hrs_2023 | Total service hours received in 2023

Good variable names are:

  • Short but meaningful (age not a, income not var37)
  • Consistent across your dataset (if you use _tot for totals, use it everywhere)
  • Free of abbreviations that only you understand (tx could mean “treatment” or “Texas”)
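
The naming rules in this section can be checked automatically. The following is a minimal Python sketch, not a standard tool: the function name check_variable_name and the 32-character default are assumptions for illustration, and the real limits depend on the software you use.

```python
import re

# Letters, digits, and underscores only; the name must not start with a digit.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def check_variable_name(name, max_length=32):
    """Return a list of rule violations for a proposed variable name."""
    problems = []
    if " " in name:
        problems.append("contains spaces")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("not a plain identifier (letters, digits, underscore; no leading digit)")
    if len(name) > max_length:
        problems.append(f"longer than {max_length} characters")
    return problems


print(check_variable_name("income_annual"))  # empty list: no violations
print(check_variable_name("income annual"))  # flagged because of the space
```

A check like this covers only the mechanical rules. Whether a name is meaningful and consistent still needs a human reviewer.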

Values vs. Value Labels

Values are the actual data stored in your dataset: the numbers or codes you see when you open the data file.

Value Labels are the human-readable meanings attached to those values.

This distinction is crucial for categorical variables:

Example 1: Gender Variable

Variable Name: gender
Variable Label: Self-reported gender identity

Raw data (what you see in the file):
1, 2, 1, 3, 1, 2, 2, 1...

Value Labels (what the numbers mean):
1 = Male
2 = Female  
3 = Non-binary
4 = Other
-99 = Refused to answer

Example 2: Education Variable

Variable Name: edu_level
Variable Label: Highest educational attainment

Raw data:
1, 3, 2, 5, 1, 2, 4, 3...

Value Labels:
1 = Less than high school
2 = High school diploma or GED
3 = Some college, no degree
4 = Associate degree
5 = Bachelor's degree
6 = Graduate or professional degree
-99 = Refused
-88 = Don't know
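
In code, value labels are naturally represented as a lookup table from stored values to their meanings. The sketch below uses a plain Python dictionary for the edu_level example. The helper name label_values is hypothetical, and marking unknown codes as "UNDOCUMENTED CODE" is one possible design choice rather than a convention.

```python
# Value labels for edu_level, copied from Example 2.
EDU_LEVEL_LABELS = {
    1: "Less than high school",
    2: "High school diploma or GED",
    3: "Some college, no degree",
    4: "Associate degree",
    5: "Bachelor's degree",
    6: "Graduate or professional degree",
    -99: "Refused",
    -88: "Don't know",
}


def label_values(values, labels):
    """Translate stored codes into labels; codes without a label stay visible."""
    return [labels.get(v, f"UNDOCUMENTED CODE {v}") for v in values]


raw = [1, 3, 2, 5, 1, 2, 4, 3]  # the raw data shown in Example 2
print(label_values(raw, EDU_LEVEL_LABELS))
```

Surfacing unlabeled codes instead of silently dropping them makes gaps in the codebook easy to spot.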

Why This Matters

Without value labels, your data is nearly impossible to interpret. Consider this scenario:

You receive a dataset with a variable called outcome containing values 0, 1, and 2. Without value labels, you cannot know:

  • Does 0 = failure and 1 = success, or vice versa?
  • What does 2 mean?
  • Are higher numbers better or worse?

The codebook must document both the values AND their labels.

Numeric Variables: A Special Case

For continuous numeric variables (like age, income, or test scores), the values ARE the meaningful data. They don’t need separate labels:

Variable Name: age
Variable Label: Age in years at baseline interview
Type: Integer
Valid Range: 18-89

The values are:
18, 42, 67, 23, 55, 34...

These are actual ages, not codes that need labels.

However, even numeric variables may have special codes:

Variable Name: income_annual  
Variable Label: Total household income in past 12 months
Type: Float
Valid Range: 0-250000
Units: US Dollars

Most values are actual dollar amounts:
35000, 48200, 67500, 22000...

But special codes indicate missing data:
-99 = Refused to answer
-88 = Don't know
-77 = Question not asked (skip pattern)
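
Special codes like these must be removed before any calculation, otherwise they are treated as real dollar amounts. The sketch below shows the effect on a mean. The sample column is hypothetical: it contains the four incomes listed above plus two special codes.

```python
from statistics import mean

# Missing-data codes for income_annual, as documented above.
MISSING_CODES = {-99, -88, -77}


def valid_values(values, missing_codes=MISSING_CODES):
    """Drop special codes so that only real measurements remain."""
    return [v for v in values if v not in missing_codes]


# Hypothetical raw column: four real incomes plus two special codes.
income_annual = [35000, 48200, 67500, 22000, -99, -88]

naive_mean = mean(income_annual)                   # wrongly counts -99 and -88 as dollars
correct_mean = mean(valid_values(income_annual))   # uses only the four real incomes
print(naive_mean, correct_mean)
```

The naive mean is pulled downward because two non-responses are counted as negative incomes.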

Variable Documentation

Each variable entry in your codebook can include the following components:

Component | Description | Example
Variable Name | Technical identifier in data file | depression_score
Variable Label | Brief description | PHQ-9 depression symptom scale
Description | Detailed explanation | 9-item self-report measure of depression symptoms over past 2 weeks. Each item scored 0-3.
Question Text | Exact wording if from survey | “Over the past 2 weeks, how often have you been bothered by the following problems?”
Response Options | What respondent sees | 0=Not at all, 1=Several days, 2=More than half the days, 3=Nearly every day
Data Type | Storage format | Integer, Float, String, Date
Measurement Level | Statistical properties | Nominal, Ordinal, Interval, Ratio
Values | Actual codes in data | 0, 1, 2, 3, … 27
Value Labels | Meaning of codes (if categorical) | N/A (this is a sum score)
Valid Range | Acceptable values | 0-27
Missing Codes | Special codes for missing | -99 = Not administered, -88 = Incomplete (<9 items answered)
Units | Measurement units | Points on scale
Notes | Additional information | Clinical cutoff: ≥10 indicates likely depression. Scores of 10-14, 15-19, and 20-27 indicate moderate, moderately severe, and severe depression.
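
If you keep your codebook in machine-readable form, each entry can be stored as a small record. The sketch below uses a Python dataclass whose fields mirror the components above. The class name CodebookEntry and the is_valid helper are assumptions for illustration, not part of any standard.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CodebookEntry:
    """Documentation for one variable; fields mirror the components listed above."""
    name: str
    label: str
    data_type: str
    description: str = ""
    question_text: str = ""
    measurement_level: str = ""
    value_labels: dict = field(default_factory=dict)   # stays empty for scores and measurements
    valid_range: Optional[tuple] = None                # (minimum, maximum), if one applies
    missing_codes: dict = field(default_factory=dict)  # code -> reason
    units: str = ""
    notes: str = ""

    def is_valid(self, value):
        """True if value is not a missing code and lies inside the documented range."""
        if value in self.missing_codes:
            return False
        if self.valid_range is None:
            return True
        low, high = self.valid_range
        return low <= value <= high


phq9 = CodebookEntry(
    name="depression_score",
    label="PHQ-9 depression symptom scale",
    data_type="Integer",
    valid_range=(0, 27),
    missing_codes={-99: "Not administered", -88: "Incomplete (<9 items answered)"},
    units="Points on scale",
)
print(phq9.is_valid(12), phq9.is_valid(-99), phq9.is_valid(30))
```

Storing entries this way lets the same documentation drive both the human-readable codebook and automated data checks.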

Real-World Examples

Example 1: Simple Categorical Variable

Variable Name: housing_status
Variable Label: Current housing situation
Description: Participant's primary residence type at time of interview
Question Text: "Where are you currently living?"
Data Type: Integer
Measurement Level: Nominal

Values and Value Labels:
1 = Own home or apartment
2 = Rent home or apartment  
3 = Living with family or friends (not paying rent)
4 = Emergency shelter
5 = Transitional housing
6 = Street/car/abandoned building
7 = Jail/prison/detention
8 = Hospital/treatment facility
9 = Other (specify)
-99 = Refused to answer
-88 = Don't know

Valid Range: 1-9, plus missing codes
Missing Codes: -99 (Refused), -88 (Don't know)

Notes: Multiple residences coded to where participant spent most nights in past 30 days.

Example 2: Ordinal Variable

Variable Name: health_self
Variable Label: Self-rated general health status  
Description: Single-item measure of overall health (standard CDC question)
Question Text: "Would you say that in general your health is..."
Data Type: Integer
Measurement Level: Ordinal

Values and Value Labels:
1 = Excellent
2 = Very good
3 = Good
4 = Fair
5 = Poor
-99 = Refused to answer

Valid Range: 1-5
Missing Codes: -99 (Refused)

Notes: Higher values indicate worse health. Widely used in population health surveys.
Strongly associated with mortality and morbidity.

Example 3: Continuous Numeric Variable

Variable Name: bmi
Variable Label: Body Mass Index
Description: Calculated from self-reported height and weight
Question Text: N/A (derived variable)
Data Type: Float  
Measurement Level: Ratio

Values: Continuous numeric (actual BMI calculations)
Example values: 18.5, 22.3, 27.8, 31.2, 24.6

Value Labels: N/A (values are meaningful measurements, not codes)

Valid Range: 12.0-60.0
Missing Codes: -99 = Height or weight missing or refused

Units: kg/m²

Notes: Calculated as weight(kg) / height(m)². Values <15 or >50 flagged for review.
Standard categories: <18.5 underweight, 18.5-24.9 normal, 25-29.9 overweight, ≥30 obese.
Based on self-report, may underestimate true BMI.
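
For a derived variable like this one, the codebook formula can be written out as code so that the calculation is reproducible. The sketch below follows the notes above. The function names and the sample participant are hypothetical, and treating values between 24.9 and 25 as "normal" is an assumption about how to close the gap between the quoted categories.

```python
def compute_bmi(weight_kg, height_m):
    """BMI = weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(bmi):
    """Standard categories quoted in the notes (boundaries at 18.5, 25, and 30)."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


def needs_review(bmi):
    """Flag values below 15 or above 50 for manual review, as the notes describe."""
    return bmi < 15 or bmi > 50


bmi = compute_bmi(weight_kg=70.0, height_m=1.75)  # hypothetical participant
print(round(bmi, 1), bmi_category(bmi), needs_review(bmi))
```

Missing codes such as -99 must be filtered out before these functions are applied, since they are not measurements.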

Example 4: Date Variable

Variable Name: interview_date
Variable Label: Date of baseline interview
Description: Date when participant completed baseline survey
Data Type: Date
Measurement Level: Interval

Values: Dates in YYYY-MM-DD format
Example values: 2020-03-15, 2020-07-22, 2021-01-08

Value Labels: N/A (dates are actual dates, not codes)

Valid Range: 2020-01-01 to 2023-12-31
Missing Codes: 9999-99-99 = Interview not completed

Format: ISO 8601 (YYYY-MM-DD)

Notes: All interviews conducted in person except 847 telephone interviews during COVID-19 (March-June 2020).
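
Date variables can be checked against the documented format and range. The sketch below assumes the dates arrive as text, which is an assumption about the file format. Because the missing code 9999-99-99 is not a real calendar date, it has to be recognized before parsing. The function name check_interview_date is hypothetical.

```python
from datetime import date

MISSING_DATE_CODE = "9999-99-99"   # documented as "Interview not completed"
VALID_START = date(2020, 1, 1)
VALID_END = date(2023, 12, 31)


def check_interview_date(raw):
    """Classify a raw interview_date string as 'missing', 'valid', or 'invalid'."""
    if raw == MISSING_DATE_CODE:
        return "missing"
    try:
        parsed = date.fromisoformat(raw)   # expects ISO 8601, e.g. 2020-03-15
    except ValueError:
        return "invalid"
    if VALID_START <= parsed <= VALID_END:
        return "valid"
    return "invalid"


for raw in ["2020-03-15", "9999-99-99", "2024-01-01"]:
    print(raw, check_interview_date(raw))
```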

Example 5: String/Text Variable

Variable Name: client_id
Variable Label: Unique client identifier
Description: De-identified unique ID assigned to each participant
Data Type: String
Measurement Level: Nominal (identifier only)

Values: 10-character alphanumeric codes
Example values: A3F782B1C9, C8D1M5N2P7, B4G9H3J6L8

Value Labels: N/A (each value is unique identifier)

Valid Range: N/A
Missing Codes: None (every case has an ID)

Format: 10-character alphanumeric string

Notes: Cannot be linked back to identifiable information. First character indicates cohort (A=2020, B=2021, C=2022, D=2023).
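
An identifier with a documented format can be validated, and the cohort can be decoded from its first character. The sketch below assumes IDs use only uppercase letters and digits, which the example values suggest but the codebook does not state. The function names are hypothetical.

```python
import re
from collections import Counter

COHORTS = {"A": 2020, "B": 2021, "C": 2022, "D": 2023}
ID_PATTERN = re.compile(r"[A-Z0-9]{10}")  # assumption: uppercase letters and digits only


def cohort_year(client_id):
    """Return the cohort year from the first character, or None if it cannot be decoded."""
    if not ID_PATTERN.fullmatch(client_id):
        return None
    return COHORTS.get(client_id[0])


def duplicated_ids(ids):
    """List IDs that occur more than once; a unique identifier should produce none."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


ids = ["A3F782B1C9", "B4G9H3J6L8", "A3F782B1C9"]  # hypothetical sample with one repeat
print([cohort_year(i) for i in ids])
print(duplicated_ids(ids))
```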

What is a Codebook?

A codebook (also called a data dictionary) is a comprehensive reference document that explains everything about the variables in your dataset. It tells you what each variable means, how it was measured, what the values represent, and how to interpret the data correctly.

Think of it as the instruction manual for your dataset. Without it, you’re trying to assemble furniture without instructions. You might figure it out eventually, but you’ll probably make mistakes along the way.

Here’s how a codebook differs from other research documents:

Document | Purpose | When Created | Contains
Questionnaire | Collect data from participants | Before data collection | Questions asked to participants, response options, instructions
Survey Instrument | Measure constructs/concepts | Before data collection | Validated scales, measurement tools, scoring instructions
Codebook | Document the final dataset | During/after data cleaning | Variable names, value codes, data types, what’s actually in the data file
Analysis Plan | Guide data analysis | Before analysis | Statistical methods, hypotheses, variable relationships

Example to clarify the difference:

Questionnaire asks: "What is your current employment status?"
☐ Employed full-time
☐ Employed part-time  
☐ Unemployed
☐ Retired
☐ Student
☐ Unable to work

The codebook documents:
Variable name: employ_status
Variable label: Current employment status
Data type: Integer
Values and labels:
  1 = Employed full-time (n=3,456, 32.1%)
  2 = Employed part-time (n=2,034, 18.9%)
  3 = Unemployed (n=2,521, 23.4%)
  4 = Retired (n=841, 7.8%)
  5 = Student (n=456, 4.2%)
  6 = Unable to work (n=1,638, 15.2%)
  7 = Other (n=287, 2.7%)
  -99 = Refused to answer (n=123)
Missing: 234 cases (2.1%)

The questionnaire shows what participants saw. The codebook shows what ended up in your data file and how to interpret it.
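
The counts and percentages in a codebook entry like this one are usually generated from the data. The sketch below is one way to do it in Python. The function name frequency_table and the tiny sample are hypothetical, and computing percentages over valid (non-missing) cases is an assumption that your codebook should state explicitly.

```python
from collections import Counter


def frequency_table(values, labels, missing_codes):
    """Return rows of (code, label, n, percent of valid cases) and the number of missing cases."""
    counts = Counter(values)
    n_valid = sum(n for code, n in counts.items() if code not in missing_codes)
    n_missing = sum(n for code, n in counts.items() if code in missing_codes)
    rows = []
    for code in sorted(c for c in counts if c not in missing_codes):
        percent = round(100 * counts[code] / n_valid, 1)
        rows.append((code, labels.get(code, "UNDOCUMENTED"), counts[code], percent))
    return rows, n_missing


labels = {1: "Employed full-time", 2: "Employed part-time", 3: "Unemployed"}
employ_status = [1, 1, 2, 3, -99]  # hypothetical five-case sample

rows, n_missing = frequency_table(employ_status, labels, missing_codes={-99})
for row in rows:
    print(row)
print("missing:", n_missing)
```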

Why You Need a Codebook

Without a codebook, you face serious problems:

Problem 1: Ambiguous codes. You see a variable outcome with values 0 and 1. Does 0 mean success or failure? You’re guessing.

Problem 2: Hidden missing data codes. You calculate mean income and get $45,237. But you didn’t know that -99 means “refused to answer”, so those codes were averaged in as if they were dollar amounts. Your results are wrong.

Problem 3: Lost institutional knowledge. Six months later, you can’t remember what var_37_rec means. A collaborator joins your project and has no idea what any variable represents.

Problem 4: Impossible replication. Another researcher wants to replicate your study but can’t figure out how you defined “service completion” or what your inclusion criteria were.

Primary Data vs. Secondary Data: Who Creates the Codebook?

If You’re Collecting Primary Data

You must create the codebook. Start with a preliminary version during study design (listing planned variables, coding schemes, and missing data codes). Update it continuously during data collection and cleaning, documenting any changes, recoding decisions, or data quality issues. Create detailed entries for any derived variables you calculate, including exact formulas and source variables. Finalize the codebook before beginning analysis, and maintain version control as your project evolves.

If You’re Using Secondary Data

You should receive an existing codebook from the data provider. Carefully review this codebook to understand variable definitions, value labels, and missing data codes. Always verify that the codebook matches your actual data file. Check that all variables are documented and that value ranges align with what you observe. Pay special attention to missing data codes, as these vary across datasets (some use -99, others use 9999 or .).
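
Checking a codebook against the data file can be partly automated. The sketch below compares the variables and categorical codes found in a dataset with those the codebook documents. The data layout (a list of dictionaries, one per case) and the function name compare_to_codebook are assumptions for illustration.

```python
def compare_to_codebook(rows, codebook):
    """Compare data with documentation.

    rows: list of dicts, one per case, mapping variable name to stored value.
    codebook: dict mapping variable name to the set of documented codes.
    """
    data_vars = set().union(*(row.keys() for row in rows))
    doc_vars = set(codebook)
    report = {
        "undocumented_variables": sorted(data_vars - doc_vars),
        "documented_but_absent": sorted(doc_vars - data_vars),
        "undocumented_values": {},
    }
    for var in sorted(data_vars & doc_vars):
        observed = {row[var] for row in rows if var in row}
        extra = observed - codebook[var]
        if extra:
            report["undocumented_values"][var] = sorted(extra)
    return report


# Hypothetical two-case dataset and a partial codebook.
rows = [
    {"health_self": 1, "housing_status": 4},
    {"health_self": 9, "housing_status": -99},
]
codebook = {
    "health_self": {1, 2, 3, 4, 5, -99},
    "housing_status": {1, 2, 3, 4, 5, 6, 7, 8, 9, -99, -88},
    "edu_level": {1, 2, 3, 4, 5, 6, -99, -88},
}
print(compare_to_codebook(rows, codebook))
```

Here the value 9 in health_self is reported because the codebook does not document it, and edu_level is reported because it is documented but absent from the data.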

If the codebook is inadequate, unclear, or missing, you can create your own supplementary codebook. You can use your statistical software or Python to generate automated descriptive statistics for each variable (min, max, mean, frequencies). Look for suspicious patterns that might indicate undocumented missing codes (like repeated values of 99, 999, or -99). Make educated guesses about variable meanings based on names and distributions, but clearly document these as assumptions that need verification. Contact the data provider when possible to clarify ambiguities. Most importantly, maintain clear notes about what you know with certainty versus what you’re assuming, so future users (including yourself) understand the limitations.
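
As a starting point for a supplementary codebook, the descriptive statistics mentioned above can be produced with a few lines of Python. This is a minimal sketch for a non-empty numeric variable. The list of suspect codes is an assumption based on common conventions, and a flagged value is a prompt for investigation, not proof of a missing-data code.

```python
from collections import Counter

# Values that often turn out to be undocumented missing-data codes (an assumption, not a rule).
SUSPECT_CODES = {-99, -88, -77, 99, 999, 9999}


def profile_variable(values):
    """Summarize one non-empty numeric variable and list suspect codes that repeat."""
    counts = Counter(values)
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
        "top_values": counts.most_common(3),
        "suspect_codes": sorted(v for v in counts if v in SUSPECT_CODES and counts[v] > 1),
    }


ages = [34, 51, 999, 28, 999, 45]  # hypothetical column with an undocumented code
profile = profile_variable(ages)
print(profile)
```

In this sample the maximum of 999 and its repetition both point to a likely missing-data code, which should be recorded as an assumption until the data provider confirms it.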

References

https://www.samhsa.gov/data/get-help/codebooks/what-codebook

https://www.icpsr.umich.edu/sites/icpsr/posts/shared/what-is-a-codebook

  • January 21, 2026