Beginner’s Guide on Data Codebook for Data Analysis

Understanding Variables, Values, and Labels

Before diving into codebook creation, you need to understand the fundamental terminology used in data documentation. These terms are often confused, but they have distinct meanings.

Variable Names vs. Variable Labels

A variable name is the technical identifier used in your data file: it appears as the column header in your dataset. Variable names must follow certain rules:

No spaces (use underscores instead: income_annual not income annual)
No special characters except underscore
Often limited to a maximum length that depends on the software (for example, 32 characters in Stata and SAS)
Case-sensitive in some software
Should be concise but meaningful
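
The rules above can be checked mechanically before a dataset is shared. The following is a minimal sketch, not a definitive validator: the function name `check_variable_name`, the 32-character limit, and the requirement that a name start with a letter are assumptions chosen for illustration, and the real limits depend on your software.

```python
import re

# Assumed rule set: starts with a letter, then letters, digits, or underscores.
# The 32-character limit is one common convention; adjust it for your software.
MAX_LENGTH = 32
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def check_variable_name(name: str) -> list[str]:
    """Return a list of rule violations for a proposed variable name."""
    problems = []
    if " " in name:
        problems.append("contains spaces")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("must start with a letter and use only letters, digits, underscore")
    if len(name) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    return problems


for candidate in ["income_annual", "income annual", "svc_hrs_2023", "income$"]:
    print(candidate, "->", check_variable_name(candidate) or "ok")
```

An empty list means the name passed every check. Whether a name is meaningful still needs human judgment.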

A variable label is a human-readable description of what the variable measures. Column names in datasets are often abbreviated or use technical conventions (phq9_baseline, gad7_baseline), which can be cryptic to readers unfamiliar with the data. Variable labels translate these column names into plain language:

Variable Name	Variable Label
`age`	Age at baseline interview
`inc_hh_yr`	Total annual household income
`phq9_tot`	PHQ-9 depression symptom scale total score
`dx_mh_any`	Any mental health diagnosis (lifetime)
`svc_hrs_2023`	Total service hours received in 2023

Good variable names are:

Short but meaningful (age not a, income not var37)
Consistent across your dataset (if you use _tot for totals, use it everywhere)
Free of abbreviations that only you understand (tx could mean “treatment” or “Texas”)

Values vs. Value Labels

Values are the actual data stored in your dataset: the numbers or codes you see when you open the data file.

Value Labels are the human-readable meanings attached to those values. A value label explains what each specific value in a categorical variable means. Many datasets store categories as numbers for efficiency (1, 2, 3, 4 instead of “Male”, “Female”, “Non-binary”, “Other”), and value labels map those numbers back to their meanings.

Example 1: Gender Variable

Variable Name: gender
Variable Label: Self-reported gender identity

Raw data (what you see in the file):
1, 2, 1, 3, 1, 2, 2, 1...

Value Labels (what the numbers mean):
1 = Male
2 = Female  
3 = Non-binary
4 = Other
-99 = Refused to answer

Example 2: Education Variable

Variable Name: edu_level
Variable Label: Highest educational attainment

Raw data:
1, 3, 2, 5, 1, 2, 4, 3...

Value Labels:
1 = Less than high school
2 = High school diploma or GED
3 = Some college, no degree
4 = Associate degree
5 = Bachelor's degree
6 = Graduate or professional degree
-99 = Refused
-88 = Don't know

Without value labels, your data is nearly impossible to interpret. Consider this scenario:

You receive a dataset with a variable called outcome containing values 0, 1, and 2. Without value labels, you cannot know:

Does 0 = failure and 1 = success, or vice versa?
What does 2 mean?
Are higher numbers better or worse?

The codebook must document both the values AND their labels.
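
As a rough illustration of how value labels are used in practice, the sketch below translates the raw gender codes from Example 1 into their labels using only the Python standard library. The helper name `label_values` and the "UNDOCUMENTED CODE" marker are assumptions made for this example. The key point is that a code missing from the codebook is flagged, not silently dropped.

```python
# Value labels copied from the gender example above.
GENDER_LABELS = {
    1: "Male",
    2: "Female",
    3: "Non-binary",
    4: "Other",
    -99: "Refused to answer",
}


def label_values(values, labels):
    """Translate raw codes to labels, flagging any code the codebook does not document."""
    return [labels.get(v, f"UNDOCUMENTED CODE ({v})") for v in values]


raw = [1, 2, 1, 3, 1, 2, 2, 1]
print(label_values(raw, GENDER_LABELS))

# A code such as 7 is not in the codebook, so it is flagged for follow-up.
print(label_values([1, 7], GENDER_LABELS))
```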

Numeric Variables: A Special Case

For continuous numeric variables (like age, income, or test scores), the values ARE the meaningful data. They don’t need separate labels:

Variable Name: age
Variable Label: Age in years at baseline interview
Type: Integer
Valid Range: 18-89

The values are:
18, 42, 67, 23, 55, 34...

These are actual ages, not codes that need labels.

However, even numeric variables may have special codes:

Variable Name: income_annual  
Variable Label: Total household income in past 12 months
Type: Float
Valid Range: 0-250000
Units: US Dollars

Most values are actual dollar amounts:
35000, 48200, 67500, 22000...

But special codes indicate missing data:
-99 = Refused to answer
-88 = Don't know
-77 = Question not asked (skip pattern)
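
To see why these special codes matter, the sketch below computes a mean with and without them. It uses the four dollar amounts shown above plus two missing codes; the variable names and the choice to drop missing cases (one of several possible ways to handle them) are assumptions for illustration.

```python
from statistics import mean

# Missing-data codes from the income_annual entry above.
MISSING_CODES = {-99, -88, -77}

income_annual = [35000, 48200, 67500, 22000, -99, -88]

# Wrong: treats -99 and -88 as if they were dollar amounts.
naive_mean = mean(income_annual)

# Better: keep only real measurements before summarizing.
valid = [x for x in income_annual if x not in MISSING_CODES]
clean_mean = mean(valid)

print(f"Naive mean (codes included): {naive_mean:.2f}")
print(f"Mean of valid values only:   {clean_mean:.2f}")
print(f"Excluded as missing:         {len(income_annual) - len(valid)}")
```

With only two coded cases out of six, the naive mean is already more than 14,000 dollars too low.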

Variable Documentation

Each variable's entry in your codebook can include the following components:

Component	Description	Example
Variable Name	Technical identifier in data file	`depression_score`
Variable Label	Brief description	PHQ-9 depression symptom scale
Description	Detailed explanation	9-item self-report measure of depression symptoms over past 2 weeks. Each item scored 0-3.
Question Text	Exact wording if from survey	“Over the past 2 weeks, how often have you been bothered by the following problems?”
Response Options	What respondent sees	0=Not at all, 1=Several days, 2=More than half the days, 3=Nearly every day
Data Type	Storage format	Integer, Float, String, Date
Measurement Level	Statistical properties	Nominal, Ordinal, Interval, Ratio
Values	Actual codes in data	0, 1, 2, 3, … 27
Value Labels	Meaning of codes (if categorical)	N/A (this is a sum score)
Valid Range	Acceptable values	0-27
Missing Codes	Special codes for missing	-99 = Not administered, -88 = Incomplete (<9 items answered)
Units	Measurement units	Points on scale
Notes	Additional information	Clinical cutoff: ≥10 indicates likely depression. Scores of 10-14, 15-19, and 20-27 indicate moderate, moderately severe, and severe depression, respectively.
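
If you keep your codebook in machine-readable form, the components in the table map naturally onto a small record type. The sketch below is one possible layout, not a standard schema: the class name `CodebookEntry`, its field names, and the `is_valid` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CodebookEntry:
    """One variable's documentation; fields mirror the components in the table above."""
    name: str
    label: str
    description: str = ""
    question_text: str = ""
    data_type: str = ""
    measurement_level: str = ""
    value_labels: dict = field(default_factory=dict)   # empty for non-categorical variables
    valid_range: Optional[tuple] = None                # (min, max), inclusive
    missing_codes: dict = field(default_factory=dict)  # code -> reason
    units: str = ""
    notes: str = ""

    def is_valid(self, value) -> bool:
        """True if value is a real measurement: not a missing code and inside the range."""
        if value in self.missing_codes:
            return False
        if self.valid_range is None:
            return True
        low, high = self.valid_range
        return low <= value <= high


depression_score = CodebookEntry(
    name="depression_score",
    label="PHQ-9 depression symptom scale",
    data_type="Integer",  # assumption: a sum of integer item scores
    valid_range=(0, 27),
    missing_codes={-99: "Not administered", -88: "Incomplete (<9 items answered)"},
    units="Points on scale",
)

# Keeps only real scores: 28 is out of range and -99 is a missing code.
print([v for v in [0, 12, 27, 28, -99] if depression_score.is_valid(v)])
```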

Real-World Examples

Example 1: Simple Categorical Variable

Variable Name: housing_status
Variable Label: Current housing situation
Description: Participant's primary residence type at time of interview
Question Text: "Where are you currently living?"
Data Type: Integer
Measurement Level: Nominal

Values and Value Labels:
1 = Own home or apartment
2 = Rent home or apartment  
3 = Living with family or friends (not paying rent)
4 = Emergency shelter
5 = Transitional housing
6 = Street/car/abandoned building
7 = Jail/prison/detention
8 = Hospital/treatment facility
9 = Other (specify)
-99 = Refused to answer
-88 = Don't know

Valid Range: 1-9, plus missing codes
Missing Codes: -99 (Refused), -88 (Don't know)

Notes: Multiple residences coded to where participant spent most nights in past 30 days.
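
An entry like this one can double as a data-quality check. The sketch below counts how many values in a column are valid codes, documented missing codes, or codes the codebook does not mention. The function name `audit_codes` and the sample column are hypothetical.

```python
from collections import Counter

# Codes documented for housing_status in the entry above.
VALID_CODES = set(range(1, 10))   # 1 through 9
MISSING_CODES = {-99, -88}


def audit_codes(values):
    """Count valid, missing, and undocumented codes in a categorical column."""
    summary = Counter()
    undocumented = set()
    for v in values:
        if v in VALID_CODES:
            summary["valid"] += 1
        elif v in MISSING_CODES:
            summary["missing"] += 1
        else:
            summary["undocumented"] += 1
            undocumented.add(v)
    return summary, undocumented


# Hypothetical column: 0 and 10 are not documented codes.
summary, undocumented = audit_codes([1, 2, 2, 6, -99, 0, 9, 10, -88])
print(dict(summary), sorted(undocumented))
```

Any undocumented code should be traced back to its source (a data entry error, or a category added late) and then either corrected or added to the codebook.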

Example 2: Ordinal Variable

Variable Name: health_self
Variable Label: Self-rated general health status  
Description: Single-item measure of overall health (standard CDC question)
Question Text: "Would you say that in general your health is..."
Data Type: Integer
Measurement Level: Ordinal

Values and Value Labels:
1 = Excellent
2 = Very good
3 = Good
4 = Fair
5 = Poor
-99 = Refused to answer

Valid Range: 1-5
Missing Codes: -99 (Refused)

Notes: Higher values indicate worse health. Widely used in population health surveys.
Strongly associated with mortality and morbidity.

Example 3: Continuous Numeric Variable

Variable Name: bmi
Variable Label: Body Mass Index
Description: Calculated from self-reported height and weight
Question Text: N/A (derived variable)
Data Type: Float  
Measurement Level: Ratio

Values: Continuous numeric (actual BMI calculations)
Example values: 18.5, 22.3, 27.8, 31.2, 24.6

Value Labels: N/A (values are meaningful measurements, not codes)

Valid Range: 12.0-60.0
Missing Codes: -99 = Height or weight missing or refused

Units: kg/m²

Notes: Calculated as weight(kg) / height(m)². Values <15 or >50 flagged for review.
Standard categories: <18.5 underweight, 18.5-24.9 normal, 25-29.9 overweight, ≥30 obese.
Based on self-report, may underestimate true BMI.
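
For a derived variable, the codebook entry should be precise enough that someone else can reproduce it. The sketch below implements the formula, categories, and review rule from the notes above; the function names and the sample participant (70 kg, 1.75 m) are assumptions for illustration.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Standard categories as listed in the notes above."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal"
    if value < 30:
        return "overweight"
    return "obese"


def needs_review(value: float) -> bool:
    """Flag values below 15 or above 50, per the review rule in the notes."""
    return value < 15 or value > 50


# Hypothetical participant: 70 kg, 1.75 m.
value = round(bmi(70.0, 1.75), 1)
print(value, bmi_category(value), needs_review(value))
```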

Example 4: Date Variable

Variable Name: interview_date
Variable Label: Date of baseline interview
Description: Date when participant completed baseline survey
Data Type: Date
Measurement Level: Interval

Values: Dates in YYYY-MM-DD format
Example values: 2020-03-15, 2020-07-22, 2021-01-08

Value Labels: N/A (dates are actual dates, not codes)

Valid Range: 2020-01-01 to 2023-12-31
Missing Codes: 9999-99-99 = Interview not completed

Format: ISO 8601 (YYYY-MM-DD)

Notes: All interviews conducted in person except 847 telephone interviews during COVID-19 (March-June 2020).
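
One practical consequence of this entry: the missing code 9999-99-99 is not a real calendar date, so date parsers will reject it unless you handle it first. The following sketch shows one way to do that with Python's standard library; the function name `parse_interview_date` and the decision to return `None` for missing dates are assumptions.

```python
from datetime import date

MISSING_DATE = "9999-99-99"           # missing code from the entry above
RANGE_START = date(2020, 1, 1)
RANGE_END = date(2023, 12, 31)


def parse_interview_date(raw: str):
    """Return a date, or None for the missing code; raise ValueError otherwise."""
    if raw == MISSING_DATE:
        return None
    parsed = date.fromisoformat(raw)  # raises ValueError on malformed input
    if not RANGE_START <= parsed <= RANGE_END:
        raise ValueError(f"{raw} is outside the documented valid range")
    return parsed


for raw in ["2020-03-15", "2021-01-08", "9999-99-99"]:
    print(raw, "->", parse_interview_date(raw))
```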

Example 5: String/Text Variable

Variable Name: client_id
Variable Label: Unique client identifier
Description: De-identified unique ID assigned to each participant
Data Type: String
Measurement Level: Nominal (identifier only)

Values: 10-character alphanumeric codes
Example values: A3F782B1C9, C8D1M5N2P7, B4G9H3J6L8

Value Labels: N/A (each value is unique identifier)

Valid Range: N/A
Missing Codes: None (every case has an ID)

Format: 10-character alphanumeric string

Notes: Cannot be linked back to identifiable information. First character indicates cohort (A=2020, B=2021, C=2022, D=2023).
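
Identifier rules like these are easy to verify in code. The sketch below checks the 10-character format, decodes the cohort from the first character, and looks for duplicate IDs. It assumes IDs use uppercase letters and digits only (as in the example values); the function names are hypothetical.

```python
import re

# Assumption: IDs use uppercase letters and digits only, as in the example values.
ID_PATTERN = re.compile(r"[A-Z0-9]{10}")
COHORTS = {"A": 2020, "B": 2021, "C": 2022, "D": 2023}


def cohort_year(client_id: str):
    """Return the cohort year encoded in the first character, or None if undocumented."""
    if not ID_PATTERN.fullmatch(client_id):
        raise ValueError(f"{client_id!r} is not a 10-character alphanumeric ID")
    return COHORTS.get(client_id[0])


def find_duplicates(ids):
    """IDs are meant to be unique; return any that appear more than once."""
    seen, dupes = set(), set()
    for cid in ids:
        (dupes if cid in seen else seen).add(cid)
    return dupes


print(cohort_year("A3F782B1C9"), cohort_year("B4G9H3J6L8"))
print(find_duplicates(["A3F782B1C9", "B4G9H3J6L8", "A3F782B1C9"]))
```

A return value of `None` from `cohort_year` means the first character is not one of the documented cohort letters, which is worth investigating.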

What is a Codebook?

A codebook (also called a data dictionary) is a comprehensive reference document that explains everything about the variables in your dataset. It tells you what each variable means, how it was measured, what the values represent, and how to interpret the data correctly.

Think of it as the instruction manual for your dataset. Without it, you’re trying to assemble furniture without instructions. You might figure it out eventually, but you’ll probably make mistakes along the way.

Here’s how a codebook differs from other research documents:

Document	Purpose	When Created	Contains
Questionnaire	Collect data from participants	Before data collection	Questions asked to participants, response options, instructions
Survey Instrument	Measure constructs/concepts	Before data collection	Validated scales, measurement tools, scoring instructions
Codebook	Document the final dataset	During/after data cleaning	Variable names, value codes, data types, what’s actually in the data file
Analysis Plan	Guide data analysis	Before analysis	Statistical methods, hypotheses, variable relationships

Example to clarify the difference:

Questionnaire asks: "What is your current employment status?"
☐ Employed full-time
☐ Employed part-time  
☐ Unemployed
☐ Retired
☐ Student
☐ Unable to work

The codebook documents:
Variable name: employ_status
Variable label: Current employment status
Data type: Integer
Values and labels:
  1 = Employed full-time (n=3,456, 32.1%)
  2 = Employed part-time (n=2,034, 18.9%)
  3 = Unemployed (n=2,521, 23.4%)
  4 = Retired (n=841, 7.8%)
  5 = Student (n=456, 4.2%)
  6 = Unable to work (n=1,638, 15.2%)
  7 = Other (n=287, 2.7%)
  -99 = Refused to answer (n=123)
Missing: 234 cases (2.1%)

The questionnaire shows what participants saw. The codebook shows what ended up in your data file and how to interpret it.
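
The counts and percentages in a codebook entry like the one above are usually generated from the data, not typed by hand. Here is a minimal sketch of that step, assuming a small made-up column (not the study data) and a hypothetical helper named `frequency_table`. Percentages here are taken over all cases, including refusals; a real codebook should state which denominator it uses.

```python
from collections import Counter

EMPLOY_LABELS = {
    1: "Employed full-time", 2: "Employed part-time", 3: "Unemployed",
    4: "Retired", 5: "Student", 6: "Unable to work", 7: "Other",
    -99: "Refused to answer",
}

# Hypothetical raw column, not the study data.
employ_status = [1, 1, 3, 2, 6, 1, -99, 4, 3, 2]


def frequency_table(values, labels):
    """Return (code, label, n, percent) rows for every documented code that occurs."""
    counts = Counter(values)
    total = len(values)
    return [
        (code, label, counts[code], 100 * counts[code] / total)
        for code, label in labels.items()
        if counts[code]
    ]


for code, label, n, pct in frequency_table(employ_status, EMPLOY_LABELS):
    print(f"{code:>4} = {label} (n={n}, {pct:.1f}%)")
```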

Why You Need a Codebook

Without a codebook, you face serious problems:

Problem 1: Ambiguous codes. You see a variable outcome with values 0 and 1. Does 0 mean success or failure? You’re guessing.

Problem 2: Hidden missing data codes. You calculate mean income and get $45,237. But you didn’t know that -99 means “refused to answer” and you just included those in your calculation. Your results are wrong.

Problem 3: Lost institutional knowledge. Six months later, you can’t remember what var_37_rec means. A collaborator joins your project and has no idea what any variable represents.

Problem 4: Impossible replication. Another researcher wants to replicate your study but can’t figure out how you defined “service completion” or what your inclusion criteria were.

Primary Data vs. Secondary Data: Who Creates the Codebook?

If You’re Collecting Primary Data

You must create the codebook. Start with a preliminary version during study design (listing planned variables, coding schemes, and missing data codes). Update it continuously during data collection and cleaning, documenting any changes, recoding decisions, or data quality issues. Create detailed entries for any derived variables you calculate, including exact formulas and source variables. Finalize the codebook before beginning analysis, and maintain version control as your project evolves.

If You’re Using Secondary Data

You should receive an existing codebook from the data provider. Carefully review this codebook to understand variable definitions, value labels, and missing data codes. Always verify that the codebook matches your actual data file. Check that all variables are documented and that value ranges align with what you observe. Pay special attention to missing data codes, as these vary across datasets (some use -99, others use 9999 or .).
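
The verification step described above can be partly automated. The sketch below compares the variables in a data file against those in the codebook and lists observed values that fall outside a documented range. The function names, column names, and sample values are hypothetical.

```python
def compare_variables(data_columns, codebook_variables):
    """Return (variables in the data but not the codebook, variables in the codebook but not the data)."""
    data_columns, codebook_variables = set(data_columns), set(codebook_variables)
    undocumented = sorted(data_columns - codebook_variables)
    absent = sorted(codebook_variables - data_columns)
    return undocumented, absent


def out_of_range(values, low, high, missing_codes=()):
    """Return observed values that are neither in [low, high] nor a documented missing code."""
    return sorted({v for v in values if v not in missing_codes and not low <= v <= high})


# Hypothetical data file and codebook.
columns = ["client_id", "age", "edu_level", "var_37_rec"]
documented = ["client_id", "age", "edu_level", "health_self"]
print(compare_variables(columns, documented))

# Hypothetical age column checked against a documented range of 18-89.
print(out_of_range([18, 42, 67, 999, -99, 17], low=18, high=89, missing_codes={-99}))
```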

If the codebook is inadequate, unclear, or missing, you can create your own supplementary codebook. You can use your statistical software or a language such as Python to generate automated descriptive statistics for each variable (min, max, mean, frequencies). Look for suspicious patterns that might indicate undocumented missing codes (like repeated values of 99, 999, or -99). Make educated guesses about variable meanings based on names and distributions, but clearly document these as assumptions that need verification. Contact the data provider when possible to clarify ambiguities. Most importantly, maintain clear notes about what you know with certainty versus what you’re assuming, so future users (including yourself) understand the limitations.
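
As a starting point for that kind of profiling, the following sketch computes basic descriptive statistics for one numeric column and reports any values that match a list of commonly used missing codes. The list `SUSPECT_CODES`, the function name `profile`, and the sample column are assumptions. A flagged value is only a candidate, since 99 can be a legitimate measurement in some variables.

```python
from collections import Counter
from statistics import mean

# Assumed list of commonly used missing codes; extend it for your data source.
SUSPECT_CODES = {99, 999, 9999, -9, -99, -88, -77, -999}


def profile(values):
    """Basic descriptive statistics plus any suspect codes found, with their counts."""
    counts = Counter(values)
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": round(mean(values), 2),
        "suspect_codes": {v: counts[v] for v in sorted(counts) if v in SUSPECT_CODES},
    }


# Hypothetical undocumented age column: 999 and -99 stand out against plausible ages.
ages = [34, 51, 999, 27, 45, -99, 999, 62]
print(profile(ages))
```

Record each flagged code in your supplementary codebook as an assumption until the data provider confirms it.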

References

https://www.samhsa.gov/data/get-help/codebooks/what-codebook

https://www.icpsr.umich.edu/sites/icpsr/posts/shared/what-is-a-codebook

January 21, 2025
