Beginner’s Guide to Data Codebooks for Data Analysis
Understanding Variables, Values, and Labels
Before diving into codebook creation, you need to understand the fundamental terminology used in data documentation. These terms are often confused, but they have distinct meanings.
Variable Names vs. Variable Labels
Variable Name is the technical identifier used in your data file: what appears as the column header in your dataset. Variable names must follow certain rules:
- No spaces (use underscores instead: income_annual, not income annual)
- No special characters except underscore
- Often limited to a certain length (typically 32 characters)
- Case-sensitive in some software
- Should be concise but meaningful
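The rules above can be turned into a quick automated check. The sketch below is only one possible reading of them: the 32-character limit and the requirement that a name start with a letter are assumptions, so adjust both to whatever your software actually enforces. The function name check_variable_name is hypothetical.

```python
import re

# Assumed rule set: letters, digits, and underscores only, starting with a
# letter, at most 32 characters. Adjust to your own software's limits.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
MAX_LENGTH = 32


def check_variable_name(name: str) -> list[str]:
    """Return a list of rule violations (empty if the name passes)."""
    problems = []
    if " " in name:
        problems.append("contains spaces")
    if len(name) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("uses characters other than letters, digits, underscores")
    return problems


for candidate in ["income_annual", "income annual", "phq9_tot", "svc-hrs-2023"]:
    print(candidate, "->", check_variable_name(candidate) or "ok")
```

Running a check like this over all column headers before you write the codebook catches naming problems while they are still cheap to fix.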
Variable Label is the human-readable description of what the variable measures. It can be longer, contain spaces, and use full sentences.
| Variable Name | Variable Label |
|---|---|
| age | Age at baseline interview |
| inc_hh_yr | Total annual household income |
| phq9_tot | PHQ-9 depression symptom scale total score |
| dx_mh_any | Any mental health diagnosis (lifetime) |
| svc_hrs_2023 | Total service hours received in 2023 |
Good variable names are:
- Short but meaningful (age not a, income not var37)
- Consistent across your dataset (if you use _tot for totals, use it everywhere)
- Free of abbreviations that only you understand (tx could mean “treatment” or “Texas”)
Values vs. Value Labels

Values are the actual data stored in your dataset: the numbers or codes you see when you open the data file.
Value Labels are the human-readable meanings attached to those values.
This distinction is crucial for categorical variables:
Example 1: Gender Variable
Variable Name: gender
Variable Label: Self-reported gender identity
Raw data (what you see in the file):
1, 2, 1, 3, 1, 2, 2, 1...
Value Labels (what the numbers mean):
1 = Male
2 = Female
3 = Non-binary
4 = Other
-99 = Refused to answer
Example 2: Education Variable
Variable Name: edu_level
Variable Label: Highest educational attainment
Raw data:
1, 3, 2, 5, 1, 2, 4, 3...
Value Labels:
1 = Less than high school
2 = High school diploma or GED
3 = Some college, no degree
4 = Associate degree
5 = Bachelor's degree
6 = Graduate or professional degree
-99 = Refused
-88 = Don't know
Why This Matters
Without value labels, your data is nearly impossible to interpret. Consider this scenario:
You receive a dataset with a variable called outcome containing values 0, 1, and 2. Without value labels, you cannot know:
- Does 0 = failure and 1 = success, or vice versa?
- What does 2 mean?
- Are higher numbers better or worse?
The codebook must document both the values AND their labels.
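To see how values and value labels work together, here is a minimal Python sketch that translates raw codes into labels using the edu_level example. The helper name label_values and the sample list are hypothetical. The important design choice is that a code missing from the codebook is flagged rather than silently guessed.

```python
# Value labels for the edu_level example.
EDU_LABELS = {
    1: "Less than high school",
    2: "High school diploma or GED",
    3: "Some college, no degree",
    4: "Associate degree",
    5: "Bachelor's degree",
    6: "Graduate or professional degree",
    -99: "Refused",
    -88: "Don't know",
}


def label_values(values, labels):
    """Translate raw codes to labels; undocumented codes are flagged, not guessed."""
    return [labels.get(v, f"UNDOCUMENTED CODE ({v})") for v in values]


# Hypothetical raw data; 7 is deliberately not in the codebook.
raw = [1, 3, 2, 5, -99, 7]
for code, label in zip(raw, label_values(raw, EDU_LABELS)):
    print(code, "=", label)
```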
Numeric Variables: A Special Case
For continuous numeric variables (like age, income, or test scores), the values ARE the meaningful data. They don’t need separate labels:
Variable Name: age
Variable Label: Age in years at baseline interview
Type: Integer
Valid Range: 18-89
The values are:
18, 42, 67, 23, 55, 34...
These are actual ages, not codes that need labels.
However, even numeric variables may have special codes:
Variable Name: income_annual
Variable Label: Total household income in past 12 months
Type: Float
Valid Range: 0-250000
Units: US Dollars
Most values are actual dollar amounts:
35000, 48200, 67500, 22000...
But special codes indicate missing data:
-99 = Refused to answer
-88 = Don't know
-77 = Question not asked (skip pattern)
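The practical consequence is that special codes must be removed before any arithmetic. The sketch below uses a small made-up list of responses (the numbers are illustrative only) to compare a naive mean with one that excludes the documented missing codes.

```python
from statistics import mean

MISSING_CODES = {-99, -88, -77}  # from the income_annual entry

# Hypothetical responses: four real incomes plus two missing codes.
income_annual = [35000, 48200, 67500, 22000, -99, -88]

valid = [x for x in income_annual if x not in MISSING_CODES]

naive_mean = mean(income_annual)  # wrongly treats -99 and -88 as dollar amounts
correct_mean = mean(valid)        # uses only real dollar amounts

print(f"Naive mean:   {naive_mean:,.2f}")
print(f"Correct mean: {correct_mean:,.2f}")
print(f"Excluded {len(income_annual) - len(valid)} missing responses")
```

With only two missing codes among six responses, the naive mean is already far below the correct one, which is why the codebook has to list every special code.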
Variable Documentation

Each variable entry in your codebook can include the following components:
| Component | Description | Example |
|---|---|---|
| Variable Name | Technical identifier in data file | depression_score |
| Variable Label | Brief description | PHQ-9 depression symptom scale |
| Description | Detailed explanation | 9-item self-report measure of depression symptoms over past 2 weeks. Each item scored 0-3. |
| Question Text | Exact wording if from survey | “Over the past 2 weeks, how often have you been bothered by the following problems?” |
| Response Options | What respondent sees | 0=Not at all, 1=Several days, 2=More than half the days, 3=Nearly every day |
| Data Type | Storage format | Integer, Float, String, Date |
| Measurement Level | Statistical properties | Nominal, Ordinal, Interval, Ratio |
| Values | Actual codes in data | 0, 1, 2, 3, … 27 |
| Value Labels | Meaning of codes (if categorical) | N/A (this is a sum score) |
| Valid Range | Acceptable values | 0-27 |
| Missing Codes | Special codes for missing | -99 = Not administered, -88 = Incomplete (<9 items answered) |
| Units | Measurement units | Points on scale |
| Notes | Additional information | Clinical cutoff: ≥10 indicates likely depression. Scores of 10-14, 15-19, and 20-27 indicate moderate, moderately severe, and severe depression. |
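If you keep your codebook in code rather than in a document, the components in the table map naturally onto a small record type. The following is one possible sketch using a Python dataclass. The class name CodebookEntry and its field names are assumptions chosen to mirror the table, not a standard format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CodebookEntry:
    """One variable's documentation; fields mirror the components table."""
    name: str
    label: str
    data_type: str
    description: str = ""
    question_text: str = ""
    measurement_level: str = ""
    value_labels: dict = field(default_factory=dict)   # code -> meaning
    valid_range: Optional[tuple] = None                # (min, max), inclusive
    missing_codes: dict = field(default_factory=dict)  # code -> reason
    units: str = ""
    notes: str = ""


depression_score = CodebookEntry(
    name="depression_score",
    label="PHQ-9 depression symptom scale",
    data_type="Integer",
    description="9-item self-report measure of depression symptoms "
                "over past 2 weeks. Each item scored 0-3.",
    valid_range=(0, 27),
    missing_codes={-99: "Not administered",
                   -88: "Incomplete (<9 items answered)"},
    units="Points on scale",
)

print(depression_score.name, depression_score.valid_range)
```

Storing entries this way makes it possible to check the data against the codebook automatically instead of by eye.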
Real-World Examples
Example 1: Simple Categorical Variable
Variable Name: housing_status
Variable Label: Current housing situation
Description: Participant's primary residence type at time of interview
Question Text: "Where are you currently living?"
Data Type: Integer
Measurement Level: Nominal
Values and Value Labels:
1 = Own home or apartment
2 = Rent home or apartment
3 = Living with family or friends (not paying rent)
4 = Emergency shelter
5 = Transitional housing
6 = Street/car/abandoned building
7 = Jail/prison/detention
8 = Hospital/treatment facility
9 = Other (specify)
-99 = Refused to answer
-88 = Don't know
Valid Range: 1-9, plus missing codes
Missing Codes: -99 (Refused), -88 (Don't know)
Notes: Multiple residences coded to where participant spent most nights in past 30 days.
Example 2: Ordinal Variable
Variable Name: health_self
Variable Label: Self-rated general health status
Description: Single-item measure of overall health (standard CDC question)
Question Text: "Would you say that in general your health is..."
Data Type: Integer
Measurement Level: Ordinal
Values and Value Labels:
1 = Excellent
2 = Very good
3 = Good
4 = Fair
5 = Poor
-99 = Refused to answer
Valid Range: 1-5
Missing Codes: -99 (Refused)
Notes: Higher values indicate worse health. Widely used in population health surveys.
Strongly associated with mortality and morbidity.
Example 3: Continuous Numeric Variable
Variable Name: bmi
Variable Label: Body Mass Index
Description: Calculated from self-reported height and weight
Question Text: N/A (derived variable)
Data Type: Float
Measurement Level: Ratio
Values: Continuous numeric (actual BMI calculations)
Example values: 18.5, 22.3, 27.8, 31.2, 24.6
Value Labels: N/A (values are meaningful measurements, not codes)
Valid Range: 12.0-60.0
Missing Codes: -99 = Height or weight missing or refused
Units: kg/m²
Notes: Calculated as weight(kg) / height(m)². Values <15 or >50 flagged for review.
Standard categories: <18.5 underweight, 18.5-24.9 normal, 25-29.9 overweight, ≥30 obese.
Based on self-report, may underestimate true BMI.
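Because bmi is a derived variable, the codebook entry should be precise enough that someone else could recompute it. Here is a minimal sketch of the formula, the review flag, and the standard categories from the notes. Rounding to one decimal place and treating a non-positive height as missing are assumptions, and the function names are hypothetical.

```python
MISSING = -99  # code for "height or weight missing or refused"


def compute_bmi(weight_kg, height_m):
    """BMI = weight(kg) / height(m)^2; returns the missing code if an input is absent."""
    if weight_kg is None or height_m is None or height_m <= 0:
        return MISSING
    return round(weight_kg / height_m ** 2, 1)  # one decimal place is an assumption


def needs_review(bmi):
    """Flag implausible values (<15 or >50) for manual review."""
    return bmi != MISSING and (bmi < 15 or bmi > 50)


def bmi_category(bmi):
    """Map a BMI value to the standard categories listed in the notes."""
    if bmi == MISSING:
        return "Missing"
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"


bmi = compute_bmi(70, 1.75)  # hypothetical participant
print(bmi, bmi_category(bmi), needs_review(bmi))
```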
Example 4: Date Variable
Variable Name: interview_date
Variable Label: Date of baseline interview
Description: Date when participant completed baseline survey
Data Type: Date
Measurement Level: Interval
Values: Dates in YYYY-MM-DD format
Example values: 2020-03-15, 2020-07-22, 2021-01-08
Value Labels: N/A (dates are actual dates, not codes)
Valid Range: 2020-01-01 to 2023-12-31
Missing Codes: 9999-99-99 = Interview not completed
Format: ISO 8601 (YYYY-MM-DD)
Notes: All interviews conducted in person except 847 telephone interviews during COVID-19 (March-June 2020).
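Date variables need the missing code handled before parsing, because 9999-99-99 is not a real calendar date and a date parser will reject it. The sketch below classifies raw values against this entry. The function name and the four status strings are hypothetical.

```python
from datetime import date

MISSING_DATE = "9999-99-99"      # "Interview not completed"
VALID_START = date(2020, 1, 1)
VALID_END = date(2023, 12, 31)


def check_interview_date(raw: str) -> str:
    """Classify a raw value as 'missing', 'valid', 'out_of_range', or 'malformed'."""
    if raw == MISSING_DATE:
        return "missing"         # must be caught first: it is not a parseable date
    try:
        parsed = date.fromisoformat(raw)
    except ValueError:
        return "malformed"
    if VALID_START <= parsed <= VALID_END:
        return "valid"
    return "out_of_range"


for raw in ["2020-03-15", "9999-99-99", "2024-02-01", "15/03/2020"]:
    print(raw, "->", check_interview_date(raw))
```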
Example 5: String/Text Variable
Variable Name: client_id
Variable Label: Unique client identifier
Description: De-identified unique ID assigned to each participant
Data Type: String
Measurement Level: Nominal (identifier only)
Values: 10-character alphanumeric codes
Example values: A3F782B1C9, C8D1M5N2P7, B4G9H3J6L8
Value Labels: N/A (each value is unique identifier)
Valid Range: N/A
Missing Codes: None (every case has an ID)
Format: 10-character alphanumeric string
Notes: Cannot be linked back to identifiable information. First character indicates cohort (A=2020, B=2021, C=2022, D=2023).
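Even an identifier carries rules you can check: the format, the cohort prefix, and the claim that every ID is unique. The sketch below illustrates all three. It assumes IDs use uppercase letters and digits only, which the entry does not state explicitly, and the function names and the last two sample IDs are hypothetical.

```python
import re
from collections import Counter

ID_PATTERN = re.compile(r"[A-Z0-9]{10}")  # assumes uppercase letters and digits
COHORTS = {"A": 2020, "B": 2021, "C": 2022, "D": 2023}


def cohort_year(client_id):
    """Return the cohort year, or None if the ID is malformed or its prefix is undocumented."""
    if not ID_PATTERN.fullmatch(client_id):
        return None
    return COHORTS.get(client_id[0])


def find_duplicates(ids):
    """Return IDs that appear more than once (should be empty for a unique key)."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


ids = ["A3F782B1C9", "B4G9H3J6L8", "A3F782B1C9", "Z000000000"]
for client_id in ids:
    print(client_id, "->", cohort_year(client_id))
print("Duplicates:", find_duplicates(ids))
```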
What is a Codebook?

A codebook (also called a data dictionary) is a comprehensive reference document that explains everything about the variables in your dataset. It tells you what each variable means, how it was measured, what the values represent, and how to interpret the data correctly.
Think of it as the instruction manual for your dataset. Without it, you’re trying to assemble furniture without instructions. You might figure it out eventually, but you’ll probably make mistakes along the way.
Here’s how a codebook differs from other research documents:
| Document | Purpose | When Created | Contains |
|---|---|---|---|
| Questionnaire | Collect data from participants | Before data collection | Questions asked to participants, response options, instructions |
| Survey Instrument | Measure constructs/concepts | Before data collection | Validated scales, measurement tools, scoring instructions |
| Codebook | Document the final dataset | During/after data cleaning | Variable names, value codes, data types, what’s actually in the data file |
| Analysis Plan | Guide data analysis | Before analysis | Statistical methods, hypotheses, variable relationships |
Example to clarify the difference:
Questionnaire asks: "What is your current employment status?"
☐ Employed full-time
☐ Employed part-time
☐ Unemployed
☐ Retired
☐ Student
☐ Unable to work
The codebook documents:
Variable name: employ_status
Variable label: Current employment status
Data type: Integer
Values and labels:
1 = Employed full-time (n=3,456, 32.1%)
2 = Employed part-time (n=2,034, 18.9%)
3 = Unemployed (n=2,521, 23.4%)
4 = Retired (n=841, 7.8%)
5 = Student (n=456, 4.2%)
6 = Unable to work (n=1,638, 15.2%)
7 = Other (n=287, 2.7%)
-99 = Refused to answer (n=123)
Missing: 234 cases (2.1%)
The questionnaire shows what participants saw. The codebook shows what ended up in your data file and how to interpret it.
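The counts and percentages in a codebook entry like this one are usually generated from the data rather than typed by hand. The sketch below shows one way to build such a frequency table. The sample responses are made up for illustration, and the function name frequency_table is hypothetical.

```python
from collections import Counter

EMPLOY_LABELS = {
    1: "Employed full-time",
    2: "Employed part-time",
    3: "Unemployed",
    4: "Retired",
    5: "Student",
    6: "Unable to work",
    7: "Other",
    -99: "Refused to answer",
}


def frequency_table(values, labels):
    """Return (code, label, n, percent) rows, with negative missing codes listed last."""
    counts = Counter(values)
    total = len(values)
    rows = []
    for code in sorted(counts, key=lambda c: (c < 0, c)):
        pct = round(100 * counts[code] / total, 1)
        rows.append((code, labels.get(code, "UNDOCUMENTED"), counts[code], pct))
    return rows


# Hypothetical responses for ten participants.
employ_status = [1, 1, 1, 2, 3, 3, 6, 7, -99, 1]
for code, label, n, pct in frequency_table(employ_status, EMPLOY_LABELS):
    print(f"{code:>4} = {label} (n={n}, {pct}%)")
```

Here percentages are computed over all cases, including refusals. Whichever denominator you choose, state it in the codebook so readers know how the percentages were calculated.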
Why You Need a Codebook
Without a codebook, you face serious problems:
Problem 1: Ambiguous codes. You see a variable outcome with values 0 and 1. Does 0 mean success or failure? You’re guessing.
Problem 2: Hidden missing data codes. You calculate mean income and get $45,237. But you didn’t know that -99 means “refused to answer” and you just included those in your calculation. Your results are wrong.
Problem 3: Lost institutional knowledge. Six months later, you can’t remember what var_37_rec means. A collaborator joins your project and has no idea what any variable represents.
Problem 4: Impossible replication. Another researcher wants to replicate your study but can’t figure out how you defined “service completion” or what your inclusion criteria were.
Primary Data vs. Secondary Data: Who Creates the Codebook?
If You’re Collecting Primary Data
You must create the codebook. Start with a preliminary version during study design (listing planned variables, coding schemes, and missing data codes). Update it continuously during data collection and cleaning, documenting any changes, recoding decisions, or data quality issues. Create detailed entries for any derived variables you calculate, including exact formulas and source variables. Finalize the codebook before beginning analysis, and maintain version control as your project evolves.
If You’re Using Secondary Data
You should receive an existing codebook from the data provider. Carefully review this codebook to understand variable definitions, value labels, and missing data codes. Always verify that the codebook matches your actual data file. Check that all variables are documented and that value ranges align with what you observe. Pay special attention to missing data codes, as these vary across datasets (some use -99, others use 9999 or .).
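Verifying a codebook against the data file can be partly automated. The sketch below assumes the codebook has been loaded into a dictionary and the data into a list of row dictionaries; that structure, the function name, and the sample rows (including the site column) are all hypothetical. It reports variables that are undocumented, variables that are documented but absent from the data, and values outside the documented range.

```python
def compare_codebook_to_data(codebook, rows):
    """codebook: {var: {"valid_range": (lo, hi), "missing_codes": set}}; rows: list of dicts."""
    documented = set(codebook)
    data_vars = set().union(*(row.keys() for row in rows))
    report = {
        "undocumented": sorted(data_vars - documented),
        "not_in_data": sorted(documented - data_vars),
        "out_of_range": {},
    }
    for var, spec in codebook.items():
        lo, hi = spec["valid_range"]
        bad = [
            row[var] for row in rows
            if var in row
            and row[var] not in spec["missing_codes"]
            and not lo <= row[var] <= hi
        ]
        if bad:
            report["out_of_range"][var] = bad
    return report


codebook = {
    "age": {"valid_range": (18, 89), "missing_codes": set()},
    "health_self": {"valid_range": (1, 5), "missing_codes": {-99}},
    "bmi": {"valid_range": (12.0, 60.0), "missing_codes": {-99}},
}
rows = [
    {"age": 42, "health_self": 2, "site": "N"},
    {"age": 17, "health_self": -99, "site": "S"},
]
print(compare_codebook_to_data(codebook, rows))
```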
If the codebook is inadequate, unclear, or missing, you can create your own supplementary codebook. You can use your statistical software or Python to generate automated descriptive statistics for each variable (min, max, mean, frequencies). Look for suspicious patterns that might indicate undocumented missing codes (like repeated values of 99, 999, or -99). Make educated guesses about variable meanings based on names and distributions, but clearly document these as assumptions that need verification. Contact the data provider when possible to clarify ambiguities. Most importantly, maintain clear notes about what you know with certainty versus what you’re assuming, so future users (including yourself) understand the limitations.
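As a starting point for a supplementary codebook, a short profiling function can summarise each numeric column and flag values that look like missing codes. The list of suspect codes below is an assumption based on common conventions, and the sample column is made up. A flagged value is only a lead to investigate: 99 could be a real age.

```python
from collections import Counter
from statistics import mean

# Common sentinel values; an assumption to extend or trim for your dataset.
SUSPECT_CODES = {99, 999, 9999, -99, -88, -77}


def profile_numeric(values):
    """Summarise one numeric column and flag values that look like missing codes."""
    counts = Counter(values)
    flagged = SUSPECT_CODES & set(counts)
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": round(mean(values), 2),
        "suspect_codes": {v: counts[v] for v in sorted(flagged)},
    }


# Hypothetical age column from a dataset with no codebook.
ages = [34, 51, 99, 27, 99, 45, -99, 62]
profile = profile_numeric(ages)
print(profile)
```

A negative minimum or a repeated 99 in an age column is exactly the kind of pattern to record in your notes as an assumption that needs verification.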
