[Python] Creating/writing a codebook in Python using pandas and python-docx
This tutorial walks through creating a codebook programmatically: loading data, computing summary statistics, and exporting to a Word document. For background on what codebooks are, why they matter, and the concepts of variable labels and value labels, see Understanding Codebook for Data Analysis.
Traditionally, creating a codebook requires manually computing descriptive statistics for each variable in statistical software (SPSS, Stata, Excel), then copying and pasting results into a Word or PDF document. This process is time-consuming and error-prone, especially for datasets with many variables. With Python, we can automate most of this workflow: load data, compute statistics, format tables, and generate a professional document in seconds.
The Scenario: A Community Mental Health Program Evaluation
Imagine you’re evaluating a community mental health program. Your agency has collected data on 200 clients who received services over the past year. The dataset includes demographic information, clinical assessments, and service utilization records. Before analyzing questions like “Did clients improve over time?” or “Which subgroups benefited most?”, you need to document what you have. A codebook answers basic questions: How many clients are in each age group? What’s the distribution of baseline depression scores? How much missing data do we have?
Let’s build that codebook using Python step by step. We will use a mock community mental health program dataset with 200 clients, including demographics, clinical assessments (PHQ-9, GAD-7), and service utilization records.
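If you don't have a copy of the sample file, the sketch below fabricates a stand-in with the same columns (run it after the imports in Step 1). All values are randomly generated and purely illustrative; the column names and categories match those used throughout this tutorial, but nothing here is real client data.
# Illustrative mock dataset -- random values, not real client records
import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    'client_id': [f'C{i:04d}' for i in range(1, n + 1)],
    'age': rng.integers(18, 80, size=n),
    'gender': rng.choice(['Female', 'Male', 'Non-binary', 'Prefer not to say'], size=n),
    'race_ethnicity': rng.choice(['White', 'Black or African American', 'Hispanic or Latino',
                                  'Asian', 'Multiracial', 'Other'], size=n),
    'primary_language': rng.choice(['English', 'Spanish', 'Korean', 'Mandarin',
                                    'Arabic', 'Other'], size=n),
    'education': rng.choice(['Less than high school', 'High school diploma/GED',
                             'Some college', "Bachelor's degree", 'Graduate degree'], size=n),
    'employment_status': rng.choice(['Employed full-time', 'Employed part-time', 'Unemployed',
                                     'Disabled', 'Retired', 'Student'], size=n),
    'phq9_baseline': rng.integers(0, 28, size=n).astype(float),
    'phq9_followup': rng.integers(0, 28, size=n).astype(float),
    'gad7_baseline': rng.integers(0, 22, size=n).astype(float),
    'sessions_attended': rng.integers(1, 41, size=n),
    'service_type': rng.choice(['Individual therapy', 'Group therapy', 'Case management',
                                'Psychiatric services', 'Crisis intervention'], size=n),
    'insurance_type': rng.choice(['Medicaid', 'Medicare', 'Private insurance',
                                  'Uninsured/Self-pay', 'Other public'], size=n),
})
# Inject some missing values so the codebook has something to report
df.loc[rng.choice(n, size=12, replace=False), 'phq9_followup'] = np.nan
df.to_csv('client_data.csv', index=False)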
Step 1: Setting Up Your Environment
First, we will install and import the libraries we need:
# Install python-docx for creating Word documents
!pip install python-docx -q
# Import libraries
import pandas as pd
import numpy as np
from docx import Document
from docx.shared import Pt, Inches
from docx.enum.text import WD_ALIGN_PARAGRAPH
from datetime import datetime
The !pip install line installs the python-docx package. The exclamation mark tells Colab to run the line as a shell command rather than Python code. The -q flag means “quiet”: it suppresses most of the installation output to keep your notebook clean.
pandas is Python’s main library for working with tabular data. numpy provides mathematical functions. python-docx lets us create Word documents programmatically. If you are new to the pandas library, please read the following post first.
Step 2: Loading Data
Most datasets you’ll work with come as CSV (Comma-Separated Values) files. CSV is a plain text format where each row is a line, and columns are separated by commas. It’s the universal format for tabular data because virtually any software can read it.
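For example, the first few lines of a (hypothetical) client file might look like this when opened in a text editor:
client_id,age,gender,phq9_baseline
C0001,34,Female,14
C0002,52,Male,7
C0003,29,Non-binary,19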
The basic syntax for loading a CSV:
df = pd.read_csv('client_data.csv')
df.head()
Step 3: Exploring Your Data
Before computing codebook statistics, examine what you’ve loaded. This catches problems early.
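A first sanity check is simply the size of the table, so you know you loaded what you expected:
# Confirm the expected number of rows (clients) and columns (variables)
print(f"{df.shape[0]} rows, {df.shape[1]} columns")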
# Check data types
df.dtypes
Data types matter. You’ll see int64 (integers), float64 (decimals), and object (text/strings). Variables that should be numeric but show as object probably have text mixed in, a common data-cleaning issue.
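If a column that should be numeric shows up as object, one common fix is pd.to_numeric with errors='coerce', which converts unparseable entries to NaN so you can find and clean them. A minimal sketch (the column name here is just an example):
# Coerce a mistyped column to numeric; unparseable entries become NaN
df['phq9_baseline'] = pd.to_numeric(df['phq9_baseline'], errors='coerce')
print(df['phq9_baseline'].isna().sum(), "values could not be parsed")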
# Quick summary statistics
df.describe()
The .describe() method shows count, mean, standard deviation, min, max, and quartiles for numeric columns. For text columns, use df.describe(include='object') to see frequency information.
df.describe(include='object')
Step 4: Understanding Variable Types
Different variables require different statistics.
Continuous variables can take any numeric value within a range.
- Example: Age in years, PHQ-9 scores (0-27), and session counts are continuous.
- Codebook: We compute mean, median, standard deviation, and range.
Categorical variables represent distinct groups.
- Example: Gender, race/ethnicity, and insurance type are categorical. You’re in one group or another, with no ordering implied.
- Codebook: We report frequency counts and percentages.
Ordinal variables are categorical but with a natural order.
- Example: Education level (less than high school → graduate degree) is ordinal. The categories have a sequence, but the “distance” between them isn’t necessarily equal.
- Codebook: We treat ordinal variables like categorical for a codebook, reporting frequencies rather than means. (The sketch below shows how to keep categories in their natural order.)
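If you want a frequency table to print in the natural category order rather than by count, pandas ordered categoricals can help. A minimal sketch, assuming the education variable described above:
# Preserve the natural ordering of an ordinal variable in output
edu_order = ['Less than high school', 'High school diploma/GED', 'Some college',
             "Bachelor's degree", 'Graduate degree']
df['education'] = pd.Categorical(df['education'], categories=edu_order, ordered=True)
print(df['education'].value_counts(sort=False))  # frequencies in category order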
For our dataset:
Continuous: age, phq9_baseline, phq9_followup, gad7_baseline, sessions_attended
Categorical: gender, race_ethnicity, primary_language, education, employment_status, service_type, insurance_type
The client_id is an identifier, not really a variable for analysis—we’ll skip it in the codebook.
Step 5: Variable Labels vs. Value Labels
A professional codebook includes two types of labels that help readers understand your data. These concepts come from statistical software like SPSS and Stata, but we can implement them in Python too.
We define each variable with its label, type, and value labels. For details on what variable labels and value labels are, see Understanding Codebook for Data Analysis. Without variable labels, a reader seeing gad7_baseline in your analysis output has to guess what GAD-7 stands for (Generalized Anxiety Disorder 7-item scale) and what the numbers mean. Without value labels, a dataset storing gender as 1/2/3/4 is meaningless unless you document which number corresponds to which category. Even worse, different datasets might use different coding schemes—in one dataset, 1 might mean “Male” while in another it means “Female.”
A well-constructed codebook includes both types of labels for every variable, making your data self-documenting and your analysis reproducible.
Step 6: Building the Statistics Functions
Let’s create functions that compute appropriate statistics based on variable type.
def compute_continuous_stats(series):
"""
Compute summary statistics for a continuous variable.
Parameters:
-----------
series : pandas Series
The variable to summarize
Returns:
--------
dict with statistical summaries
"""
return {
'n_valid': int(series.notna().sum()),
'n_missing': int(series.isna().sum()),
'pct_missing': round(series.isna().mean() * 100, 1),
'mean': round(series.mean(), 2),
'std': round(series.std(), 2),
'median': round(series.median(), 2),
'min': int(series.min()) if series.notna().any() else None,
'max': int(series.max()) if series.notna().any() else None
}
def compute_categorical_stats(series):
"""
Compute frequency distribution for a categorical variable.
Parameters:
-----------
series : pandas Series
The variable to summarize
Returns:
--------
dict with frequency information
"""
value_counts = series.value_counts(dropna=True)
pct_counts = series.value_counts(normalize=True, dropna=True) * 100
categories = []
for value in value_counts.index:
categories.append({
'value': str(value),
'count': int(value_counts[value]),
'percent': round(pct_counts[value], 1)
})
return {
'n_valid': int(series.notna().sum()),
'n_missing': int(series.isna().sum()),
'pct_missing': round(series.isna().mean() * 100, 1),
'categories': categories
}
These functions take a pandas Series (a single column) and return dictionaries of computed statistics. Separating the logic into functions makes the code reusable and easier to test.
Notice that we handle missing values explicitly. The .notna() method identifies non-missing values, and dropna=True in value_counts() excludes missing from the percentages. This is important because “10% missing” means something different from “10% chose this response option.”
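Before wiring everything together, it is worth spot-checking the functions on a single column each:
# Quick spot-check of both helpers
print(compute_continuous_stats(df['age']))
print(compute_categorical_stats(df['gender']))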
Step 7: Defining Variable Metadata with Labels
Based on Step 5, now we create a dictionary that includes both variable labels and value labels for each variable:
# Define variables for the codebook
# Each entry: column_name -> {label, type, value_labels (optional)}
variable_config = {
'age': {
'label': 'Age in years',
'type': 'continuous',
'value_labels': None # Continuous variables don't have value labels
},
'gender': {
'label': 'Gender identity',
'type': 'categorical',
'value_labels': {
'Female': 'Identifies as female',
'Male': 'Identifies as male',
'Non-binary': 'Identifies as non-binary',
'Prefer not to say': 'Declined to respond'
}
},
'race_ethnicity': {
'label': 'Race/Ethnicity (self-reported)',
'type': 'categorical',
'value_labels': {
'White': 'White, non-Hispanic',
'Black or African American': 'Black or African American, non-Hispanic',
'Hispanic or Latino': 'Hispanic or Latino, any race',
'Asian': 'Asian, non-Hispanic',
'Multiracial': 'Two or more races',
'Other': 'Other race/ethnicity'
}
},
'primary_language': {
'label': 'Primary language spoken at home',
'type': 'categorical',
'value_labels': {
'English': 'English',
'Spanish': 'Spanish',
'Korean': 'Korean',
'Mandarin': 'Mandarin Chinese',
'Arabic': 'Arabic',
'Other': 'Other language'
}
},
'education': {
'label': 'Highest education level completed',
'type': 'categorical',
'value_labels': {
'Less than high school': 'Did not complete high school',
'High school diploma/GED': 'High school graduate or equivalent',
'Some college': 'Some college, no degree',
"Bachelor's degree": 'Four-year college degree',
'Graduate degree': 'Master\'s, doctoral, or professional degree'
}
},
'employment_status': {
'label': 'Current employment status',
'type': 'categorical',
'value_labels': {
'Employed full-time': 'Working 35+ hours per week',
'Employed part-time': 'Working less than 35 hours per week',
'Unemployed': 'Not employed, seeking work',
'Disabled': 'Not employed due to disability',
'Retired': 'Retired from work',
'Student': 'Full-time student'
}
},
'phq9_baseline': {
'label': 'PHQ-9 depression score at intake (0-27)',
'type': 'continuous',
'value_labels': {
'0-4': 'Minimal depression',
'5-9': 'Mild depression',
'10-14': 'Moderate depression',
'15-19': 'Moderately severe depression',
'20-27': 'Severe depression'
} # Clinical interpretation ranges, not raw value labels
},
'phq9_followup': {
'label': 'PHQ-9 depression score at follow-up (0-27)',
'type': 'continuous',
'value_labels': {
'0-4': 'Minimal depression',
'5-9': 'Mild depression',
'10-14': 'Moderate depression',
'15-19': 'Moderately severe depression',
'20-27': 'Severe depression'
}
},
'gad7_baseline': {
'label': 'GAD-7 anxiety score at intake (0-21)',
'type': 'continuous',
'value_labels': {
'0-4': 'Minimal anxiety',
'5-9': 'Mild anxiety',
'10-14': 'Moderate anxiety',
'15-21': 'Severe anxiety'
}
},
'sessions_attended': {
'label': 'Number of therapy sessions attended',
'type': 'continuous',
'value_labels': None
},
'service_type': {
'label': 'Primary service received',
'type': 'categorical',
'value_labels': {
'Individual therapy': 'One-on-one therapy sessions',
'Group therapy': 'Therapy in a group setting',
'Case management': 'Care coordination and resource linkage',
'Psychiatric services': 'Medication management with psychiatrist',
'Crisis intervention': 'Emergency mental health services'
}
},
'insurance_type': {
'label': 'Insurance/payment type',
'type': 'categorical',
'value_labels': {
'Medicaid': 'Medicaid (state insurance for low-income)',
'Medicare': 'Medicare (federal insurance for 65+ or disabled)',
'Private insurance': 'Employer-sponsored or individual private plan',
'Uninsured/Self-pay': 'No insurance, paying out of pocket',
'Other public': 'Other government program (VA, CHIP, etc.)'
}
}
}
This configuration dictionary serves as the single source of truth for your codebook. The variable label describes what is being measured. The value labels explain what each category means or, for clinical scales, provide interpretation guidelines. You can build this dictionary by hand in a spreadsheet or directly in Python, but LLMs can help generate it from your questionnaire or data dictionary. First, extract variable names and unique values from your data:
# Get variable names
print(df.columns.tolist())
# Get unique values for each categorical variable
for col in df.select_dtypes(include='object').columns:
print(f"{col}: {df[col].unique().tolist()}")Here is a prompt example to create the dictionary command for codebook based on the questionnaries:
Generate a Python dictionary called `variable_config` for creating a codebook.
**Variable names from my data:** [add your variable list from the code above]
**Unique values in categorical variables:**
gender: [add unique values from the code above]
...
**Questionnaire/instrument (if applicable):**
[Paste or attach your survey instrument here]
For each variable, include:
- 'label': human-readable description
- 'type': 'continuous' or 'categorical'
- 'value_labels': dictionary mapping each value to its description
Output as Python dictionary format.
Important: the LLM output is a starting point. Because it can hallucinate, always verify the labels against your actual documentation.
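A quick way to catch mismatches is to compare the generated dictionary against the actual DataFrame before going further; a minimal sketch:
# Verify the generated config against the actual data
missing_cols = [v for v in variable_config if v not in df.columns]
print("In config but not in data:", missing_cols)
undocumented = [c for c in df.columns
                if c not in variable_config and c != 'client_id']
print("In data but not in config:", undocumented)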
Step 8: Computing All Statistics
Let’s loop through our variable configuration and compute statistics for each:
# Compute statistics for all variables
codebook_entries = []
for var_name, config in variable_config.items():
entry = {
'name': var_name,
'label': config['label'],
'type': config['type'],
'value_labels': config['value_labels']
}
if config['type'] == 'continuous':
stats = compute_continuous_stats(df[var_name])
entry.update(stats)
entry['categories'] = None
else:
stats = compute_categorical_stats(df[var_name])
entry.update(stats)
entry['mean'] = None
entry['std'] = None
entry['median'] = None
entry['min'] = None
entry['max'] = None
codebook_entries.append(entry)
print(f"✓ Processed: {var_name}")
print(f"\nCodebook contains {len(codebook_entries)} variables")The entry.update(stats) line merges the computed statistics into our entry dictionary. After this loop, codebook_entries is a list of dictionaries, each containing complete information about one variable including its labels.
Step 9: Creating the Word Document
Now we turn these statistics into a professional document. The python-docx library lets us create Word files with headers, paragraphs, and tables.
Understanding python-docx Structure
A Word document has a hierarchy: the Document contains paragraphs and tables, paragraphs contain “runs” of text with consistent formatting, and tables contain rows of cells.
# Basic example
doc = Document()
doc.add_heading('Title', level=0)
doc.add_paragraph('This is a paragraph.')
doc.save('example.docx')
Here’s the complete function to generate our codebook, now including value labels:
def create_codebook_document(codebook_entries, title, description, output_path):
"""
Generate a professional codebook as a Word document.
Parameters:
-----------
codebook_entries : list of dicts
Variable information from compute functions
title : str
Document title
description : str
Dataset description for the overview section
output_path : str
Where to save the .docx file
"""
doc = Document()
# ===== TITLE =====
title_para = doc.add_heading('Codebook', level=0)
title_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
subtitle = doc.add_paragraph(title)
subtitle.alignment = WD_ALIGN_PARAGRAPH.CENTER
date_para = doc.add_paragraph(f'Generated: {datetime.now().strftime("%B %d, %Y")}')
date_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
doc.add_paragraph() # Spacing
# ===== OVERVIEW =====
doc.add_heading('Overview', level=1)
doc.add_paragraph(description)
# Dataset summary stats
n_obs = codebook_entries[0]['n_valid'] + codebook_entries[0]['n_missing']
n_vars = len(codebook_entries)
summary_text = f"The dataset contains {n_obs:,} observations and {n_vars} variables."
doc.add_paragraph(summary_text)
# ===== VARIABLE SUMMARY TABLE =====
doc.add_heading('Variable Summary', level=1)
summary_table = doc.add_table(rows=1, cols=4)
summary_table.style = 'Table Grid'
# Header row
headers = ['Variable', 'Label', 'Type', 'Missing']
header_cells = summary_table.rows[0].cells
for i, header in enumerate(headers):
header_cells[i].text = header
for run in header_cells[i].paragraphs[0].runs:
run.bold = True
# Data rows
for entry in codebook_entries:
row_cells = summary_table.add_row().cells
row_cells[0].text = entry['name']
row_cells[1].text = entry['label']
row_cells[2].text = entry['type'].capitalize()
row_cells[3].text = f"{entry['pct_missing']}%"
# Page break before detailed section
doc.add_page_break()
# ===== DETAILED VARIABLE DESCRIPTIONS =====
doc.add_heading('Detailed Variable Descriptions', level=1)
for entry in codebook_entries:
# Variable heading
doc.add_heading(f"{entry['name']}", level=2)
# Variable label (the human-readable description)
label_para = doc.add_paragraph()
label_run = label_para.add_run('Variable Label: ')
label_run.bold = True
label_para.add_run(entry['label'])
# Basic metadata
meta_text = (
f"Type: {entry['type'].capitalize()} | "
f"Valid N: {entry['n_valid']:,} | "
f"Missing: {entry['n_missing']:,} ({entry['pct_missing']}%)"
)
meta_para = doc.add_paragraph(meta_text)
meta_para.runs[0].italic = True
if entry['type'] == 'continuous':
# Statistics for continuous variables
stats_text = (
f"Mean: {entry['mean']} | "
f"SD: {entry['std']} | "
f"Median: {entry['median']} | "
f"Range: {entry['min']} – {entry['max']}"
)
doc.add_paragraph(stats_text)
# Value labels for continuous (clinical interpretation ranges)
if entry['value_labels']:
doc.add_paragraph()
interp_heading = doc.add_paragraph()
interp_run = interp_heading.add_run('Clinical Interpretation:')
interp_run.bold = True
interp_table = doc.add_table(rows=1, cols=2)
interp_table.style = 'Table Grid'
interp_table.rows[0].cells[0].text = 'Score Range'
interp_table.rows[0].cells[1].text = 'Interpretation'
for run in interp_table.rows[0].cells[0].paragraphs[0].runs:
run.bold = True
for run in interp_table.rows[0].cells[1].paragraphs[0].runs:
run.bold = True
for range_val, interpretation in entry['value_labels'].items():
row = interp_table.add_row().cells
row[0].text = range_val
row[1].text = interpretation
else:
# Frequency table for categorical variables with value labels
if entry['categories']:
doc.add_paragraph()
freq_heading = doc.add_paragraph()
freq_run = freq_heading.add_run('Value Labels and Frequencies:')
freq_run.bold = True
# Determine if we have value labels to add a description column
has_value_labels = entry['value_labels'] is not None
n_cols = 4 if has_value_labels else 3
freq_table = doc.add_table(rows=1, cols=n_cols)
freq_table.style = 'Table Grid'
# Header
freq_headers = ['Value', 'Description', 'Count', 'Percent'] if has_value_labels else ['Value', 'Count', 'Percent']
freq_header_cells = freq_table.rows[0].cells
for i, h in enumerate(freq_headers):
freq_header_cells[i].text = h
for run in freq_header_cells[i].paragraphs[0].runs:
run.bold = True
# Data rows
for cat in entry['categories']:
row_cells = freq_table.add_row().cells
row_cells[0].text = cat['value']
if has_value_labels:
# Get the value label description
description = entry['value_labels'].get(cat['value'], '')
row_cells[1].text = description
row_cells[2].text = f"{cat['count']:,}"
row_cells[3].text = f"{cat['percent']}%"
else:
row_cells[1].text = f"{cat['count']:,}"
row_cells[2].text = f"{cat['percent']}%"
doc.add_paragraph() # Spacing between variables
# ===== SAVE =====
doc.save(output_path)
print(f"✓ Codebook saved to: {output_path}")Now let’s generate the document:
# Create the codebook
output_path = 'mental_health_codebook.docx'
description = (
"This codebook documents variables from a community mental health program "
"evaluation. Data were collected from clients receiving services between "
"January and December 2024. The dataset includes demographic information, "
"clinical assessments (PHQ-9 for depression, GAD-7 for anxiety), and "
"service utilization records."
)
create_codebook_document(
codebook_entries=codebook_entries,
title='Community Mental Health Program\nClient Services Dataset 2024',
description=description,
output_path=output_path
)
The Complete Code
Here’s everything assembled into a single workflow you can copy and run:
# ================================================================
# CODEBOOK GENERATOR
# Community Mental Health Program Evaluation
# ================================================================
# ----- SETUP -----
from google.colab import drive
drive.mount('/content/drive')
!pip install python-docx -q
import pandas as pd
import numpy as np
from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
from datetime import datetime
# ----- LOAD DATA -----
df = pd.read_csv('/content/drive/MyDrive/client_data.csv')
print(f"Dataset: {df.shape[0]} clients, {df.shape[1]} variables")
# ----- STATISTICS FUNCTIONS -----
def compute_continuous_stats(series):
return {
'n_valid': int(series.notna().sum()),
'n_missing': int(series.isna().sum()),
'pct_missing': round(series.isna().mean() * 100, 1),
'mean': round(series.mean(), 2),
'std': round(series.std(), 2),
'median': round(series.median(), 2),
'min': int(series.min()) if series.notna().any() else None,
'max': int(series.max()) if series.notna().any() else None
}
def compute_categorical_stats(series):
vc = series.value_counts(dropna=True)
pct = series.value_counts(normalize=True, dropna=True) * 100
categories = [{'value': str(v), 'count': int(vc[v]), 'percent': round(pct[v], 1)}
for v in vc.index]
return {
'n_valid': int(series.notna().sum()),
'n_missing': int(series.isna().sum()),
'pct_missing': round(series.isna().mean() * 100, 1),
'categories': categories
}
# ----- VARIABLE CONFIGURATION WITH LABELS -----
variable_config = {
'age': {
'label': 'Age in years',
'type': 'continuous',
'value_labels': None
},
'gender': {
'label': 'Gender identity',
'type': 'categorical',
'value_labels': {
'Female': 'Identifies as female',
'Male': 'Identifies as male',
'Non-binary': 'Identifies as non-binary',
'Prefer not to say': 'Declined to respond'
}
},
'race_ethnicity': {
'label': 'Race/Ethnicity (self-reported)',
'type': 'categorical',
'value_labels': {
'White': 'White, non-Hispanic',
'Black or African American': 'Black or African American, non-Hispanic',
'Hispanic or Latino': 'Hispanic or Latino, any race',
'Asian': 'Asian, non-Hispanic',
'Multiracial': 'Two or more races',
'Other': 'Other race/ethnicity'
}
},
'primary_language': {
'label': 'Primary language spoken at home',
'type': 'categorical',
'value_labels': {
'English': 'English',
'Spanish': 'Spanish',
'Korean': 'Korean',
'Mandarin': 'Mandarin Chinese',
'Arabic': 'Arabic',
'Other': 'Other language'
}
},
'education': {
'label': 'Highest education level completed',
'type': 'categorical',
'value_labels': {
'Less than high school': 'Did not complete high school',
'High school diploma/GED': 'High school graduate or equivalent',
'Some college': 'Some college, no degree',
"Bachelor's degree": 'Four-year college degree',
'Graduate degree': "Master's, doctoral, or professional degree"
}
},
'employment_status': {
'label': 'Current employment status',
'type': 'categorical',
'value_labels': {
'Employed full-time': 'Working 35+ hours per week',
'Employed part-time': 'Working less than 35 hours per week',
'Unemployed': 'Not employed, seeking work',
'Disabled': 'Not employed due to disability',
'Retired': 'Retired from work',
'Student': 'Full-time student'
}
},
'phq9_baseline': {
'label': 'PHQ-9 depression score at intake (0-27)',
'type': 'continuous',
'value_labels': {
'0-4': 'Minimal depression',
'5-9': 'Mild depression',
'10-14': 'Moderate depression',
'15-19': 'Moderately severe depression',
'20-27': 'Severe depression'
}
},
'phq9_followup': {
'label': 'PHQ-9 depression score at follow-up (0-27)',
'type': 'continuous',
'value_labels': {
'0-4': 'Minimal depression',
'5-9': 'Mild depression',
'10-14': 'Moderate depression',
'15-19': 'Moderately severe depression',
'20-27': 'Severe depression'
}
},
'gad7_baseline': {
'label': 'GAD-7 anxiety score at intake (0-21)',
'type': 'continuous',
'value_labels': {
'0-4': 'Minimal anxiety',
'5-9': 'Mild anxiety',
'10-14': 'Moderate anxiety',
'15-21': 'Severe anxiety'
}
},
'sessions_attended': {
'label': 'Number of therapy sessions attended',
'type': 'continuous',
'value_labels': None
},
'service_type': {
'label': 'Primary service received',
'type': 'categorical',
'value_labels': {
'Individual therapy': 'One-on-one therapy sessions',
'Group therapy': 'Therapy in a group setting',
'Case management': 'Care coordination and resource linkage',
'Psychiatric services': 'Medication management with psychiatrist',
'Crisis intervention': 'Emergency mental health services'
}
},
'insurance_type': {
'label': 'Insurance/payment type',
'type': 'categorical',
'value_labels': {
'Medicaid': 'Medicaid (state insurance for low-income)',
'Medicare': 'Medicare (federal insurance for 65+ or disabled)',
'Private insurance': 'Employer-sponsored or individual private plan',
'Uninsured/Self-pay': 'No insurance, paying out of pocket',
'Other public': 'Other government program (VA, CHIP, etc.)'
}
}
}
# ----- COMPUTE STATISTICS -----
codebook_entries = []
for var_name, config in variable_config.items():
entry = {
'name': var_name,
'label': config['label'],
'type': config['type'],
'value_labels': config['value_labels']
}
if config['type'] == 'continuous':
entry.update(compute_continuous_stats(df[var_name]))
entry['categories'] = None
else:
entry.update(compute_categorical_stats(df[var_name]))
entry['mean'] = entry['std'] = entry['median'] = None
entry['min'] = entry['max'] = None
codebook_entries.append(entry)
print(f"✓ {var_name}")
# ----- CREATE WORD DOCUMENT -----
def create_codebook_document(entries, title, description, output_path):
doc = Document()
# Title page
doc.add_heading('Codebook', level=0).alignment = WD_ALIGN_PARAGRAPH.CENTER
doc.add_paragraph(title).alignment = WD_ALIGN_PARAGRAPH.CENTER
doc.add_paragraph(f'Generated: {datetime.now().strftime("%B %d, %Y")}').alignment = WD_ALIGN_PARAGRAPH.CENTER
doc.add_paragraph()
# Overview
doc.add_heading('Overview', level=1)
doc.add_paragraph(description)
n_obs = entries[0]['n_valid'] + entries[0]['n_missing']
doc.add_paragraph(f"The dataset contains {n_obs:,} observations and {len(entries)} variables.")
# Summary table
doc.add_heading('Variable Summary', level=1)
table = doc.add_table(rows=1, cols=4)
table.style = 'Table Grid'
for i, h in enumerate(['Variable', 'Label', 'Type', 'Missing']):
table.rows[0].cells[i].text = h
table.rows[0].cells[i].paragraphs[0].runs[0].bold = True
for e in entries:
row = table.add_row().cells
row[0].text, row[1].text = e['name'], e['label']
row[2].text, row[3].text = e['type'].capitalize(), f"{e['pct_missing']}%"
doc.add_page_break()
# Detailed descriptions
doc.add_heading('Detailed Variable Descriptions', level=1)
for e in entries:
doc.add_heading(e['name'], level=2)
# Variable label
label_para = doc.add_paragraph()
label_para.add_run('Variable Label: ').bold = True
label_para.add_run(e['label'])
# Metadata
meta = f"Type: {e['type'].capitalize()} | Valid N: {e['n_valid']:,} | Missing: {e['n_missing']:,} ({e['pct_missing']}%)"
doc.add_paragraph(meta).runs[0].italic = True
if e['type'] == 'continuous':
doc.add_paragraph(f"Mean: {e['mean']} | SD: {e['std']} | Median: {e['median']} | Range: {e['min']} – {e['max']}")
# Clinical interpretation for scales
if e['value_labels']:
doc.add_paragraph()
doc.add_paragraph().add_run('Clinical Interpretation:').bold = True
it = doc.add_table(rows=1, cols=2)
it.style = 'Table Grid'
it.rows[0].cells[0].text = 'Score Range'
it.rows[0].cells[1].text = 'Interpretation'
it.rows[0].cells[0].paragraphs[0].runs[0].bold = True
it.rows[0].cells[1].paragraphs[0].runs[0].bold = True
for rng, interp in e['value_labels'].items():
r = it.add_row().cells
r[0].text, r[1].text = rng, interp
elif e['categories']:
doc.add_paragraph()
doc.add_paragraph().add_run('Value Labels and Frequencies:').bold = True
has_vl = e['value_labels'] is not None
ft = doc.add_table(rows=1, cols=4 if has_vl else 3)
ft.style = 'Table Grid'
hdrs = ['Value', 'Description', 'Count', 'Percent'] if has_vl else ['Value', 'Count', 'Percent']
for i, h in enumerate(hdrs):
ft.rows[0].cells[i].text = h
ft.rows[0].cells[i].paragraphs[0].runs[0].bold = True
for cat in e['categories']:
r = ft.add_row().cells
r[0].text = cat['value']
if has_vl:
r[1].text = e['value_labels'].get(cat['value'], '')
r[2].text, r[3].text = f"{cat['count']:,}", f"{cat['percent']}%"
else:
r[1].text, r[2].text = f"{cat['count']:,}", f"{cat['percent']}%"
doc.add_paragraph()
doc.save(output_path)
print(f"\n✓ Codebook saved: {output_path}")
# ----- GENERATE -----
output_path = '/content/drive/MyDrive/mental_health_codebook.docx'
description = (
"This codebook documents variables from a community mental health program "
"evaluation. Data were collected from clients receiving services between "
"January and December 2024. Variables include demographics, clinical "
"assessments (PHQ-9, GAD-7), and service utilization."
)
create_codebook_document(codebook_entries,
'Community Mental Health Program\nClient Services Dataset 2024',
description, output_path)
This code generates the codebook document in MS Word format, as shown below:
[Screenshots of the generated codebook document]
Troubleshooting
“KeyError: variable not found”: The variable name in your configuration doesn’t match the DataFrame column. Check spelling and capitalization with df.columns.tolist().
Value labels don’t match actual data values: If your data has values that aren’t in your value_labels dictionary, those rows will show empty descriptions. Make sure every possible value in your data has a corresponding label.
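A small check, assuming the variable_config from Step 7, lists any observed values that lack a label:
# Find observed values that have no entry in value_labels
for var, cfg in variable_config.items():
    if cfg['type'] == 'categorical' and cfg['value_labels']:
        unlabeled = set(df[var].dropna().astype(str)) - set(cfg['value_labels'])
        if unlabeled:
            print(f"{var}: unlabeled values {unlabeled}")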
Word document won’t open: Corrupted XML from python-docx is rare but possible. Try a simpler document first to isolate the problem. Make sure all table rows have the correct number of cells.
Missing data percentages seem wrong: Remember that categorical percentages exclude missing values (they show distribution among valid responses only). This is standard practice but worth noting in your codebook.
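If you instead want percentages computed over all rows, with missing shown as its own row, value_counts can do that directly:
# Percentages over all rows, counting missing (NaN) as its own category
print(df['gender'].value_counts(dropna=False, normalize=True) * 100)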
Bonus: Using LLMs to Draft a Complete Codebook
Instead of manually writing the variable_config dictionary and statistics functions, you can ask an LLM to generate the complete Python code. First, extract your data structure:
# Generate summary for LLM
print("=== DATA SUMMARY FOR CODEBOOK ===\n")
print(f"Dataset: {df.shape[0]} rows, {df.shape[1]} columns\n")
for col in df.columns:
print(f"--- {col} ---")
print(f"Type: {df[col].dtype}")
print(f"Missing: {df[col].isna().sum()} ({df[col].isna().mean()*100:.1f}%)")
if df[col].dtype in ['int64', 'float64']:
print(f"Mean: {df[col].mean():.2f}, SD: {df[col].std():.2f}")
print(f"Range: {df[col].min()} - {df[col].max()}")
else:
print(f"Values: {df[col].value_counts().to_dict()}")
print()
Here is an example prompt using the information extracted above:
Prompt:
Write Python code that creates a codebook Word document for my dataset.
**Data structure:**
Filename: client_data.csv
Shape: 200 rows, 13 columns
Columns and unique values:
[Copy and paste the output from the above code]
**Questionnaire/instruments:**
[Add further information about specific scales, if applicable]
**Requirements:**
1. Load data from CSV
2. Define variable_config with labels and value_labels for all variables
3. Compute descriptive statistics (mean/SD/median/range for continuous, frequencies for categorical)
4. Export to Word document using python-docx with:
- Title page
- Variable summary table
- Detailed variable descriptions with statistics
Write complete, runnable code for Google Colab.
The LLM will generate complete code customized to your specific variables and value labels.
Resources
python-docx Documentation: https://python-docx.readthedocs.io/