[Python] Creating/writing a codebook in Python using pandas and python-docx
This tutorial walks through creating a codebook programmatically: loading data, computing summary statistics, and exporting to a Word document. For background on what codebooks are, why they matter, and the concepts of variable labels and value labels, see Understanding Codebook for Data Analysis.
Traditionally, creating a codebook requires manually computing descriptive statistics for each variable in statistical software (SPSS, Stata, Excel), then copying and pasting results into a Word or PDF document. This process is time-consuming and error-prone, especially for datasets with many variables. With Python, we can automate most of this workflow: load data, compute statistics, format tables, and generate a professional document in seconds.
The Scenario: A Community Mental Health Program Evaluation
Imagine you’re evaluating a community mental health program. Your agency has collected data on 200 clients who received services over the past year. The dataset includes demographic information, clinical assessments, and service utilization records. Before analyzing questions like “Did clients improve over time?” or “Which subgroups benefited most?”, you need to document what you have. A codebook answers basic questions: How many clients are in each age group? What’s the distribution of baseline depression scores? How much missing data do we have?
Let’s build that codebook using Python step by step. We will use a mock community mental health program dataset with 200 clients, including demographics, clinical assessments (PHQ-9, GAD-7), and service utilization records.
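If you don't have a copy of the sample file, the sketch below fabricates a stand-in with the same columns (run it after the imports in Step 1). All values are randomly generated and purely illustrative; the column names and categories match those used throughout this tutorial, but nothing here is real client data.
# Illustrative mock dataset -- random values, not real client records
import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    'client_id': [f'C{i:04d}' for i in range(1, n + 1)],
    'age': rng.integers(18, 80, size=n),
    'gender': rng.choice(['Female', 'Male', 'Non-binary', 'Prefer not to say'], size=n),
    'race_ethnicity': rng.choice(['White', 'Black or African American', 'Hispanic or Latino',
                                  'Asian', 'Multiracial', 'Other'], size=n),
    'primary_language': rng.choice(['English', 'Spanish', 'Korean', 'Mandarin',
                                    'Arabic', 'Other'], size=n),
    'education': rng.choice(['Less than high school', 'High school diploma/GED',
                             'Some college', "Bachelor's degree", 'Graduate degree'], size=n),
    'employment_status': rng.choice(['Employed full-time', 'Employed part-time', 'Unemployed',
                                     'Disabled', 'Retired', 'Student'], size=n),
    'phq9_baseline': rng.integers(0, 28, size=n).astype(float),
    'phq9_followup': rng.integers(0, 28, size=n).astype(float),
    'gad7_baseline': rng.integers(0, 22, size=n).astype(float),
    'sessions_attended': rng.integers(1, 41, size=n),
    'service_type': rng.choice(['Individual therapy', 'Group therapy', 'Case management',
                                'Psychiatric services', 'Crisis intervention'], size=n),
    'insurance_type': rng.choice(['Medicaid', 'Medicare', 'Private insurance',
                                  'Uninsured/Self-pay', 'Other public'], size=n),
})
# Inject some missing values so the codebook has something to report
df.loc[rng.choice(n, size=12, replace=False), 'phq9_followup'] = np.nan
df.to_csv('client_data.csv', index=False)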
Step 1: Setting Up Your Environment
First, we will install and import the libraries we need:
# Install python-docx for creating Word documents
!pip install python-docx -q
# Import libraries
import pandas as pd
import numpy as np
from docx import Document
from docx.shared import Pt, Inches
from docx.enum.text import WD_ALIGN_PARAGRAPH
from datetime import datetime
The !pip install line installs the python-docx package. The exclamation mark tells Colab to run the line as a shell command rather than Python code. The -q flag means “quiet”: it suppresses most of the installation output to keep your notebook clean.
pandas is Python’s main library for working with tabular data. numpy provides mathematical functions. python-docx lets us create Word documents programmatically. If you are new to the pandas library, please read the following post first.
Step 2: Loading Data
Most datasets you’ll work with come as CSV (Comma-Separated Values) files. CSV is a plain text format where each row is a line, and columns are separated by commas. It’s the universal format for tabular data because virtually any software can read it.
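For example, the first few lines of a (hypothetical) client file might look like this when opened in a text editor:
client_id,age,gender,phq9_baseline
C0001,34,Female,14
C0002,52,Male,7
C0003,29,Non-binary,19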
The basic syntax for loading a CSV:
df = pd.read_csv('client_data.csv')
df.head()
Step 3: Exploring Your Data
Before computing codebook statistics, examine what you’ve loaded. This catches problems early.
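A first sanity check is simply the size of the table, so you know you loaded what you expected:
# Confirm the expected number of rows (clients) and columns (variables)
print(f"{df.shape[0]} rows, {df.shape[1]} columns")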
# Check data types
df.dtypes
Data types matter. You’ll see int64 (integers), float64 (decimals), and object (text/strings). Variables that should be numeric but show as object probably have text mixed in, a common data-cleaning issue.
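If a column that should be numeric shows up as object, one common fix is pd.to_numeric with errors='coerce', which converts unparseable entries to NaN so you can find and clean them. A minimal sketch (the column name here is just an example):
# Coerce a mistyped column to numeric; unparseable entries become NaN
df['phq9_baseline'] = pd.to_numeric(df['phq9_baseline'], errors='coerce')
print(df['phq9_baseline'].isna().sum(), "values could not be parsed")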
# Quick summary statistics
df.describe()
The .describe() method shows count, mean, standard deviation, min, max, and quartiles for numeric columns. For text columns, use df.describe(include='object') to see frequency information.
df.describe(include='object')
Step 4: Understanding Variable Types
Different variables require different statistics.
Continuous variables can take any numeric value within a range.
- Example: Age in years, PHQ-9 scores (0-27), and session counts are continuous.
- Codebook: We compute mean, median, standard deviation, and range.
Categorical variables represent distinct groups.
- Example: Gender, race/ethnicity, and insurance type are categorical. You’re in one group or another, with no ordering implied.
- Codebook: We report frequency counts and percentages.
Ordinal variables are categorical but with a natural order.
- Example: Education level (less than high school → graduate degree) is ordinal. The categories have a sequence, but the “distance” between them isn’t necessarily equal.
- Codebook: We treat ordinal variables like categorical for a codebook, reporting frequencies rather than means. (The sketch below shows how to keep categories in their natural order.)
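If you want a frequency table to print in the natural category order rather than by count, pandas ordered categoricals can help. A minimal sketch, assuming the education variable described above:
# Preserve the natural ordering of an ordinal variable in output
edu_order = ['Less than high school', 'High school diploma/GED', 'Some college',
             "Bachelor's degree", 'Graduate degree']
df['education'] = pd.Categorical(df['education'], categories=edu_order, ordered=True)
print(df['education'].value_counts(sort=False))  # frequencies in category order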
For our dataset:
Continuous: age, phq9_baseline, phq9_followup, gad7_baseline, sessions_attended
Categorical: gender, race_ethnicity, primary_language, education, employment_status, service_type, insurance_type
The client_id is an identifier, not really a variable for analysis—we’ll skip it in the codebook.
Step 5: Variable Labels vs. Value Labels
A professional codebook includes two types of labels that help readers understand your data. These concepts come from statistical software like SPSS and Stata, but we can implement them in Python too.
We define each variable with its label, type, and value labels. For details on what variable labels and value labels are, see Understanding Codebook for Data Analysis. Without variable labels, a reader seeing gad7_baseline in your analysis output has to guess what GAD-7 stands for (Generalized Anxiety Disorder 7-item scale) and what the numbers mean. Without value labels, a dataset storing gender as 1/2/3/4 is meaningless unless you document which number corresponds to which category. Even worse, different datasets might use different coding schemes—in one dataset, 1 might mean “Male” while in another it means “Female.”
A well-constructed codebook includes both types of labels for every variable, making your data self-documenting and your analysis reproducible.
Step 6: Building the Statistics Functions
Let’s create functions that compute appropriate statistics based on variable type.
def compute_continuous_stats(series):
"""
Compute summary statistics for a continuous variable.
Parameters:
-----------
series : pandas Series
The variable to summarize
Returns:
--------
dict with statistical summaries
"""
return {
'n_valid': int(series.notna().sum()),
'n_missing': int(series.isna().sum()),
'pct_missing': round(series.isna().mean() * 100, 1),
'mean': round(series.mean(), 2),
'std': round(series.std(), 2),
'median': round(series.median(), 2),
'min': int(series.min()) if series.notna().any() else None,
'max': int(series.max()) if series.notna().any() else None
}
def compute_categorical_stats(series):
"""
Compute frequency distribution for a categorical variable.
Parameters:
-----------
series : pandas Series
The variable to summarize
Returns:
--------
dict with frequency information
"""
value_counts = series.value_counts(dropna=True)
pct_counts = series.value_counts(normalize=True, dropna=True) * 100
categories = []
for value in value_counts.index:
categories.append({
'value': str(value),
'count': int(value_counts[value]),
'percent': round(pct_counts[value], 1)
})
return {
'n_valid': int(series.notna().sum()),
'n_missing': int(series.isna().sum()),
'pct_missing': round(series.isna().mean() * 100, 1),
'categories': categories
}
These functions take a pandas Series (a single column) and return dictionaries of computed statistics. Separating the logic into functions makes the code reusable and easier to test.
Notice that we handle missing values explicitly. The .notna() method identifies non-missing values, and dropna=True in value_counts() excludes missing from the percentages. This is important because “10% missing” means something different from “10% chose this response option.”
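Before wiring everything together, it is worth spot-checking the functions on a single column each:
# Quick spot-check of both helpers
print(compute_continuous_stats(df['age']))
print(compute_categorical_stats(df['gender']))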
Step 7: Defining Variable Metadata with Labels
Based on Step 5, now we create a dictionary that includes both variable labels and value labels for each variable:
# Define variables for the codebook
# Each entry: column_name -> {label, type, value_labels (optional)}
variable_config = {
'age': {
'label': 'Age in years',
'type': 'continuous',
'value_labels': None # Continuous variables don't have value labels
},
'gender': {
'label': 'Gender identity',
'type': 'categorical',
'value_labels': {
'Female': 'Identifies as female',
'Male': 'Identifies as male',
'Non-binary': 'Identifies as non-binary',
'Prefer not to say': 'Declined to respond'
}
},
'race_ethnicity': {
'label': 'Race/Ethnicity (self-reported)',
'type': 'categorical',
'value_labels': {
'White': 'White, non-Hispanic',
'Black or African American': 'Black or African American, non-Hispanic',
'Hispanic or Latino': 'Hispanic or Latino, any race',
'Asian': 'Asian, non-Hispanic',
'Multiracial': 'Two or more races',
'Other': 'Other race/ethnicity'
}
},
'primary_language': {
'label': 'Primary language spoken at home',
'type': 'categorical',
'value_labels': {
'English': 'English',
'Spanish': 'Spanish',
'Korean': 'Korean',
'Mandarin': 'Mandarin Chinese',
'Arabic': 'Arabic',
'Other': 'Other language'
}
},
'education': {
'label': 'Highest education level completed',
'type': 'categorical',
'value_labels': {
'Less than high school': 'Did not complete high school',
'High school diploma/GED': 'High school graduate or equivalent',
'Some college': 'Some college, no degree',
"Bachelor's degree": 'Four-year college degree',
'Graduate degree': 'Master\'s, doctoral, or professional degree'
}
},
'employment_status': {
'label': 'Current employment status',
'type': 'categorical',
'value_labels': {
'Employed full-time': 'Working 35+ hours per week',
'Employed part-time': 'Working less than 35 hours per week',
'Unemployed': 'Not employed, seeking work',
'Disabled': 'Not employed due to disability',
'Retired': 'Retired from work',
'Student': 'Full-time student'
}
},
'phq9_baseline': {
'label': 'PHQ-9 depression score at intake (0-27)',
'type': 'continuous',
'value_labels': {
'0-4': 'Minimal depression',
'5-9': 'Mild depression',
'10-14': 'Moderate depression',
'15-19': 'Moderately severe depression',
'20-27': 'Severe depression'
} # Clinical interpretation ranges, not raw value labels
},
'phq9_followup': {
'label': 'PHQ-9 depression score at follow-up (0-27)',
'type': 'continuous',
'value_labels': {
'0-4': 'Minimal depression',
'5-9': 'Mild depression',
'10-14': 'Moderate depression',
'15-19': 'Moderately severe depression',
'20-27': 'Severe depression'
}
},
'gad7_baseline': {
'label': 'GAD-7 anxiety score at intake (0-21)',
'type': 'continuous',
'value_labels': {
'0-4': 'Minimal anxiety',
'5-9': 'Mild anxiety',
'10-14': 'Moderate anxiety',
'15-21': 'Severe anxiety'
}
},
'sessions_attended': {
'label': 'Number of therapy sessions attended',
'type': 'continuous',
'value_labels': None
},
'service_type': {
'label': 'Primary service received',
'type': 'categorical',
'value_labels': {
'Individual therapy': 'One-on-one therapy sessions',
'Group therapy': 'Therapy in a group setting',
'Case management': 'Care coordination and resource linkage',
'Psychiatric services': 'Medication management with psychiatrist',
'Crisis intervention': 'Emergency mental health services'
}
},
'insurance_type': {
'label': 'Insurance/payment type',
'type': 'categorical',
'value_labels': {
'Medicaid': 'Medicaid (state insurance for low-income)',
'Medicare': 'Medicare (federal insurance for 65+ or disabled)',
'Private insurance': 'Employer-sponsored or individual private plan',
'Uninsured/Self-pay': 'No insurance, paying out of pocket',
'Other public': 'Other government program (VA, CHIP, etc.)'
}
}
}
This configuration dictionary serves as the single source of truth for your codebook. The variable label describes what is being measured. The value labels explain what each category means or, for clinical scales, provide interpretation guidelines. You can build this dictionary by hand in a spreadsheet or directly in Python, but LLMs can help generate it from your questionnaire or data dictionary. First, extract variable names and unique values from your data:
# Get variable names
print(df.columns.tolist())
# Get unique values for each categorical variable
for col in df.select_dtypes(include='object').columns:
print(f"{col}: {df[col].unique().tolist()}")Here is a prompt example to create the dictionary command for codebook based on the questionnaries:
Generate a Python dictionary called `variable_config` for creating a codebook.
**Variable names from my data:** [add your variable list from the code above]
**Unique values in categorical variables:**
gender: [add unique values from the code above]
...
**Questionnaire/instrument (if applicable):**
[Paste or attach your survey instrument here]
For each variable, include:
- 'label': human-readable description
- 'type': 'continuous' or 'categorical'
- 'value_labels': dictionary mapping each value to its description
Output as Python dictionary format.
Important: the LLM output is a starting point. Because it can hallucinate, always verify the labels against your actual documentation.
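A quick way to catch mismatches is to compare the generated dictionary against the actual DataFrame before going further; a minimal sketch:
# Verify the generated config against the actual data
missing_cols = [v for v in variable_config if v not in df.columns]
print("In config but not in data:", missing_cols)
undocumented = [c for c in df.columns
                if c not in variable_config and c != 'client_id']
print("In data but not in config:", undocumented)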
Step 8: Computing All Statistics
Let’s loop through our variable configuration and compute statistics for each:
# Compute statistics for all variables
codebook_entries = []
for var_name, config in variable_config.items():
entry = {
'name': var_name,
'label': config['label'],
'type': config['type'],
'value_labels': config['value_labels']
}
if config['type'] == 'continuous':
stats = compute_continuous_stats(df[var_name])
entry.update(stats)
entry['categories'] = None
else:
stats = compute_categorical_stats(df[var_name])
entry.update(stats)
entry['mean'] = None
entry['std'] = None
entry['median'] = None
entry['min'] = None
entry['max'] = None
codebook_entries.append(entry)
print(f"✓ Processed: {var_name}")
print(f"\nCodebook contains {len(codebook_entries)} variables")The entry.update(stats) line merges the computed statistics into our entry dictionary. After this loop, codebook_entries is a list of dictionaries, each containing complete information about one variable including its labels.
Step 9: Creating the Word Document
Now we turn these statistics into a professional document. The python-docx library lets us create Word files with headers, paragraphs, and tables.
Understanding python-docx Structure
A Word document has a hierarchy: the Document contains paragraphs and tables, paragraphs contain “runs” of text with consistent formatting, and tables contain rows of cells.
# Basic example
doc = Document()
doc.add_heading('Title', level=0)
doc.add_paragraph('This is a paragraph.')
doc.save('example.docx')
Here’s the complete function to generate our codebook, now including value labels:
def create_codebook_document(codebook_entries, title, description, output_path):
"""
Generate a professional codebook as a Word document.
Parameters:
-----------
codebook_entries : list of dicts
Variable information from compute functions
title : str
Document title
description : str
Dataset description for the overview section
output_path : str
Where to save the .docx file
"""
doc = Document()
# ===== TITLE =====
title_para = doc.add_heading('Codebook', level=0)
title_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
subtitle = doc.add_paragraph(title)
subtitle.alignment = WD_ALIGN_PARAGRAPH.CENTER
date_para = doc.add_paragraph(f'Generated: {datetime.now().strftime("%B %d, %Y")}')
date_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
doc.add_paragraph() # Spacing
# ===== OVERVIEW =====
doc.add_heading('Overview', level=1)
doc.add_paragraph(description)
# Dataset summary stats
n_obs = codebook_entries[0]['n_valid'] + codebook_entries[0]['n_missing']
n_vars = len(codebook_entries)
summary_text = f"The dataset contains {n_obs:,} observations and {n_vars} variables."
doc.add_paragraph(summary_text)
# ===== VARIABLE SUMMARY TABLE =====
doc.add_heading('Variable Summary', level=1)
summary_table = doc.add_table(rows=1, cols=4)
summary_table.style = 'Table Grid'
# Header row
headers = ['Variable', 'Label', 'Type', 'Missing']
header_cells = summary_table.rows[0].cells
for i, header in enumerate(headers):
header_cells[i].text = header
for run in header_cells[i].paragraphs[0].runs:
run.bold = True
# Data rows
for entry in codebook_entries:
row_cells = summary_table.add_row().cells
row_cells[0].text = entry['name']
row_cells[1].text = entry['label']
row_cells[2].text = entry['type'].capitalize()
row_cells[3].text = f"{entry['pct_missing']}%"
# Page break before detailed section
doc.add_page_break()
# ===== DETAILED VARIABLE DESCRIPTIONS =====
doc.add_heading('Detailed Variable Descriptions', level=1)
for entry in codebook_entries:
# Variable heading
doc.add_heading(f"{entry['name']}", level=2)
# Variable label (the human-readable description)
label_para = doc.add_paragraph()
label_run = label_para.add_run('Variable Label: ')
label_run.bold = True
label_para.add_run(entry['label'])
# Basic metadata
meta_text = (
f"Type: {entry['type'].capitalize()} | "
f"Valid N: {entry['n_valid']:,} | "
f"Missing: {entry['n_missing']:,} ({entry['pct_missing']}%)"
)
meta_para = doc.add_paragraph(meta_text)
meta_para.runs[0].italic = True
if entry['type'] == 'continuous':
# Statistics for continuous variables
stats_text = (
f"Mean: {entry['mean']} | "
f"SD: {entry['std']} | "
f"Median: {entry['median']} | "
f"Range: {entry['min']} – {entry['max']}"
)
doc.add_paragraph(stats_text)
# Value labels for continuous (clinical interpretation ranges)
if entry['value_labels']:
doc.add_paragraph()
interp_heading = doc.add_paragraph()
interp_run = interp_heading.add_run('Clinical Interpretation:')
interp_run.bold = True
interp_table = doc.add_table(rows=1, cols=2)
interp_table.style = 'Table Grid'
interp_table.rows[0].cells[0].text = 'Score Range'
interp_table.rows[0].cells[1].text = 'Interpretation'
for run in interp_table.rows[0].cells[0].paragraphs[0].runs:
run.bold = True
for run in interp_table.rows[0].cells[1].paragraphs[0].runs:
run.bold = True
for range_val, interpretation in entry['value_labels'].items():
row = interp_table.add_row().cells
row[0].text = range_val
row[1].text = interpretation
else:
# Frequency table for categorical variables with value labels
if entry['categories']:
doc.add_paragraph()
freq_heading = doc.add_paragraph()
freq_run = freq_heading.add_run('Value Labels and Frequencies:')
freq_run.bold = True
# Determine if we have value labels to add a description column
has_value_labels = entry['value_labels'] is not None
n_cols = 4 if has_value_labels else 3
freq_table = doc.add_table(rows=1, cols=n_cols)
freq_table.style = 'Table Grid'
# Header
freq_headers = ['Value', 'Description', 'Count', 'Percent'] if has_value_labels else ['Value', 'Count', 'Percent']
freq_header_cells = freq_table.rows[0].cells
for i, h in enumerate(freq_headers):
freq_header_cells[i].text = h
for run in freq_header_cells[i].paragraphs[0].runs:
run.bold = True
# Data rows
for cat in entry['categories']:
row_cells = freq_table.add_row().cells
row_cells[0].text = cat['value']
if has_value_labels:
# Get the value label description
description = entry['value_labels'].get(cat['value'], '')
row_cells[1].text = description
row_cells[2].text = f"{cat['count']:,}"
row_cells[3].text = f"{cat['percent']}%"
else:
row_cells[1].text = f"{cat['count']:,}"
row_cells[2].text = f"{cat['percent']}%"
doc.add_paragraph() # Spacing between variables
# ===== SAVE =====
doc.save(output_path)
print(f"✓ Codebook saved to: {output_path}")Now let’s generate the document:
# Create the codebook
output_path = 'mental_health_codebook.docx'
description = (
"This codebook documents variables from a community mental health program "
"evaluation. Data were collected from clients receiving services between "
"January and December 2024. The dataset includes demographic information, "
"clinical assessments (PHQ-9 for depression, GAD-7 for anxiety), and "
"service utilization records."
)
create_codebook_document(
codebook_entries=codebook_entries,
title='Community Mental Health Program\nClient Services Dataset 2024',
description=description,
output_path=output_path
)
The Complete Code
Here’s everything assembled into a single workflow you can copy and run:
# ================================================================
# CODEBOOK GENERATOR
# Community Mental Health Program Evaluation
# ================================================================
# ----- SETUP -----
from google.colab import drive
drive.mount('/content/drive')
!pip install python-docx -q
import pandas as pd
import numpy as np
from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
from datetime import datetime
# ----- LOAD DATA -----
df = pd.read_csv('/content/drive/MyDrive/client_data.csv')
print(f"Dataset: {df.shape[0]} clients, {df.shape[1]} variables")
# ----- STATISTICS FUNCTIONS -----
def compute_continuous_stats(series):
return {
'n_valid': int(series.notna().sum()),
'n_missing': int(series.isna().sum()),
'pct_missing': round(series.isna().mean() * 100, 1),
'mean': round(series.mean(), 2),
'std': round(series.std(), 2),
'median': round(series.median(), 2),
'min': int(series.min()) if series.notna().any() else None,
'max': int(series.max()) if series.notna().any() else None
}
def compute_categorical_stats(series):
vc = series.value_counts(dropna=True)
pct = series.value_counts(normalize=True, dropna=True) * 100
categories = [{'value': str(v), 'count': int(vc[v]), 'percent': round(pct[v], 1)}
for v in vc.index]
return {
'n_valid': int(series.notna().sum()),
'n_missing': int(series.isna().sum()),
'pct_missing': round(series.isna().mean() * 100, 1),
'categories': categories
}
# ----- VARIABLE CONFIGURATION WITH LABELS -----
variable_config = {
'age': {
'label': 'Age in years',
'type': 'continuous',
'value_labels': None
},
'gender': {
'label': 'Gender identity',
'type': 'categorical',
'value_labels': {
'Female': 'Identifies as female',
'Male': 'Identifies as male',
'Non-binary': 'Identifies as non-binary',
'Prefer not to say': 'Declined to respond'
}
},
'race_ethnicity': {
'label': 'Race/Ethnicity (self-reported)',
'type': 'categorical',
'value_labels': {
'White': 'White, non-Hispanic',
'Black or African American': 'Black or African American, non-Hispanic',
'Hispanic or Latino': 'Hispanic or Latino, any race',
'Asian': 'Asian, non-Hispanic',
'Multiracial': 'Two or more races',
'Other': 'Other race/ethnicity'
}
},
'primary_language': {
'label': 'Primary language spoken at home',
'type': 'categorical',
'value_labels': {
'English': 'English',
'Spanish': 'Spanish',
'Korean': 'Korean',
'Mandarin': 'Mandarin Chinese',
'Arabic': 'Arabic',
'Other': 'Other language'
}
},
'education': {
'label': 'Highest education level completed',
'type': 'categorical',
'value_labels': {
'Less than high school': 'Did not complete high school',
'High school diploma/GED': 'High school graduate or equivalent',
'Some college': 'Some college, no degree',
"Bachelor's degree": 'Four-year college degree',
'Graduate degree': "Master's, doctoral, or professional degree"
}
},
'employment_status': {
'label': 'Current employment status',
'type': 'categorical',
'value_labels': {
'Employed full-time': 'Working 35+ hours per week',
'Employed part-time': 'Working less than 35 hours per week',
'Unemployed': 'Not employed, seeking work',
'Disabled': 'Not employed due to disability',
'Retired': 'Retired from work',
'Student': 'Full-time student'
}
},
'phq9_baseline': {
'label': 'PHQ-9 depression score at intake (0-27)',
'type': 'continuous',
'value_labels': {
'0-4': 'Minimal depression',
'5-9': 'Mild depression',
'10-14': 'Moderate depression',
'15-19': 'Moderately severe depression',
'20-27': 'Severe depression'
}
},
'phq9_followup': {
'label': 'PHQ-9 depression score at follow-up (0-27)',
'type': 'continuous',
'value_labels': {
'0-4': 'Minimal depression',
'5-9': 'Mild depression',
'10-14': 'Moderate depression',
'15-19': 'Moderately severe depression',
'20-27': 'Severe depression'
}
},
'gad7_baseline': {
'label': 'GAD-7 anxiety score at intake (0-21)',
'type': 'continuous',
'value_labels': {
'0-4': 'Minimal anxiety',
'5-9': 'Mild anxiety',
'10-14': 'Moderate anxiety',
'15-21': 'Severe anxiety'
}
},
'sessions_attended': {
'label': 'Number of therapy sessions attended',
'type': 'continuous',
'value_labels': None
},
'service_type': {
'label': 'Primary service received',
'type': 'categorical',
'value_labels': {
'Individual therapy': 'One-on-one therapy sessions',
'Group therapy': 'Therapy in a group setting',
'Case management': 'Care coordination and resource linkage',
'Psychiatric services': 'Medication management with psychiatrist',
'Crisis intervention': 'Emergency mental health services'
}
},
'insurance_type': {
'label': 'Insurance/payment type',
'type': 'categorical',
'value_labels': {
'Medicaid': 'Medicaid (state insurance for low-income)',
'Medicare': 'Medicare (federal insurance for 65+ or disabled)',
'Private insurance': 'Employer-sponsored or individual private plan',
'Uninsured/Self-pay': 'No insurance, paying out of pocket',
'Other public': 'Other government program (VA, CHIP, etc.)'
}
}
}
# ----- COMPUTE STATISTICS -----
codebook_entries = []
for var_name, config in variable_config.items():
entry = {
'name': var_name,
'label': config['label'],
'type': config['type'],
'value_labels': config['value_labels']
}
if config['type'] == 'continuous':
entry.update(compute_continuous_stats(df[var_name]))
entry['categories'] = None
else:
entry.update(compute_categorical_stats(df[var_name]))
entry['mean'] = entry['std'] = entry['median'] = None
entry['min'] = entry['max'] = None
codebook_entries.append(entry)
print(f"✓ {var_name}")
# ----- CREATE WORD DOCUMENT -----
def create_codebook_document(entries, title, description, output_path):
doc = Document()
# Title page
doc.add_heading('Codebook', level=0).alignment = WD_ALIGN_PARAGRAPH.CENTER
doc.add_paragraph(title).alignment = WD_ALIGN_PARAGRAPH.CENTER
doc.add_paragraph(f'Generated: {datetime.now().strftime("%B %d, %Y")}').alignment = WD_ALIGN_PARAGRAPH.CENTER
doc.add_paragraph()
# Overview
doc.add_heading('Overview', level=1)
doc.add_paragraph(description)
n_obs = entries[0]['n_valid'] + entries[0]['n_missing']
doc.add_paragraph(f"The dataset contains {n_obs:,} observations and {len(entries)} variables.")
# Summary table
doc.add_heading('Variable Summary', level=1)
table = doc.add_table(rows=1, cols=4)
table.style = 'Table Grid'
for i, h in enumerate(['Variable', 'Label', 'Type', 'Missing']):
table.rows[0].cells[i].text = h
table.rows[0].cells[i].paragraphs[0].runs[0].bold = True
for e in entries:
row = table.add_row().cells
row[0].text, row[1].text = e['name'], e['label']
row[2].text, row[3].text = e['type'].capitalize(), f"{e['pct_missing']}%"
doc.add_page_break()
# Detailed descriptions
doc.add_heading('Detailed Variable Descriptions', level=1)
for e in entries:
doc.add_heading(e['name'], level=2)
# Variable label
label_para = doc.add_paragraph()
label_para.add_run('Variable Label: ').bold = True
label_para.add_run(e['label'])
# Metadata
meta = f"Type: {e['type'].capitalize()} | Valid N: {e['n_valid']:,} | Missing: {e['n_missing']:,} ({e['pct_missing']}%)"
doc.add_paragraph(meta).runs[0].italic = True
if e['type'] == 'continuous':
doc.add_paragraph(f"Mean: {e['mean']} | SD: {e['std']} | Median: {e['median']} | Range: {e['min']} – {e['max']}")
# Clinical interpretation for scales
if e['value_labels']:
doc.add_paragraph()
doc.add_paragraph().add_run('Clinical Interpretation:').bold = True
it = doc.add_table(rows=1, cols=2)
it.style = 'Table Grid'
it.rows[0].cells[0].text = 'Score Range'
it.rows[0].cells[1].text = 'Interpretation'
it.rows[0].cells[0].paragraphs[0].runs[0].bold = True
it.rows[0].cells[1].paragraphs[0].runs[0].bold = True
for rng, interp in e['value_labels'].items():
r = it.add_row().cells
r[0].text, r[1].text = rng, interp
elif e['categories']:
doc.add_paragraph()
doc.add_paragraph().add_run('Value Labels and Frequencies:').bold = True
has_vl = e['value_labels'] is not None
ft = doc.add_table(rows=1, cols=4 if has_vl else 3)
ft.style = 'Table Grid'
hdrs = ['Value', 'Description', 'Count', 'Percent'] if has_vl else ['Value', 'Count', 'Percent']
for i, h in enumerate(hdrs):
ft.rows[0].cells[i].text = h
ft.rows[0].cells[i].paragraphs[0].runs[0].bold = True
for cat in e['categories']:
r = ft.add_row().cells
r[0].text = cat['value']
if has_vl:
r[1].text = e['value_labels'].get(cat['value'], '')
r[2].text, r[3].text = f"{cat['count']:,}", f"{cat['percent']}%"
else:
r[1].text, r[2].text = f"{cat['count']:,}", f"{cat['percent']}%"
doc.add_paragraph()
doc.save(output_path)
print(f"\n✓ Codebook saved: {output_path}")
# ----- GENERATE -----
output_path = '/content/drive/MyDrive/mental_health_codebook.docx'
description = (
"This codebook documents variables from a community mental health program "
"evaluation. Data were collected from clients receiving services between "
"January and December 2024. Variables include demographics, clinical "
"assessments (PHQ-9, GAD-7), and service utilization."
)
create_codebook_document(codebook_entries,
'Community Mental Health Program\nClient Services Dataset 2024',
description, output_path)
This code generates the codebook document in MS Word format, as shown below:
[Screenshots of the generated codebook document]
Troubleshooting
“KeyError: variable not found”: The variable name in your configuration doesn’t match the DataFrame column. Check spelling and capitalization with df.columns.tolist().
Value labels don’t match actual data values: If your data has values that aren’t in your value_labels dictionary, those rows will show empty descriptions. Make sure every possible value in your data has a corresponding label.
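A small check, assuming the variable_config from Step 7, lists any observed values that lack a label:
# Find observed values that have no entry in value_labels
for var, cfg in variable_config.items():
    if cfg['type'] == 'categorical' and cfg['value_labels']:
        unlabeled = set(df[var].dropna().astype(str)) - set(cfg['value_labels'])
        if unlabeled:
            print(f"{var}: unlabeled values {unlabeled}")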
Word document won’t open: Corrupted XML from python-docx is rare but possible. Try a simpler document first to isolate the problem. Make sure all table rows have the correct number of cells.
Missing data percentages seem wrong: Remember that categorical percentages exclude missing values (they show distribution among valid responses only). This is standard practice but worth noting in your codebook.
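If you instead want percentages computed over all rows, with missing shown as its own row, value_counts can do that directly:
# Percentages over all rows, counting missing (NaN) as its own category
print(df['gender'].value_counts(dropna=False, normalize=True) * 100)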
Bonus: Using LLMs to Draft a Complete Codebook
Instead of manually writing the variable_config dictionary and statistics functions, you can ask an LLM to generate the complete Python code. First, extract your data structure:
# Generate summary for LLM
print("=== DATA SUMMARY FOR CODEBOOK ===\n")
print(f"Dataset: {df.shape[0]} rows, {df.shape[1]} columns\n")
for col in df.columns:
print(f"--- {col} ---")
print(f"Type: {df[col].dtype}")
print(f"Missing: {df[col].isna().sum()} ({df[col].isna().mean()*100:.1f}%)")
if df[col].dtype in ['int64', 'float64']:
print(f"Mean: {df[col].mean():.2f}, SD: {df[col].std():.2f}")
print(f"Range: {df[col].min()} - {df[col].max()}")
else:
print(f"Values: {df[col].value_counts().to_dict()}")
print()
Here is an example prompt using the information extracted above:
Prompt:
Write Python code that creates a codebook Word document for my dataset.
**Data structure:**
Filename: client_data.csv
Shape: 200 rows, 13 columns
Columns and unique values:
[Copy and paste the output from the above code]
**Questionnaire/instruments:**
[Add further information about specific scales, if applicable]
**Requirements:**
1. Load data from CSV
2. Define variable_config with labels and value_labels for all variables
3. Compute descriptive statistics (mean/SD/median/range for continuous, frequencies for categorical)
4. Export to Word document using python-docx with:
- Title page
- Variable summary table
- Detailed variable descriptions with statistics
Write complete, runnable code for Google Colab.
The LLM will generate complete code customized to your specific variables and value labels.
Resources
python-docx Documentation: https://python-docx.readthedocs.io/