[Python] Data Wrangling and Variable Manipulation using pandas

Raw data is rarely ready for analysis. Survey responses have missing values. Administrative records use codes that need translation. Variables exist in forms that don’t match your analytical needs—age as a continuous number when you need age groups, nine separate PHQ-9 items when you need a total score.

Data wrangling refers to the process between “I have data” and “I can analyze it.” This includes cleaning (handling missing values, fixing errors), transforming (converting variable types, creating categories), and deriving new variables from existing ones. In practice, a commonly cited estimate is that 60-80% of data work happens here.

For example, you receive client intake data with PHQ-9 responses stored as nine separate columns. Your funder wants depression rates by age group. You need to (1) sum the nine items into a total score, (2) apply clinical cutoffs, and (3) create age groups. That’s data wrangling.
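
To preview the destination, here is a minimal sketch of that pipeline on toy data (the column names and the size of the toy dataset are illustrative); each step is unpacked in the sections below.

import numpy as np
import pandas as pd

# Toy intake data: one row per client, nine PHQ-9 items plus age
rng = np.random.default_rng(0)
phq9_cols = [f'phq9_q{i}' for i in range(1, 10)]
df = pd.DataFrame(rng.integers(0, 4, size=(8, 9)), columns=phq9_cols)
df['age'] = [22, 27, 34, 41, 48, 53, 60, 65]

# (1) Sum the nine items into a total score (NaN if any item is missing)
df['phq9_total'] = df[phq9_cols].sum(axis=1, skipna=False)

# (2) Apply a clinical cutoff (PHQ-9 >= 10 is the standard screening threshold)
df['depressed'] = (df['phq9_total'] >= 10).astype(int)

# (3) Create age groups, then report depression rates by group
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 40, 55, 100],
                         labels=['18-25', '26-40', '41-55', '56+'])
print(df.groupby('age_group', observed=True)['depressed'].mean())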

This post covers variable types, variable transformation, composite variable creation, and missing data handling.

Understanding Variable Types

The type of variable determines what operations are valid, what statistics you can calculate, and how to visualize it.

| Variable Type | Description | Examples | Statistics |
| --- | --- | --- | --- |
| Continuous | Numeric values with meaningful differences and ratios | Age (18, 25, 34), Income ($32,500), PHQ-9 score (0-27) | Mean, median, standard deviation, range |
| Ordinal | Categories with a natural order, but unequal intervals | Education level (HS < College < Graduate), Depression severity (Minimal < Mild < Moderate) | Median, mode, percentiles |
| Nominal | Categories without inherent order | Gender (Male, Female, Non-binary), Service type (Counseling, Housing) | Mode, frequencies |
| Binary | Two mutually exclusive categories | Employed (Yes/No), Has insurance (0/1) | Proportions, odds ratios |

📍 You can move down this hierarchy (continuous → ordinal → binary), but not up. Once you create age groups, you cannot recover original ages. Keep data at the highest level until you have a reason to convert.

This hierarchy matters because it determines your analysis (and visualization) options. You might need age as continuous for one analysis and as age groups for another.

Variable Transformation

| Operation | Description | Example |
| --- | --- | --- |
| Type conversion | Continuous → Ordinal → Binary | Age → Age groups → Young adult (Y/N); PHQ-9 score → Severity level |
| Aggregation | Combine multiple items into one score | PHQ-9 Q1-Q9 → Total score |
| Co-occurrence | Flag cases meeting multiple conditions | Unemployed AND Uninsured → Vulnerable |
| Missing handling | Drop, impute, or flag missing values | Income missing → Median imputation |

(1) Type Conversion

Converting continuous variables to categorical or binary is one of the most common data wrangling tasks. You might need age groups for stratified analysis, or a binary flag to identify a target population. The key is choosing appropriate cutoffs: these should be driven by theory, clinical guidelines, or your research question, not arbitrary decisions.

A. Continuous → Ordinal

There are several approaches to converting continuous variables into ordinal categories. You might use equal-width bins (every 10 years of age), quantile-based bins (quartiles, so each group has roughly 25% of cases), or theory-driven cutoffs (clinical thresholds established in the literature). The right choice depends on your research question and how the variable will be used in analysis.
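
For illustration, here is a sketch of all three approaches on a toy age column (the cut points below are illustrative, not clinical):

import pandas as pd

ages = pd.Series([19, 23, 28, 35, 44, 52, 61, 70], name='age')

# Equal-width bins: four bins of equal span across the observed range
equal_width = pd.cut(ages, bins=4)

# Quantile-based bins: quartiles, so each group holds ~25% of cases
quartiles = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Theory-driven cutoffs: edges you choose from guidelines or the literature
theory = pd.cut(ages, bins=[18, 25, 40, 55, 100],
                labels=['18-25', '26-40', '41-55', '56+'])

print(pd.DataFrame({'age': ages, 'equal_width': equal_width,
                    'quartile': quartiles, 'theory': theory}))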

Example: Age → Age groups

pd.cut() bins continuous values into ordered categories. You specify the bin edges and labels. The function assigns each value to the appropriate bin.

# Bin edges are right-inclusive by default: (0, 25], (25, 40], (40, 55], (55, 100]
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 25, 40, 55, 100],
                         labels=['18-25', '26-40', '41-55', '56+'])

Example: PHQ-9 total → Severity categories

Continuous scores often need to be translated into meaningful categories for interpretation or reporting. Clinical measures typically have established cutoffs—PHQ-9 uses 5/10/15/20 to define severity levels; BMI uses 18.5/25/30 to define weight categories.

.apply() runs a custom function on every value in a column. You define the categorization logic once, and pandas handles the iteration. This approach is flexible—you can implement any conditional logic you need.

def phq9_severity(score):
    # Keep missing scores missing rather than mislabeling them 'Minimal'
    if pd.isna(score):
        return np.nan
    elif score <= 4:
        return 'Minimal'
    elif score <= 9:
        return 'Mild'
    elif score <= 14:
        return 'Moderate'
    elif score <= 19:
        return 'Moderately Severe'
    else:
        return 'Severe'

df['phq9_severity'] = df['phq9_total'].apply(phq9_severity)

B. Continuous → Binary

A boolean condition returns True/False values, which .astype(int) converts to 1/0. This is useful for creating eligibility flags, clinical thresholds, or grouping variables for comparison.

Example: Age → Young adult flag

df['is_young_adult'] = (df['age'] <= 30).astype(int)

(2) Aggregation: Creating Composite Scores

Many standardized measures consist of multiple items that need to be combined into a total or subscale score. The PHQ-9 has 9 items; the GAD-7 has 7. You rarely analyze individual items—instead, you sum (or average) them into a composite score that represents the underlying construct.

.sum(axis=1) adds values across columns for each row. By default pandas skips NaN, which would quietly treat missing items as zero; pass skipna=False to return NaN whenever any item is missing. If you need an average instead, use .mean(axis=1).

Example: PHQ-9 items → Total score

phq9_cols = [f'phq9_q{i}' for i in range(1, 10)]  # phq9_q1 ... phq9_q9
df['phq9_total'] = df[phq9_cols].sum(axis=1, skipna=False)

Note: With skipna=False, the sum returns NaN if any item is missing. This is intentional: a partial PHQ-9 score is not clinically valid. Handle missing items before aggregation if you need a different behavior.

(3) Using Multiple Conditions

Sometimes a single variable isn’t enough to identify your population of interest. You might need to flag clients who are both unemployed AND uninsured, or those with both elevated symptoms AND a specific risk factor. These compound conditions help identify subgroups that face multiple challenges simultaneously.

Use & for AND (both conditions must be true) and | for OR (either condition can be true).

📍 Parentheses around each condition are required because & and | bind more tightly than comparison operators such as == and >=.

Example 1: Vulnerable population flag

# Unemployed AND uninsured
df['vulnerable'] = ((df['employed'] == 0) & (df['has_insurance'] == 0)).astype(int)

Example 2: High-risk flag (PHQ-9 Q9 suicidal ideation + severe symptoms)

df['high_risk'] = ((df['phq9_q9'] >= 2) & (df['phq9_total'] >= 15)).astype(int)

Example 3: Race/Ethnicity

Federal datasets typically collect Hispanic/Latino ethnicity separately from race. When you need a single mutually exclusive race/ethnicity variable, Hispanic ethnicity usually takes precedence:

def create_race_ethnicity(row):
    if row['hispanic'] == 1:
        return 'Hispanic/Latino'
    elif row['race'] == 1:
        return 'White, non-Hispanic'
    elif row['race'] == 2:
        return 'Black, non-Hispanic'
    elif row['race'] == 3:
        return 'Asian, non-Hispanic'
    else:
        return 'Other/Multiple, non-Hispanic'

df['race_ethnicity'] = df.apply(create_race_ethnicity, axis=1)
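
Row-wise .apply() is flexible but slow on large files. If performance matters, the same first-match logic can be written in vectorized form with numpy's np.select, which evaluates the conditions in order just like the if/elif chain above (a sketch, assuming the same hispanic and race codes):

import numpy as np

conditions = [
    df['hispanic'] == 1,
    df['race'] == 1,
    df['race'] == 2,
    df['race'] == 3,
]
choices = ['Hispanic/Latino',
           'White, non-Hispanic',
           'Black, non-Hispanic',
           'Asian, non-Hispanic']

# The first matching condition wins, so Hispanic ethnicity takes precedence
df['race_ethnicity'] = np.select(conditions, choices,
                                 default='Other/Multiple, non-Hispanic')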

Example 4: Immigrant Generation

Combining nativity (birthplace) and age at migration creates an ordinal variable that captures both migration status and developmental timing of arrival:

def create_immigrant_generation(row):
    if row['foreign_born'] == 0:
        return '2nd+ generation'  # US-born
    elif row['age_at_migration'] <= 12:
        return '1.5 generation'   # Foreign-born, arrived as child
    else:
        return '1st generation'   # Foreign-born, arrived as adult

df['immigrant_gen'] = df.apply(create_immigrant_generation, axis=1)

The ordering here matters: 1st generation → 1.5 generation → 2nd+ generation reflects increasing exposure to US institutions during formative years.
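
To make pandas respect that ordering in sorts, groupbys, and plots, you can store the variable as an ordered categorical (a sketch, continuing the example above):

df['immigrant_gen'] = pd.Categorical(
    df['immigrant_gen'],
    categories=['1st generation', '1.5 generation', '2nd+ generation'],
    ordered=True)

# Sorting and comparisons now follow the substantive order, not alphabetical order
df.sort_values('immigrant_gen')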

(4) Handling Missing Values

Missing data is inevitable in real-world datasets. How you handle it depends on why it’s missing and what you’re trying to do. Before choosing a strategy, examine the pattern: How much is missing? Which variables? Is missingness random or systematic?
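
A quick way to answer the “how much and which variables” questions is to count NaNs per column and inspect the incomplete rows (a minimal sketch on a toy DataFrame):

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, 34, np.nan, 51],
                   'income': [32500, np.nan, np.nan, 41000],
                   'phq9_total': [7, 12, 3, np.nan]})

# Count and proportion of missing values per column
print(df.isna().sum())
print(df.isna().mean().round(2))

# Rows with at least one missing value
print(df[df.isna().any(axis=1)])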

Strategy 1: Drop incomplete cases

Use this when partial data is invalid (e.g., a PHQ-9 total requires all 9 items). This method is known as “listwise deletion”: removing any row that has a missing value in any of the specified columns.

df_complete = df.dropna(subset=phq9_cols)

Strategy 2: Imputation

Replace missing values with a measure of central tendency, such as the median. Use this when the amount of missingness is small and the data are plausibly missing at random. Please read this post to learn more about handling missing data with imputation.

df['income'] = df['income'].fillna(df['income'].median())
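
A single overall median can blur subgroup differences; one common refinement is to impute within groups, such as the median income within each age group (a sketch, assuming the age_group column created earlier):

# Impute each client's missing income with the median of their age group
df['income'] = df['income'].fillna(
    df.groupby('age_group', observed=True)['income'].transform('median'))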

Strategy 3: Missing indicator

Create a flag that records which values were missing. Build the flag before imputing, since imputation overwrites the very NaNs the flag is based on.

df['income_missing'] = df['income'].isnull().astype(int)

Same Variable, Multiple Representations

Different variable types enable different visualizations: a continuous variable suits a histogram, ordinal categories suit an ordered bar chart, and a binary flag reduces to a single proportion. Keep data at the highest level until you need to convert.
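
As a minimal illustration, here is one variable carried at three levels, each supporting different summaries and charts:

import pandas as pd

ages = pd.Series([19, 24, 31, 46, 58, 63], name='age')

# Continuous: full detail; supports means, medians, histograms
print(ages.describe())

# Ordinal: age groups; supports ordered bar charts and stratified tables
age_group = pd.cut(ages, bins=[0, 25, 40, 55, 100],
                   labels=['18-25', '26-40', '41-55', '56+'])
print(age_group.value_counts(sort=False))

# Binary: a flag; the mean of a 0/1 flag is the proportion of young adults
print((ages <= 30).astype(int).mean())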

Resources

pandas missing data: https://pandas.pydata.org/docs/user_guide/missing_data.html
