[Python] Data Wrangling and Variable Manipulation using pandas

Raw data is rarely ready for analysis. Survey responses have missing values. Administrative records use codes that need translation. Variables exist in forms that don’t match your analytical needs—age as a continuous number when you need age groups, nine separate PHQ-9 items when you need a total score.

Data wrangling refers to the process between “I have data” and “I can analyze it.” This includes cleaning (handling missing values, fixing errors), transforming (converting variable types, creating categories), and deriving new variables from existing ones. In practice, a commonly cited estimate is that 60-80% of data work happens here.

For example, you receive client intake data with PHQ-9 responses stored as nine separate columns. Your funder wants depression rates by age group. You need to (1) sum the nine items into a total score, (2) apply clinical cutoffs, and (3) create age groups. That’s data wrangling.
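
To preview the destination, here is a minimal sketch of that pipeline on toy data (the column names and the size of the toy dataset are illustrative); each step is unpacked in the sections below.

import numpy as np
import pandas as pd

# Toy intake data: one row per client, nine PHQ-9 items plus age
rng = np.random.default_rng(0)
phq9_cols = [f'phq9_q{i}' for i in range(1, 10)]
df = pd.DataFrame(rng.integers(0, 4, size=(8, 9)), columns=phq9_cols)
df['age'] = [22, 27, 34, 41, 48, 53, 60, 65]

# (1) Sum the nine items into a total score (NaN if any item is missing)
df['phq9_total'] = df[phq9_cols].sum(axis=1, skipna=False)

# (2) Apply a clinical cutoff (PHQ-9 >= 10 is the standard screening threshold)
df['depressed'] = (df['phq9_total'] >= 10).astype(int)

# (3) Create age groups, then report depression rates by group
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 40, 55, 100],
                         labels=['18-25', '26-40', '41-55', '56+'])
print(df.groupby('age_group', observed=True)['depressed'].mean())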

This post covers variable types, variable transformation, composite variable creation, and missing data handling.

Understanding Variable Types

The type of variable determines what operations are valid, what statistics you can calculate, and how to visualize it.

| Variable Type | Description | Examples | Statistics |
| --- | --- | --- | --- |
| Continuous | Numeric values with meaningful differences and ratios | Age (18, 25, 34), Income ($32,500), PHQ-9 score (0-27) | Mean, median, standard deviation, range |
| Ordinal | Categories with a natural order, but unequal intervals | Education level (HS < College < Graduate), Depression severity (Minimal < Mild < Moderate) | Median, mode, percentiles |
| Nominal | Categories without inherent order | Gender (Male, Female, Non-binary), Service type (Counseling, Housing) | Mode, frequencies |
| Binary | Two mutually exclusive categories | Employed (Yes/No), Has insurance (0/1) | Proportions, odds ratios |

📍 You can move down this hierarchy (continuous → ordinal → binary), but not up. Once you create age groups, you cannot recover original ages. Keep data at the highest level until you have a reason to convert.

This hierarchy matters because it determines your analysis (and visualization) options. You might need age as continuous for one analysis and as age groups for another.

Variable Transformation

| Operation | Description | Example |
| --- | --- | --- |
| Type conversion | Continuous → Ordinal → Binary | Age → Age groups → Young adult (Y/N); PHQ-9 score → Severity level |
| Aggregation | Combine multiple items into one score | PHQ-9 Q1-Q9 → Total score |
| Co-occurrence | Flag cases meeting multiple conditions | Unemployed AND Uninsured → Vulnerable |
| Missing handling | Drop, impute, or flag missing values | Income missing → Median imputation |

(1) Type Conversion

Converting continuous variables to categorical or binary is one of the most common data wrangling tasks. You might need age groups for stratified analysis, or a binary flag to identify a target population. The key is choosing appropriate cutoffs: these should be driven by theory, clinical guidelines, or your research question, not arbitrary decisions.

A. Continuous → Ordinal

There are several approaches to converting continuous variables into ordinal categories. You might use equal-width bins (every 10 years of age), quantile-based bins (quartiles, so each group has roughly 25% of cases), or theory-driven cutoffs (clinical thresholds established in the literature). The right choice depends on your research question and how the variable will be used in analysis.
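
For illustration, here is a sketch of all three approaches on a toy age column (the cut points below are illustrative, not clinical):

import pandas as pd

ages = pd.Series([19, 23, 28, 35, 44, 52, 61, 70], name='age')

# Equal-width bins: four bins of equal span across the observed range
equal_width = pd.cut(ages, bins=4)

# Quantile-based bins: quartiles, so each group holds ~25% of cases
quartiles = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Theory-driven cutoffs: edges you choose from guidelines or the literature
theory = pd.cut(ages, bins=[18, 25, 40, 55, 100],
                labels=['18-25', '26-40', '41-55', '56+'])

print(pd.DataFrame({'age': ages, 'equal_width': equal_width,
                    'quartile': quartiles, 'theory': theory}))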

Example: Age → Age groups

pd.cut() bins continuous values into ordered categories. You specify the bin edges and labels. The function assigns each value to the appropriate bin.

# Bin edges are right-inclusive by default: (0, 25], (25, 40], (40, 55], (55, 100]
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 25, 40, 55, 100],
                         labels=['18-25', '26-40', '41-55', '56+'])

Example: PHQ-9 total → Severity categories

Continuous scores often need to be translated into meaningful categories for interpretation or reporting. Clinical measures typically have established cutoffs—PHQ-9 uses 5/10/15/20 to define severity levels; BMI uses 18.5/25/30 to define weight categories.

.apply() runs a custom function on every value in a column. You define the categorization logic once, and pandas handles the iteration. This approach is flexible—you can implement any conditional logic you need.

def phq9_severity(score):
    # Keep missing scores missing rather than mislabeling them 'Minimal'
    if pd.isna(score):
        return np.nan
    elif score <= 4:
        return 'Minimal'
    elif score <= 9:
        return 'Mild'
    elif score <= 14:
        return 'Moderate'
    elif score <= 19:
        return 'Moderately Severe'
    else:
        return 'Severe'

df['phq9_severity'] = df['phq9_total'].apply(phq9_severity)

B. Continuous → Binary

A boolean condition returns True/False values, which .astype(int) converts to 1/0. This is useful for creating eligibility flags, clinical thresholds, or grouping variables for comparison.

Example: Age → Young adult flag

df['is_young_adult'] = (df['age'] <= 30).astype(int)

(2) Aggregation: Creating Composite Scores

Many standardized measures consist of multiple items that need to be combined into a total or subscale score. The PHQ-9 has 9 items; the GAD-7 has 7. You rarely analyze individual items—instead, you sum (or average) them into a composite score that represents the underlying construct.

.sum(axis=1) adds values across columns for each row. By default pandas skips NaN, which would quietly treat missing items as zero; pass skipna=False to return NaN whenever any item is missing. If you need an average instead, use .mean(axis=1).

Example: PHQ-9 items → Total score

phq9_cols = [f'phq9_q{i}' for i in range(1, 10)]  # phq9_q1 ... phq9_q9
df['phq9_total'] = df[phq9_cols].sum(axis=1, skipna=False)

Note: With skipna=False, the sum returns NaN if any item is missing. This is intentional: a partial PHQ-9 score is not clinically valid. Handle missing items before aggregation if you need a different behavior.

(3) Using Multiple Conditions

Sometimes a single variable isn’t enough to identify your population of interest. You might need to flag clients who are both unemployed AND uninsured, or those with both elevated symptoms AND a specific risk factor. These compound conditions help identify subgroups that face multiple challenges simultaneously.

Use & for AND (both conditions must be true) and | for OR (either condition can be true).

📍 Parentheses around each condition are required because & and | bind more tightly than comparison operators such as == and >=.

Example 1: Vulnerable population flag

# Unemployed AND uninsured
df['vulnerable'] = ((df['employed'] == 0) & (df['has_insurance'] == 0)).astype(int)

Example 2: High-risk flag (PHQ-9 Q9 suicidal ideation + severe symptoms)

df['high_risk'] = ((df['phq9_q9'] >= 2) & (df['phq9_total'] >= 15)).astype(int)

Example 3: Race/Ethnicity

Federal datasets typically collect Hispanic/Latino ethnicity separately from race. When you need a single mutually exclusive race/ethnicity variable, Hispanic ethnicity usually takes precedence:

def create_race_ethnicity(row):
    if row['hispanic'] == 1:
        return 'Hispanic/Latino'
    elif row['race'] == 1:
        return 'White, non-Hispanic'
    elif row['race'] == 2:
        return 'Black, non-Hispanic'
    elif row['race'] == 3:
        return 'Asian, non-Hispanic'
    else:
        return 'Other/Multiple, non-Hispanic'

df['race_ethnicity'] = df.apply(create_race_ethnicity, axis=1)
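
Row-wise .apply() is flexible but slow on large files. If performance matters, the same first-match logic can be written in vectorized form with numpy's np.select, which evaluates the conditions in order just like the if/elif chain above (a sketch, assuming the same hispanic and race codes):

import numpy as np

conditions = [
    df['hispanic'] == 1,
    df['race'] == 1,
    df['race'] == 2,
    df['race'] == 3,
]
choices = ['Hispanic/Latino',
           'White, non-Hispanic',
           'Black, non-Hispanic',
           'Asian, non-Hispanic']

# The first matching condition wins, so Hispanic ethnicity takes precedence
df['race_ethnicity'] = np.select(conditions, choices,
                                 default='Other/Multiple, non-Hispanic')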

Example 4: Immigrant Generation

Combining nativity (birthplace) and age at migration creates an ordinal variable that captures both migration status and developmental timing of arrival:

def create_immigrant_generation(row):
    if row['foreign_born'] == 0:
        return '2nd+ generation'  # US-born
    elif row['age_at_migration'] <= 12:
        return '1.5 generation'   # Foreign-born, arrived as child
    else:
        return '1st generation'   # Foreign-born, arrived as adult

df['immigrant_gen'] = df.apply(create_immigrant_generation, axis=1)

The ordering here matters: 1st generation → 1.5 generation → 2nd+ generation reflects increasing exposure to US institutions during formative years.
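
To make pandas respect that ordering in sorts, groupbys, and plots, you can store the variable as an ordered categorical (a sketch, continuing the example above):

df['immigrant_gen'] = pd.Categorical(
    df['immigrant_gen'],
    categories=['1st generation', '1.5 generation', '2nd+ generation'],
    ordered=True)

# Sorting and comparisons now follow the substantive order, not alphabetical order
df.sort_values('immigrant_gen')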

(4) Handling Missing Values

Missing data is inevitable in real-world datasets. How you handle it depends on why it’s missing and what you’re trying to do. Before choosing a strategy, examine the pattern: How much is missing? Which variables? Is missingness random or systematic?
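
A quick way to answer the “how much and which variables” questions is to count NaNs per column and inspect the incomplete rows (a minimal sketch on a toy DataFrame):

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, 34, np.nan, 51],
                   'income': [32500, np.nan, np.nan, 41000],
                   'phq9_total': [7, 12, 3, np.nan]})

# Count and proportion of missing values per column
print(df.isna().sum())
print(df.isna().mean().round(2))

# Rows with at least one missing value
print(df[df.isna().any(axis=1)])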

Strategy 1: Drop incomplete cases

Use this when partial data is invalid (e.g., a PHQ-9 total requires all 9 items). This method is known as “listwise deletion”: removing any row that has a missing value in any of the specified columns.

df_complete = df.dropna(subset=phq9_cols)

Strategy 2: Imputation

Replace missing values with a measure of central tendency, such as the median. Use this when the amount of missingness is small and the data are plausibly missing at random. Please read this post to learn more about handling missing data with imputation.

df['income'] = df['income'].fillna(df['income'].median())
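
A single overall median can blur subgroup differences; one common refinement is to impute within groups, such as the median income within each age group (a sketch, assuming the age_group column created earlier):

# Impute each client's missing income with the median of their age group
df['income'] = df['income'].fillna(
    df.groupby('age_group', observed=True)['income'].transform('median'))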

Strategy 3: Missing indicator

Create a flag that records which values were missing. Build the flag before imputing, since imputation overwrites the very NaNs the flag is based on.

df['income_missing'] = df['income'].isnull().astype(int)

Same Variable, Multiple Representations

Different variable types enable different visualizations: a continuous variable suits a histogram, ordinal categories suit an ordered bar chart, and a binary flag reduces to a single proportion. Keep data at the highest level until you need to convert.
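
As a minimal illustration, here is one variable carried at three levels, each supporting different summaries and charts:

import pandas as pd

ages = pd.Series([19, 24, 31, 46, 58, 63], name='age')

# Continuous: full detail; supports means, medians, histograms
print(ages.describe())

# Ordinal: age groups; supports ordered bar charts and stratified tables
age_group = pd.cut(ages, bins=[0, 25, 40, 55, 100],
                   labels=['18-25', '26-40', '41-55', '56+'])
print(age_group.value_counts(sort=False))

# Binary: a flag; the mean of a 0/1 flag is the proportion of young adults
print((ages <= 30).astype(int).mean())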

Resources

pandas missing data: https://pandas.pydata.org/docs/user_guide/missing_data.html
