[Python] Data Wrangling and Variable Manipulation using pandas
Raw data is rarely ready for analysis. Survey responses have missing values. Administrative records use codes that need translation. Variables exist in forms that don’t match your analytical needs—age as a continuous number when you need age groups, nine separate PHQ-9 items when you need a total score.
Data wrangling refers to the process between “I have data” and “I can analyze it.” This includes cleaning (handling missing values, fixing errors), transforming (converting variable types, creating categories), and deriving new variables from existing ones. Practitioners commonly estimate that 60-80% of data work happens here.
For example, you receive client intake data with PHQ-9 responses stored as nine separate columns. Your funder wants depression rates by age group. You need to (1) sum the nine items into a total score, (2) apply clinical cutoffs, and (3) create age groups. That’s data wrangling.
This post covers variable types, variable transformation, composite variable creation, and missing data handling.
Understanding Variable Types
The type of variable determines what operations are valid, what statistics you can calculate, and how to visualize it.
| Variable Type | Description | Examples | Statistics |
|---|---|---|---|
| Continuous | Numeric values with meaningful differences and ratios | Age (18, 25, 34), Income ($32,500), PHQ-9 score (0-27) | Mean, median, standard deviation, range |
| Ordinal | Categories with a natural order, but unequal intervals | Education level (HS < College < Graduate), Depression severity (Minimal < Mild < Moderate) | Median, mode, percentiles |
| Nominal | Categories without inherent order | Gender (Male, Female, Non-binary), Service type (Counseling, Housing) | Mode, frequencies |
| Binary | Two mutually exclusive categories | Employed (Yes/No), Has insurance (0/1) | Proportions, odds ratios |
📍 You can move down this hierarchy (continuous → ordinal → binary), but not up. Once you create age groups, you cannot recover original ages. Keep data at the highest level until you have a reason to convert.
This hierarchy matters because it determines your analysis (and visualization) options. You might need age as continuous for one analysis and as age groups for another.
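pandas can encode this hierarchy directly: continuous variables live in numeric columns, while ordinal variables can be stored as ordered categoricals so the ranking survives sorting and comparisons. A minimal sketch, assuming a DataFrame df with a hypothetical education column coded 'HS'/'College'/'Graduate' (the imports here are used throughout this post):
import pandas as pd
import numpy as np

# Ordered categorical: pandas now knows HS < College < Graduate
df['education'] = pd.Categorical(df['education'],
                                 categories=['HS', 'College', 'Graduate'],
                                 ordered=True)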
Variable Transformation
| Operation | Description | Example |
|---|---|---|
| Type conversion | Continuous → Ordinal → Binary | Age → Age groups → Young adult (Y/N); PHQ-9 score → Severity level |
| Aggregation | Combine multiple items into one score | PHQ-9 Q1-Q9 → Total score |
| Co-occurrence | Flag cases meeting multiple conditions | Unemployed AND Uninsured → Vulnerable |
| Missing handling | Drop, impute, or flag missing values | Income missing → Median imputation |
(1) Type Conversion
Converting continuous variables to categorical or binary is one of the most common data wrangling tasks. You might need age groups for stratified analysis, or a binary flag to identify a target population. The key is choosing appropriate cutoffs: these should be driven by theory, clinical guidelines, or your research question, not arbitrary decisions.
A. Continuous → Ordinal
There are several approaches to converting continuous variables into ordinal categories. You might use equal-width bins (every 10 years of age), quantile-based bins (quartiles, so each group has roughly 25% of cases), or theory-driven cutoffs (clinical thresholds established in the literature). The right choice depends on your research question and how the variable will be used in analysis.
Example: Age → Age groups
pd.cut() bins continuous values into ordered categories. You specify the bin edges and labels. The function assigns each value to the appropriate bin.
# Right-inclusive bins: (0, 25] is labeled '18-25', assuming an adult (18+) sample
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 25, 40, 55, 100],
                         labels=['18-25', '26-40', '41-55', '56+'])
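For the quantile-based approach mentioned above, pd.qcut computes the bin edges from the data itself, so each group holds roughly the same share of cases:
# Quartiles: each group contains roughly 25% of cases
df['age_quartile'] = pd.qcut(df['age'], q=4,
                             labels=['Q1', 'Q2', 'Q3', 'Q4'])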
Example: PHQ-9 total → Severity categories
Continuous scores often need to be translated into meaningful categories for interpretation or reporting. Clinical measures typically have established cutoffs—PHQ-9 uses 5/10/15/20 to define severity levels; BMI uses 18.5/25/30 to define weight categories.
.apply() runs a custom function on every value in a column. You define the categorization logic once, and pandas handles the iteration. This approach is flexible—you can implement any conditional logic you need.
def phq9_severity(score):
    if pd.isna(score):
        return np.nan          # preserve missingness rather than mislabeling it
    elif score <= 4:
        return 'Minimal'
    elif score <= 9:
        return 'Mild'
    elif score <= 14:
        return 'Moderate'
    elif score <= 19:
        return 'Moderately Severe'
    else:
        return 'Severe'

df['phq9_severity'] = df['phq9_total'].apply(phq9_severity)
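Since these cutoffs are fixed bin edges, the same mapping can also be written with pd.cut; missing scores come through as NaN either way:
df['phq9_severity'] = pd.cut(df['phq9_total'],
                             bins=[-1, 4, 9, 14, 19, 27],
                             labels=['Minimal', 'Mild', 'Moderate',
                                     'Moderately Severe', 'Severe'])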

B. Continuous → Binary
A boolean comparison returns True/False, which .astype(int) converts to 1/0. This is useful for creating eligibility flags, clinical thresholds, or grouping variables for comparison.
Example: Age → Young adult flag
df['is_young_adult'] = (df['age'] <= 30).astype(int)

(2) Aggregation: Creating Composite Scores
Many standardized measures consist of multiple items that need to be combined into a total or subscale score. The PHQ-9 has 9 items; the GAD-7 has 7. You rarely analyze individual items—instead, you sum (or average) them into a composite score that represents the underlying construct.
.sum(axis=1) adds values across columns for each row. If you need an average instead, use .mean(axis=1).
Example: PHQ-9 items → Total score
phq9_cols = [f'phq9_q{i}' for i in range(1, 10)]
df['phq9_total'] = df[phq9_cols].sum(axis=1, min_count=9)
Note: By default, .sum() skips missing values, which would silently produce partial totals. Passing min_count=9 makes the sum return NaN whenever any item is missing. This is intentional—a partial PHQ-9 score is not clinically valid. Handle missing items before aggregation if you need a different behavior.
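If your scoring protocol does permit partial completion, prorating is one option. The rule below is a hypothetical illustration, not a PHQ-9 standard—check your measure's scoring guidance:
# Hypothetical rule: require at least 8 of 9 answered items,
# then rescale the mean of the answered items to the 9-item metric
answered = df[phq9_cols].notna().sum(axis=1)
df['phq9_total_prorated'] = (df[phq9_cols].mean(axis=1) * 9).where(answered >= 8)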
(3) Using Multiple Conditions
Sometimes a single variable isn’t enough to identify your population of interest. You might need to flag clients who are both unemployed AND uninsured, or those with both elevated symptoms AND a specific risk factor. These compound conditions help identify subgroups that face multiple challenges simultaneously.
Use & for AND (both conditions must be true) and | for OR (either condition can be true).
📍 Parentheses around each condition are required: & and | bind more tightly than comparisons like ==, so without them pandas evaluates the expression in the wrong order.
Example 1: Vulnerable population flag
# Unemployed AND uninsured
df['vulnerable'] = ((df['employed'] == 0) & (df['has_insurance'] == 0)).astype(int)
Example 2: High-risk flag (PHQ-9 Q9 suicidal ideation + severe symptoms)
df['high_risk'] = ((df['phq9_q9'] >= 2) & (df['phq9_total'] >= 15)).astype(int)
Example 3: Race/Ethnicity
Federal datasets typically collect Hispanic/Latino ethnicity separately from race. When you need a single mutually exclusive race/ethnicity variable, Hispanic ethnicity usually takes precedence:
def create_race_ethnicity(row):
    if row['hispanic'] == 1:
        return 'Hispanic/Latino'
    elif row['race'] == 1:
        return 'White, non-Hispanic'
    elif row['race'] == 2:
        return 'Black, non-Hispanic'
    elif row['race'] == 3:
        return 'Asian, non-Hispanic'
    else:
        return 'Other/Multiple, non-Hispanic'

df['race_ethnicity'] = df.apply(create_race_ethnicity, axis=1)
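Row-wise .apply() is readable but slow on large datasets. The same top-to-bottom logic can be vectorized with np.select, which, like the if/elif chain, uses the first condition that matches (a sketch reusing the same hypothetical column codes):
conditions = [
    df['hispanic'] == 1,
    df['race'] == 1,
    df['race'] == 2,
    df['race'] == 3,
]
labels = ['Hispanic/Latino', 'White, non-Hispanic',
          'Black, non-Hispanic', 'Asian, non-Hispanic']
df['race_ethnicity'] = np.select(conditions, labels,
                                 default='Other/Multiple, non-Hispanic')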
Example 4: Immigrant Generation
Combining nativity (birthplace) and age at migration creates an ordinal variable that captures both migration status and developmental timing of arrival:
def create_immigrant_generation(row):
    if row['foreign_born'] == 0:
        return '2nd+ generation'   # US-born
    elif row['age_at_migration'] <= 12:
        return '1.5 generation'    # Foreign-born, arrived as a child
    else:
        return '1st generation'    # Foreign-born, arrived as an adult

df['immigrant_gen'] = df.apply(create_immigrant_generation, axis=1)
The ordering here matters: 1st generation → 1.5 generation → 2nd+ generation reflects increasing exposure to US institutions during formative years.
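To make pandas respect that ordering in sorts and group-by output, store the variable as an ordered categorical, the same pattern as the education example above:
gen_order = ['1st generation', '1.5 generation', '2nd+ generation']
df['immigrant_gen'] = pd.Categorical(df['immigrant_gen'],
                                     categories=gen_order, ordered=True)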
(4) Handling Missing Values
Missing data is inevitable in real-world datasets. How you handle it depends on why it’s missing and what you’re trying to do. Before choosing a strategy, examine the pattern: How much is missing? Which variables? Is missingness random or systematic?
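A quick way to examine that pattern—how much is missing, and in which variables:
missing_count = df.isnull().sum()
missing_pct = (df.isnull().mean() * 100).round(1)
print(missing_count[missing_count > 0])   # columns with any missing values
print(missing_pct[missing_pct > 0])       # share of rows missing, in percent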

Strategy 1: Drop incomplete cases
Use when partial data is invalid (e.g., a PHQ-9 total requires all 9 items). This method is called “listwise deletion”: any row with at least one missing value in the specified columns is removed.
df_complete = df.dropna(subset=phq9_cols)
Strategy 2: Imputation
Replace missing values with a measure of central tendency, such as the median. Use this when the amount of missingness is small and random. Please read this post to learn more about how to handle missing data with imputation.
df['income'] = df['income'].fillna(df['income'].median())

Strategy 3: Missing indicator
Create a flag to preserve information about missingness.
df['income_missing'] = df['income'].isnull().astype(int)
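The indicator combines naturally with imputation—create the flag first, then fill, so the information about missingness survives:
# Flag first, then impute; reversing the order would leave the flag all zeros
df['income_missing'] = df['income'].isnull().astype(int)
df['income'] = df['income'].fillna(df['income'].median())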
Same Variable, Multiple Representations
Different variable types enable different summaries and visualizations: a histogram for continuous age, a bar chart for age groups, a single proportion for a binary flag. Keep data at the highest level until you need to convert.
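Using the variables created earlier in this post, the same underlying information yields three different summaries:
df['age'].describe()              # continuous: mean, SD, quartiles
df['age_group'].value_counts()    # ordinal: frequency per group
df['is_young_adult'].mean()       # binary: proportion flagged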

Resources
pandas missing data: https://pandas.pydata.org/docs/user_guide/missing_data.html
