[Python] Introduction to pandas library for data analysis

This post introduces pandas, Python’s essential library for working with data. If the previous post on Python basics felt like learning the alphabet, pandas is where you start writing sentences. Actually, pandas is where you start writing entire reports.

Why pandas?

pandas is Python’s tool for working with structured data—spreadsheets, databases, survey results, case records. The name comes from “panel data,” an economics term for datasets tracking the same people over time. But pandas handles any tabular data: client intake forms, program evaluations, administrative records, census data.

Think of pandas as Excel with superpowers. You can load a CSV file, filter rows, calculate statistics, handle missing data, merge datasets, and prepare everything for visualization. The difference is that your work is reproducible. Write your code once, run it on updated data anytime. No clicking through menus trying to remember what you did last month.

More importantly for our purposes, pandas integrates seamlessly with AI tools. You can ask ChatGPT or Claude “how do I filter this DataFrame for clients under 25?” and get working code. The AI understands pandas because pandas is standard across data work in Python.

Your First DataFrame: Loading Data

Let’s start with the most common task—loading a CSV file. CSV (comma-separated values) files are the universal format for datasets. Excel can save as CSV, databases export to CSV, survey platforms download as CSV.

Open Google Colab and create a new notebook. Upload your CSV through the file panel on the left (or point the path at wherever your file lives), then type this:

Python
import pandas as pd

# Load a CSV file
df = pd.read_csv('client_data.csv')

# Look at the first few rows
print(df.head())

Three things happened here:

  1. import pandas as pd loads the pandas library. The as pd part is a convention—everyone abbreviates pandas as pd. It’s like how everyone calls the Social Security Administration “SSA.” You could type pandas every time, but why would you?
  2. pd.read_csv() reads the file and creates a DataFrame. A DataFrame is pandas’ core structure—a table with rows and columns, like a spreadsheet.
  3. .head() shows the first 5 rows. This is always your first move with new data. You want to see what you’re working with before doing anything else.
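
Both .head() and its counterpart .tail() take an optional row count, which is handy for a slightly wider peek. A quick sketch, assuming the same hypothetical client_data.csv as above:

Python
# First 10 rows instead of the default 5
print(df.head(10))

# Last 3 rows
print(df.tail(3))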

Understanding DataFrames

A DataFrame has three main components:

Columns: Variables in your dataset (age, income, zip_code, service_type)
Rows: Observations or cases (individual clients, program sites, survey responses)
Index: Row labels (usually just numbers: 0, 1, 2, 3…)

Think of it like a case file system. Each row is a client file. Each column is a specific piece of information you track for every client. The index is like a file number—a unique identifier for each row.

Here’s a simple example:

Python
# Create a DataFrame from scratch
data = {
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [28, 45, 34],
    'income': [18000, 32000, 25000]
}

df = pd.DataFrame(data)
print(df)

Output:

    name  age  income
0  Alice   28   18000
1    Bob   45   32000
2  Carol   34   25000

The numbers 0, 1, 2 on the left are the index. The column names are at the top. Each row represents one person.

Exploring Your Data

Before analyzing anything, you need to understand what you have. pandas gives you several ways to examine your data:

Python
# Basic information
print(df.info())

# Summary statistics
print(df.describe())

# Check the shape (rows, columns)
print(df.shape)

# List all column names
print(df.columns)

.info() tells you how many rows you have, what columns exist, and what data types they are. This is like reading the cover sheet of a case file before diving into the details.

.describe() gives you summary statistics for numeric columns: count, mean, standard deviation, min, max, and the quartiles (the 50% row is the median). This is your quick overview: “what’s the average age in this dataset? what’s the range of income?”

.shape returns a tuple like (500, 8) meaning 500 rows and 8 columns. Quick way to check how much data you’re working with.

.columns lists all column names. Useful when you can’t remember if it’s called “client_age” or “age” or “Age” (pandas is case-sensitive, just like Python).

Selecting Data

Most data work involves selecting specific parts of your dataset. pandas makes this straightforward.

Selecting columns:

Python
# Single column (returns a Series)
ages = df['age']

# Multiple columns (returns a DataFrame)
subset = df[['name', 'income']]

A “Series” is pandas’ term for a single column. It’s like a list, but with an index and more functionality. A DataFrame is a collection of Series.
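
If you want to see the difference for yourself, check the types. A minimal sketch using the small example DataFrame from above:

Python
# Single brackets give a Series, double brackets give a DataFrame
print(type(df['age']))      # <class 'pandas.core.series.Series'>
print(type(df[['age']]))    # <class 'pandas.core.frame.DataFrame'>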

Selecting rows by position:

Python
# First row
first_client = df.iloc[0]

# First three rows
first_three = df.iloc[0:3]

# Last row
last_client = df.iloc[-1]

.iloc stands for “integer location.” You’re selecting by position, like “give me the 5th row” (which would be df.iloc[4] because we count from 0, remember?).
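
.iloc has a label-based sibling, .loc, which selects by index label and column name (we’ll lean on it again later for assignments). A quick sketch:

Python
# Row with index label 0, just the 'name' column
print(df.loc[0, 'name'])

# All rows, two specific columns
print(df.loc[:, ['name', 'income']])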

Selecting rows by condition:

Python
# Clients under 30
young_clients = df[df['age'] < 30]

# High-income clients
high_income = df[df['income'] >= 30000]

# Combining conditions
young_high_income = df[(df['age'] < 30) & (df['income'] >= 30000)]

This is where pandas starts to feel powerful. Instead of manually filtering through rows, you write a condition and pandas does the work. The result is a new DataFrame containing only rows that match.

The & symbol means “and” (both conditions must be true). Use | for “or” (either condition can be true). The parentheses around each condition are required—another Python quirk that will trip you up if you forget.
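
Here’s the “or” version in action, a sketch using the same hypothetical columns:

Python
# Either condition can be true: | means "or"
flagged = df[(df['age'] < 25) | (df['income'] < 20000)]
print(flagged)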

Handling Missing Data

Real-world data is messy. Missing values are everywhere—clients who didn’t answer a question, fields that weren’t filled out, data entry errors. pandas represents missing data as NaN (Not a Number).

Python
# Check for missing values
print(df.isnull())

# Count missing values per column
print(df.isnull().sum())

# Drop rows with any missing values
df_clean = df.dropna()

# Fill missing values with a default
df_filled = df.fillna(0)

# Fill missing values with column mean
df['income'] = df['income'].fillna(df['income'].mean())

.isnull() returns a DataFrame of True/False values showing where data is missing.

.dropna() removes any row with missing data. Use this cautiously—you might be throwing away valuable information.

.fillna() replaces missing values with something else. What you use depends on context. For income, maybe you fill with the median. For service type, maybe you fill with “Unknown.” For a yes/no question, maybe you can’t fill at all and need to keep the missing value.

This is where your judgment as a social worker matters. The technical operation is simple (pandas does it in one line), but the decision about what to do with missing data requires understanding your context. Why is the data missing? Is it missing at random, or is there a pattern? Would filling it distort your analysis?
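
To make that concrete, here’s a sketch of context-dependent fills along the lines described above (service_type is a hypothetical column):

Python
# Numeric column: fill with the median, which resists outliers
df['income'] = df['income'].fillna(df['income'].median())

# Categorical column: label the gap explicitly rather than guessing
df['service_type'] = df['service_type'].fillna('Unknown')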

Creating New Columns

Often you need to create new variables based on existing ones:

Python
# Calculate age groups
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 40, 100], labels=['18-25', '26-40', '41+'])

# Create eligibility flag
df['eligible'] = df['income'] < 20000

# Combine columns
df['full_name'] = df['first_name'] + ' ' + df['last_name']

# Apply a function
def risk_score(income):
    if income < 15000:
        return 'High'
    elif income < 25000:
        return 'Medium'
    else:
        return 'Low'

df['risk_level'] = df['income'].apply(risk_score)

pd.cut() bins numeric data into categories. Great for creating age groups, income brackets, or any classification based on ranges.

.apply() runs a function on every value in a column. You define the function (like the risk_score function above), then apply it to the entire column. pandas handles the loop for you.
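
For one-off transformations you don’t need a named function; a lambda works too. A small sketch:

Python
# Inline function: format income as a display string like $18,000
df['income_label'] = df['income'].apply(lambda x: f"${x:,.0f}")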

Sorting and Ranking

Python
# Sort by age
df_sorted = df.sort_values('age')

# Sort by multiple columns
df_sorted = df.sort_values(['age', 'income'], ascending=[True, False])

# Rank clients by income
df['income_rank'] = df['income'].rank()

Sorting is straightforward. The ascending parameter controls whether you sort low-to-high or high-to-low. You can sort by multiple columns (like “sort by age, then within each age group sort by income descending”).

Grouping and Aggregating

This is where pandas really shines. You can group data by categories and calculate statistics for each group:

Python
# Average income by age group
df.groupby('age_group')['income'].mean()

# Multiple statistics
df.groupby('age_group')['income'].agg(['mean', 'median', 'count'])

# Group by multiple columns
df.groupby(['age_group', 'service_type'])['sessions'].sum()

.groupby() splits your data into groups based on a column (like “group by service type”). Then you can calculate statistics for each group. This is exactly what pivot tables do in Excel, but more flexible and reproducible.

For example, if you want to know “what’s the average number of sessions per service type?” you would do:

Python
df.groupby('service_type')['sessions'].mean()

Output might be:

service_type
counseling       8.5
employment      12.3
housing          6.7

One line of code, clear result.
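
Since the Excel comparison comes up a lot: pandas also has pd.pivot_table(), which wraps groupby in a pivot-table-shaped interface. A sketch, assuming the same hypothetical columns:

Python
# Rows = service type, columns = age group, cells = average sessions
pd.pivot_table(df, values='sessions', index='service_type',
               columns='age_group', aggfunc='mean')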

Merging Datasets

Often your data lives in multiple files. Client demographics in one file, service usage in another, outcomes in a third. You need to combine them:

Python
# Load two datasets
clients = pd.read_csv('clients.csv')
services = pd.read_csv('services.csv')

# Merge on client ID
combined = pd.merge(clients, services, on='client_id')

This is like a JOIN in SQL or a VLOOKUP in Excel, but cleaner. You specify which column to match on (client_id), and pandas combines the datasets. All the information about each client from both files ends up in one DataFrame.
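
One detail worth knowing: by default pd.merge() keeps only clients that appear in both files (an “inner” join). If you want to keep every client and accept blanks where service data is missing, pass how='left'. A sketch:

Python
# Keep all rows from clients; unmatched service fields become NaN
combined = pd.merge(clients, services, on='client_id', how='left')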

Why This Matters: From Code to Insight

Everything we’ve covered—loading data, selecting rows, handling missing values, creating new columns, grouping—these are the building blocks of data analysis in social work. You use them constantly:

  • Program evaluation: Load client data, filter for participants in a specific program, calculate completion rates by demographic group
  • Needs assessment: Merge census data with service usage data, identify underserved populations
  • Grant reporting: Calculate aggregate statistics broken down by funding stream
  • Policy analysis: Load administrative data, clean it, transform it, prepare it for visualization

pandas makes this work systematic and reproducible. Write your code once, run it on updated data every month. Document your decisions (how you defined “completion,” how you handled missing data) right in your code. Share it with colleagues who can see exactly what you did.

pandas Quirks You Need to Know

Before we look at a complete example, some practical notes:

Copies vs. Views: When you select data from a DataFrame, sometimes you get a copy, sometimes you get a view. This matters when you try to modify the data:

Python
# This might not work as expected
df[df['age'] < 30]['income'] = 50000  # Warning: SettingWithCopyWarning

# Do this instead
df.loc[df['age'] < 30, 'income'] = 50000

If you see SettingWithCopyWarning, rewrite the chained selection as a single .loc[] call.
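
And if you genuinely want a separate DataFrame to modify, make the copy explicit with .copy(); the warning goes away because there’s no ambiguity. A sketch:

Python
# Explicit copy: edits here won't touch (or warn about) the original df
young = df[df['age'] < 30].copy()
young['income'] = 50000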

Index matters (sometimes): Most of the time you can ignore the index. But when you filter or sort data, the index doesn’t automatically reset:

Python
# After filtering, index might be [0, 5, 7, 12...]
filtered = df[df['age'] < 30]

# Reset it if needed
filtered = filtered.reset_index(drop=True)

Column names with spaces are annoying: If a column is named "Client Name" you have to use df['Client Name'] not df.Client Name. Better to rename:

Python
# Replace spaces and lowercase everything: 'Client Name' -> 'client_name'
df.columns = df.columns.str.replace(' ', '_').str.lower()

Now you can use df.client_name.
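
For renaming just a few columns, .rename() with a mapping is more targeted (the old and new names here are illustrative):

Python
# Rename specific columns without touching the rest
df = df.rename(columns={'Client Name': 'client_name', 'Zip': 'zip_code'})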

A Complete Example

Here’s a realistic social work data analysis combining everything we’ve covered:

Python
import pandas as pd

# Load client data
df = pd.read_csv('program_participants.csv')

# Initial exploration
print(f"Dataset has {df.shape[0]} clients and {df.shape[1]} variables")
print(df.head())
print(df.info())

# Clean data
# Handle missing income data (fill with median)
df['income'] = df['income'].fillna(df['income'].median())

# Remove duplicate records
df = df.drop_duplicates(subset='client_id')

# Create new variables
# Income eligibility
df['income_eligible'] = df['income'] < 25000

# Age groups
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 40, 65, 100], 
                          labels=['18-25', '26-40', '41-65', '65+'])

# Program completion (attended at least 80% of sessions)
df['completed'] = (df['sessions_attended'] / df['sessions_total']) >= 0.8

# Analysis
# Overall completion rate
completion_rate = df['completed'].mean()
print(f"Overall completion rate: {completion_rate:.1%}")

# Completion by age group
completion_by_age = df.groupby('age_group')['completed'].mean()
print("\nCompletion rates by age group:")
print(completion_by_age)

# Average sessions by service type
sessions_by_service = df.groupby('service_type')['sessions_attended'].mean()
print("\nAverage sessions by service type:")
print(sessions_by_service)

# Demographics of completers vs non-completers
demo_comparison = df.groupby('completed')[['age', 'income']].mean()
print("\nDemographics comparison:")
print(demo_comparison)

# Save cleaned data for visualization
df.to_csv('program_data_cleaned.csv', index=False)

This example shows a typical workflow: load data, explore it, clean it, create new variables, analyze patterns, save results. Each step is clear and documented. Someone else (or future you) can read this code and understand exactly what happened.

Moving Forward: Visualization

pandas prepares your data. Visualization tells the story. In the next post, we’ll introduce matplotlib and seaborn, Python’s visualization libraries. You’ll see how to take a cleaned DataFrame and create charts that communicate findings to stakeholders, funders, or community members.

The data work we’ve covered—filtering, grouping, aggregating—sets up your visualization. You don’t visualize raw data. You clean it, structure it, and calculate what you need. Then you visualize the result.

Resources

  • pandas documentation: https://pandas.pydata.org/docs/
  • pandas cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
  • 10 minutes to pandas: https://pandas.pydata.org/docs/user_guide/10min.html

In the next post, we’ll turn these DataFrames into visualizations that make patterns visible and arguments clear.
