[Python] Comparing Groups: Visualizing Distributions for Categorical Variables (matplotlib/seaborn)

In the previous post, we covered visualizing distributions of continuous outcomes across groups: box plots, violin plots, and strip plots for comparing measures like PHQ-9 scores or service hours between different client populations. But many outcomes in social science research are not continuous.

Let’s think about some examples. Did the client complete treatment? What was their discharge status? Which service pathway did they follow? These are categorical outcomes, and visualizing them requires different tools.

This post covers how to visualize the distribution of categorical variables, compare categorical outcomes across groups, and examine associations between two categorical variables. We’ll use seaborn’s countplot(), grouped and stacked bar charts, crosstabs, and heatmaps, and finish with a chi-square test of independence.

Understanding Categorical Variables

Categorical outcomes (also called qualitative outcomes) are variables whose values fall into distinct categories rather than along a numerical scale. In social work data, you’ll encounter these constantly.

Binary outcomes have exactly two categories: completed/not completed, eligible/ineligible, housed/unhoused, screened positive/negative.
Nominal outcomes have multiple unordered categories: discharge status (completed, dropped out, transferred, referred elsewhere), service type (case management, counseling, crisis intervention), referral source (self, family, court, hospital).
Ordinal outcomes have multiple ordered categories: satisfaction rating (very dissatisfied, dissatisfied, neutral, satisfied, very satisfied), severity level (mild, moderate, severe). Ordinal data sits between nominal and continuous; for visualization purposes, we often treat it as categorical.
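
For ordinal variables, it can help to tell pandas about the order explicitly so that tables and plots don’t fall back to alphabetical or first-appearance order. The sketch below is a minimal illustration; the "need level" ratings and variable names are hypothetical and not part of the examples in this post.

```python
import pandas as pd

# Hypothetical ordinal ratings; values and names are illustrative only.
need_level = pd.Series(['medium', 'low', 'high', 'low', 'medium', 'low'])

# Declare the order explicitly so pandas knows low < medium < high.
levels = ['low', 'medium', 'high']
need_cat = need_level.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# sort_index() on a categorical index follows the declared order,
# not the alphabet (which would give high, low, medium).
counts = need_cat.value_counts().sort_index()
print(counts)

# Ordered categoricals also support comparisons such as min() and max().
print(need_cat.max())
```

Seaborn generally respects the declared category order of a categorical column, so this is one way to avoid repeating an order list in every plotting call.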

The goal with categorical outcomes is usually to show how many observations fall into each category, what proportion of a group has a particular outcome, or whether there’s an association between two categorical variables.

Counting Categories with countplot()

The most basic task is counting how many observations fall into each category. Seaborn’s countplot() does this automatically.

Python

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Create mock data: client discharge status
np.random.seed(42)
n = 200

data = pd.DataFrame({
    'discharge_status': np.random.choice(
        ['Completed', 'Dropped Out', 'Transferred', 'Referred'],
        size=n,
        p=[0.45, 0.30, 0.15, 0.10]
    )
})

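status_order = ['Completed', 'Dropped Out', 'Transferred', 'Referred']

plt.figure(figsize=(8, 5))
# seaborn >= 0.13 deprecates passing palette without hue, so map hue to the
# same variable as x and hide the redundant legend.
sns.countplot(data=data, x='discharge_status', hue='discharge_status',
              order=status_order, hue_order=status_order,
              palette=['#0077BB', '#EE7733', '#009988', '#CC3311'],
              legend=False)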

plt.xlabel('Discharge Status', fontsize=11)
plt.ylabel('Number of Clients', fontsize=11)
plt.title('Client Discharge Status', fontsize=13)
plt.tight_layout()
plt.show()

The order parameter controls the sequence of bars. Without it, seaborn orders string categories by first appearance in the data, or by the declared category order if the column has a pandas categorical dtype. For discharge status, a meaningful order (completed first, then less desirable outcomes) helps viewers interpret the results.
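
When there is no natural order, sorting bars from most to least frequent is a common alternative. The following is a small sketch of deriving such an order with pandas; the ten-row series is made up for illustration.

```python
import pandas as pd

# Small made-up sample of discharge statuses (illustrative only).
statuses = pd.Series(
    ['Dropped Out', 'Completed', 'Referred', 'Completed', 'Transferred',
     'Completed', 'Dropped Out', 'Completed', 'Transferred', 'Dropped Out'],
    name='discharge_status',
)

# value_counts() sorts from most to least frequent by default,
# so its index can be used directly as a bar order.
freq_order = statuses.value_counts().index.tolist()
print(freq_order)
```

The resulting list can then be passed as order=freq_order to countplot().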

Comparing Counts Across Groups with Hue

When you want to compare categorical outcomes across different groups, add a hue parameter to split each category by another variable.

Python

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Create mock data: discharge status by program type
np.random.seed(42)
n = 300

data = pd.DataFrame({
    'program': np.repeat(['Intensive', 'Standard'], n // 2),
    'discharge_status': np.concatenate([
        np.random.choice(['Completed', 'Dropped Out', 'Transferred'],
                         size=n // 2, p=[0.55, 0.30, 0.15]),
        np.random.choice(['Completed', 'Dropped Out', 'Transferred'],
                         size=n // 2, p=[0.40, 0.40, 0.20])
    ])
})

plt.figure(figsize=(9, 5))
sns.countplot(data=data, x='discharge_status', hue='program',
              order=['Completed', 'Dropped Out', 'Transferred'],
              palette=['#0077BB', '#EE7733'])

plt.xlabel('Discharge Status', fontsize=11)
plt.ylabel('Number of Clients', fontsize=11)
plt.title('Discharge Status by Program Type', fontsize=13)
plt.legend(title='Program')
plt.tight_layout()
plt.show()

This grouped bar chart shows raw counts: how many clients in each program had each discharge status. But raw counts can be misleading when group sizes differ. If the Intensive program served 200 clients and the Standard program served 100, we’d expect higher counts across all categories for Intensive, even if the proportions were identical.
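
To make that point concrete, here is a minimal sketch with hand-picked counts (hypothetical numbers chosen so both programs have identical proportions but different sizes):

```python
import pandas as pd

# Hypothetical counts: Intensive serves 200 clients, Standard serves 100,
# with the same 50% / 30% / 20% split in both programs.
counts = pd.DataFrame(
    {'Intensive': [100, 60, 40], 'Standard': [50, 30, 20]},
    index=['Completed', 'Dropped Out', 'Transferred'],
)

# Divide each column by its own total to get within-program proportions.
proportions = counts / counts.sum()

print(counts)
print(proportions)
```

Every Intensive count is larger than its Standard counterpart, yet the proportions are the same, which is exactly the situation a raw-count chart would misrepresent.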

Proportions Over Counts: Normalized Bar Charts

Often the question isn’t “how many” but “what proportion.” What percentage of clients in each program completed treatment? This requires calculating proportions before plotting.

Seaborn’s countplot() has no option to normalize within each hue group (the stat parameter added in seaborn 0.13 normalizes across the whole plot, not per group), so we need to calculate proportions ourselves using pandas.

Python

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Create mock data with different group sizes
np.random.seed(42)

intensive = pd.DataFrame({
    'program': 'Intensive',
    'discharge_status': np.random.choice(
        ['Completed', 'Dropped Out', 'Transferred'],
        size=180, p=[0.55, 0.30, 0.15])
})

standard = pd.DataFrame({
    'program': 'Standard',
    'discharge_status': np.random.choice(
        ['Completed', 'Dropped Out', 'Transferred'],
        size=120, p=[0.40, 0.40, 0.20])
})

data = pd.concat([intensive, standard], ignore_index=True)

# Calculate proportions within each program
proportions = (data.groupby(['program', 'discharge_status'])
               .size()
               .reset_index(name='count'))

# Add total per program and calculate proportion
totals = proportions.groupby('program')['count'].transform('sum')
proportions['proportion'] = proportions['count'] / totals

plt.figure(figsize=(9, 5))
sns.barplot(data=proportions, x='discharge_status', y='proportion', hue='program',
            order=['Completed', 'Dropped Out', 'Transferred'],
            palette=['#0077BB', '#EE7733'])

plt.xlabel('Discharge Status', fontsize=11)
plt.ylabel('Proportion of Clients', fontsize=11)
plt.title('Discharge Status by Program Type (Proportions)', fontsize=13)
plt.legend(title='Program')
plt.ylim(0, 0.7)
plt.tight_layout()
plt.show()

Proportions make the comparison fair regardless of group size. We can now see that a higher proportion of Intensive program clients completed treatment compared to Standard program clients.
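
The within-group proportions can also be computed more compactly with value_counts(normalize=True) on a grouped column. This is a sketch of an equivalent approach on a tiny made-up dataset, not a replacement for the code above; the rename() call is there because the default name of the result differs between pandas versions.

```python
import pandas as pd

# Tiny made-up dataset (illustrative only).
data = pd.DataFrame({
    'program': ['Intensive'] * 4 + ['Standard'] * 4,
    'discharge_status': ['Completed', 'Completed', 'Completed', 'Dropped Out',
                         'Completed', 'Dropped Out', 'Dropped Out', 'Transferred'],
})

# Proportion of each discharge status within each program.
proportions = (data.groupby('program')['discharge_status']
               .value_counts(normalize=True)
               .rename('proportion')
               .reset_index())
print(proportions)

# Sanity check: proportions should sum to 1 within each program.
print(proportions.groupby('program')['proportion'].sum())
```

The resulting long-format table has the same columns that sns.barplot() needs: one row per program and discharge status, plus a proportion column.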

Stacked Bar Charts: Part-to-Whole Comparisons

Stacked bar charts show how categories compose the whole for each group. Each bar represents one group (like a program), and the segments within the bar represent the categories (like discharge status).

Seaborn doesn’t have a native stacked bar chart function, but we can create one using pandas plotting with seaborn styling.

Python

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Create mock data
np.random.seed(42)

data = pd.DataFrame({
    'program': np.repeat(['Intensive', 'Standard', 'Brief'], 100),
    'discharge_status': np.concatenate([
        np.random.choice(['Completed', 'Dropped Out', 'Transferred'],
                         size=100, p=[0.55, 0.30, 0.15]),
        np.random.choice(['Completed', 'Dropped Out', 'Transferred'],
                         size=100, p=[0.40, 0.40, 0.20]),
        np.random.choice(['Completed', 'Dropped Out', 'Transferred'],
                         size=100, p=[0.50, 0.35, 0.15])
    ])
})

# Create crosstab and normalize by row (program)
ct = pd.crosstab(data['program'], data['discharge_status'], normalize='index')
ct = ct[['Completed', 'Dropped Out', 'Transferred']]  # Reorder columns

# Set seaborn style
sns.set_style('whitegrid')

# Plot stacked bar chart
ax = ct.plot(kind='bar', stacked=True, 
             color=['#0077BB', '#EE7733', '#009988'],
             figsize=(8, 5),
             edgecolor='white',
             linewidth=1)

plt.xlabel('Program Type', fontsize=11)
plt.ylabel('Proportion of Clients', fontsize=11)
plt.title('Discharge Status Composition by Program', fontsize=13)
plt.legend(title='Discharge Status', bbox_to_anchor=(1.02, 1), loc='upper left')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

Stacked bar charts make it easy to compare the overall composition across groups. The Intensive program bar shows more “Completed” (blue) than the Standard program bar. But comparing the middle segments (Dropped Out) is harder because they don’t share a common baseline. This is a known limitation of stacked bar charts.

100% Stacked Bar Charts

The chart above is already a 100% stacked bar chart: normalize='index' in the crosstab scales every bar to the same height, so the segments show proportions. To stack raw counts instead, remove that parameter.

Python

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Create mock data with different group sizes
np.random.seed(42)

data = pd.DataFrame({
    'program': np.concatenate([
        np.repeat('Intensive', 180),
        np.repeat('Standard', 120),
        np.repeat('Brief', 60)
    ]),
    'discharge_status': np.concatenate([
        np.random.choice(['Completed', 'Dropped Out', 'Transferred'],
                         size=180, p=[0.55, 0.30, 0.15]),
        np.random.choice(['Completed', 'Dropped Out', 'Transferred'],
                         size=120, p=[0.40, 0.40, 0.20]),
        np.random.choice(['Completed', 'Dropped Out', 'Transferred'],
                         size=60, p=[0.50, 0.35, 0.15])
    ])
})

# Crosstab with raw counts (not normalized)
ct_counts = pd.crosstab(data['program'], data['discharge_status'])
ct_counts = ct_counts[['Completed', 'Dropped Out', 'Transferred']]

sns.set_style('whitegrid')

ax = ct_counts.plot(kind='bar', stacked=True,
                     color=['#0077BB', '#EE7733', '#009988'],
                     figsize=(8, 5),
                     edgecolor='white',
                     linewidth=1)

plt.xlabel('Program Type', fontsize=11)
plt.ylabel('Number of Clients', fontsize=11)
plt.title('Discharge Status by Program (Counts)', fontsize=13)
plt.legend(title='Discharge Status', bbox_to_anchor=(1.02, 1), loc='upper left')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

With counts, bar heights reflect group sizes. The Intensive bar is taller than Brief because more clients were served. This can be informative when you want to show both composition and volume.

Crosstabs: The Foundation for Categorical Analysis

Before visualizing, it’s useful to create a crosstab (cross-tabulation), which is simply a table showing the frequency of each combination of two categorical variables.

Python

import pandas as pd
import numpy as np

# Create mock data
np.random.seed(42)

data = pd.DataFrame({
    'referral_source': np.random.choice(
        ['Self', 'Family', 'Court', 'Hospital'],
        size=400, p=[0.30, 0.25, 0.25, 0.20]),
    'completed': np.random.choice(
        ['Yes', 'No'],
        size=400, p=[0.50, 0.50])
})

# Basic crosstab: counts
ct = pd.crosstab(data['referral_source'], data['completed'])
print("Counts:")
print(ct)
print()

# Crosstab with row proportions
ct_row = pd.crosstab(data['referral_source'], data['completed'], normalize='index')
print("Row proportions (what % of each referral source completed):")
print(ct_row.round(3))
print()

# Crosstab with column proportions
ct_col = pd.crosstab(data['referral_source'], data['completed'], normalize='columns')
print("Column proportions (what % of completers came from each source):")
print(ct_col.round(3))

Row proportions answer: “Of clients referred from court, what proportion completed treatment?” Column proportions answer: “Of clients who completed treatment, what proportion were court-referred?”

These are different questions, and the choice depends on what you’re trying to understand.
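
Whichever direction you normalize, it is worth checking the denominators first. One way to do that is to add row and column totals with the margins option of pd.crosstab(). The sketch below uses a tiny made-up dataset so the totals are easy to verify by hand.

```python
import pandas as pd

# Tiny made-up dataset (illustrative only).
data = pd.DataFrame({
    'referral_source': ['Court', 'Court', 'Court', 'Self', 'Self',
                        'Family', 'Family', 'Family'],
    'completed': ['Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes'],
})

# margins=True appends a totals row and a totals column.
ct_totals = pd.crosstab(data['referral_source'], data['completed'],
                        margins=True, margins_name='Total')
print(ct_totals)

# Row proportion by hand: of court-referred clients, what share completed?
court_rate = ct_totals.loc['Court', 'Yes'] / ct_totals.loc['Court', 'Total']
print(f"Court completion rate: {court_rate:.2f}")
```

Seeing that a row proportion rests on only three clients, as for Court here, is a useful reminder not to over-read it.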

Heatmaps for Crosstabs

When you have many categories, a heatmap can be more readable than a bar chart. The color intensity shows the magnitude of each cell.

Python

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Create mock data: service type by region
np.random.seed(42)

data = pd.DataFrame({
    'region': np.random.choice(
        ['North', 'South', 'East', 'West', 'Central'],
        size=500),
    'service_type': np.random.choice(
        ['Case Management', 'Counseling', 'Crisis Intervention', 
         'Housing Support', 'Employment Services'],
        size=500)
})

# Create crosstab
ct = pd.crosstab(data['service_type'], data['region'])

plt.figure(figsize=(9, 6))
sns.heatmap(ct, annot=True, fmt='d', cmap='Blues',
            linewidths=0.5, linecolor='white')

plt.xlabel('Region', fontsize=11)
plt.ylabel('Service Type', fontsize=11)
plt.title('Service Utilization by Region', fontsize=13)
plt.tight_layout()
plt.show()

The annot=True parameter displays the count in each cell. fmt='d' formats these as integers. The color gradient helps viewers quickly spot which combinations have high or low counts.

For proportions, use a normalized crosstab:

Python

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

np.random.seed(42)

data = pd.DataFrame({
    'region': np.random.choice(
        ['North', 'South', 'East', 'West', 'Central'],
        size=500),
    'service_type': np.random.choice(
        ['Case Management', 'Counseling', 'Crisis Intervention', 
         'Housing Support', 'Employment Services'],
        size=500)
})

# Normalized by column: within each region, what's the service distribution?
ct_norm = pd.crosstab(data['service_type'], data['region'], normalize='columns')

plt.figure(figsize=(9, 6))
sns.heatmap(ct_norm, annot=True, fmt='.1%', cmap='Blues',
            linewidths=0.5, linecolor='white')

plt.xlabel('Region', fontsize=11)
plt.ylabel('Service Type', fontsize=11)
plt.title('Service Distribution Within Each Region', fontsize=13)
plt.tight_layout()
plt.show()

The fmt='.1%' formats values as percentages with one decimal place.

Testing for Association: Chi-Square Test

When examining two categorical variables, a natural question is whether they’re associated. Does discharge status differ by program type, or are the differences we see just random variation?

The chi-square test of independence answers this question. It compares observed frequencies to what we’d expect if the two variables were independent.
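
To make "expected if independent" concrete, here is a minimal sketch that computes the expected counts and the chi-square statistic by hand for a made-up 2×2 table, then compares them with scipy. Each expected count is (row total × column total) / grand total, and the statistic is the sum of (observed − expected)² / expected over all cells. The observed numbers are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows are programs, columns are completed Yes/No.
observed = np.array([[90, 60],
                     [68, 82]])

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()
expected = row_totals * col_totals / n

# Chi-square statistic: sum of squared deviations scaled by expected counts.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# scipy's result without continuity correction, for comparison.
chi2_scipy, p_value, dof, expected_scipy = chi2_contingency(observed, correction=False)

print(f"Manual chi-square: {chi2_manual:.3f}")
print(f"scipy chi-square:  {chi2_scipy:.3f}")
```

The two values agree because correction=False turns off the continuity correction that scipy would otherwise apply to a 2×2 table.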

Python

import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

# Create mock data
np.random.seed(42)

data = pd.DataFrame({
    'program': np.repeat(['Intensive', 'Standard'], 150),
    'completed': np.concatenate([
        np.random.choice(['Yes', 'No'], size=150, p=[0.60, 0.40]),
        np.random.choice(['Yes', 'No'], size=150, p=[0.45, 0.55])
    ])
})

# Create contingency table
ct = pd.crosstab(data['program'], data['completed'])
print("Contingency Table:")
print(ct)
print()

# Perform chi-square test
chi2, p_value, dof, expected = chi2_contingency(ct)

print(f"Chi-square statistic: {chi2:.3f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value:.4f}")
print()
print("Expected frequencies (if independent):")
print(pd.DataFrame(expected, 
                   index=ct.index, 
                   columns=ct.columns).round(1))

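A small p-value (typically < 0.05) suggests the variables are associated. The expected frequencies show what we’d see if program type had no relationship with completion. Note that for 2×2 tables like this one, chi2_contingency() applies Yates’ continuity correction by default; pass correction=False to get the uncorrected statistic.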

Note that chi-square tells you whether there is evidence of an association, not how strong it is or what direction it takes. For that, you need to look at the actual proportions or use measures like Cramér’s V.
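
Cramér’s V rescales the chi-square statistic to a 0 to 1 range: V = sqrt(chi² / (n × (min(rows, columns) − 1))). The helper below is a minimal sketch of that formula; the function name cramers_v and the example tables are assumptions made for illustration, and it uses the uncorrected chi-square statistic.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V for a contingency table (illustrative helper, no bias correction)."""
    table = np.asarray(table)
    # Uncorrected chi-square statistic (first element of the result).
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Hypothetical tables: no association, a modest association, a perfect association.
v_none = cramers_v([[25, 25], [25, 25]])
v_modest = cramers_v([[90, 60], [68, 82]])
v_perfect = cramers_v([[50, 0], [0, 50]])

print(f"No association:      {v_none:.3f}")
print(f"Modest association:  {v_modest:.3f}")
print(f"Perfect association: {v_perfect:.3f}")
```

Values near 0 indicate little association and values near 1 a very strong one; what counts as "small" or "large" in between depends on the field and the table size.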

A Note on Sample Size and Statistical Significance. Statistical significance depends heavily on sample size. With a large enough sample, even tiny, practically meaningless differences become “statistically significant.” Conversely, with small samples, real and meaningful differences might not reach significance. When interpreting chi-square tests or any statistical test with categorical data, always look at the actual proportions. A statistically significant association between program type and completion is only meaningful if the difference in completion rates (say, 60% vs. 45%) matters for your clients and your organization.
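
The effect of sample size is easy to see with two made-up tables that have exactly the same completion rates (52% vs. 48%) but differ a hundredfold in size. This is only a sketch with hypothetical counts:

```python
from scipy.stats import chi2_contingency

# Same proportions (52% vs. 48% completed), different sample sizes.
small = [[52, 48],
         [48, 52]]          # 100 clients per program
large = [[5200, 4800],
         [4800, 5200]]      # 10,000 clients per program

# The p-value is the second element of the result.
p_small = chi2_contingency(small)[1]
p_large = chi2_contingency(large)[1]

print(f"n = 200:    p = {p_small:.4f}")
print(f"n = 20,000: p = {p_large:.2e}")
```

The four-point difference in completion rates is identical in both tables; only the amount of data, and therefore the p-value, changes.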

Choosing the Right Visualization

Here’s a guide for selecting the appropriate visualization for your categorical data:

Question	Visualization
How many observations in each category?	countplot or bar chart
How do category counts compare across groups?	Grouped bar chart (countplot with hue)
What proportion of each group falls into each category?	Normalized grouped bar chart or 100% stacked bar
How does composition differ across groups?	Stacked bar chart
How are two categorical variables related? (many categories)	Heatmap of crosstab
Where is the association concentrated?	Heatmap of residuals
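
The last row refers to a residuals heatmap. A minimal sketch of how the residuals could be computed is shown here, assuming Pearson residuals, (observed − expected) / sqrt(expected), and a hypothetical 2×2 table.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table (illustrative counts).
ct = pd.DataFrame([[60, 40],
                   [40, 60]],
                  index=['Intensive', 'Standard'],
                  columns=['Yes', 'No'])

# Expected counts under independence are the fourth element of the result.
expected = chi2_contingency(ct, correction=False)[3]

# Pearson residuals: positive where a cell has more cases than expected,
# negative where it has fewer.
residuals = (ct - expected) / np.sqrt(expected)
print(residuals.round(2))
```

The residuals table can be passed to sns.heatmap() in place of the crosstab, for example with annot=True, fmt='.2f', a diverging colormap such as cmap='RdBu_r', and center=0 so that over- and under-represented cells get opposite colors.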

Resources

Seaborn categorical tutorial: https://seaborn.pydata.org/tutorial/categorical.html

Seaborn countplot documentation: https://seaborn.pydata.org/generated/seaborn.countplot.html

Pandas crosstab documentation: https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html

Scipy chi-square test: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

Statsmodels mosaic plot (for advanced categorical visualization): https://www.statsmodels.org/stable/generated/statsmodels.graphics.mosaicplot.mosaic.html

January 5, 2026
