[Python] GIS (2): Working with the Census API in Python / American Community Survey datasets

Most geographic mapping in social science research requires demographic data: who lives where and what their socioeconomic circumstances are. The U.S. Census Bureau collects this information through the American Community Survey and makes it freely available via an Application Programming Interface (API).

This ability to connect service locations with population characteristics reflects a core principle in social work: understanding context. The ecological systems framework (Bronfenbrenner, 1979) positions neighborhood and community characteristics as critical determinants of individual outcomes. More recently, scholars have documented how spatial mismatch between service supply and population need contributes to health disparities (See my paper here). Census data provides the empirical foundation for this kind of spatial equity analysis.

Social work research has increasingly recognized the role of place in shaping outcomes. Sampson’s work on neighborhood effects (2012) demonstrated that community-level factors like concentrated poverty and social cohesion independently predict health and wellbeing beyond individual characteristics. Scholars studying mental health disparities have found that geographic access to care, measured in travel time or distance, significantly predicts service utilization, particularly for low-income populations and communities of color (Cummings et al., 2013).

This post covers how to access Census data programmatically using Python. By the end, you’ll be able to request specific variables for specific geographies, merge the results with shapefiles, and create community profiles that inform needs assessments and program planning.

Introduction to Census Data

The Census Bureau conducts two main data collection efforts relevant to this research.

  1. The Decennial Census, conducted every ten years, attempts to count every person living in the United States. It provides the most accurate population counts but asks only a handful of questions.
  2. The American Community Survey (ACS), conducted continuously and released annually, samples a portion of the population but asks detailed questions about income, education, employment, language use, disability status, housing, and more.

For social work research, the ACS is typically more useful because it contains the variables we care about: poverty rates, health insurance coverage, English proficiency, educational attainment, and similar indicators. The tradeoff is that ACS data comes with margins of error because it’s based on a sample rather than a complete count. For small geographies like census tracts, these margins of error can be substantial (Spielman et al., 2014), something to keep in mind when interpreting results.

The ACS releases data in two forms: 1-year and 5-year estimates (3-year estimates were discontinued after the 2011-2013 release). Each represents a tradeoff between currency and reliability.

| Estimate Type | Data Collection Period | Available For | Best Used When |
|---|---|---|---|
| 1-year | Single year | Areas with 65,000+ population | You need the most current data and are analyzing large geographies (states, large counties, big cities) |
| 5-year | Five years combined | All areas, including census tracts | You need small-area data (tracts, block groups) or are analyzing smaller communities |

For neighborhood-level analysis using census tracts, the 5-year estimates are your only option. The tradeoff is that 5-year estimates blend data across a longer time period, which can obscure recent changes. If you’re studying a neighborhood that gentrified rapidly, the 5-year estimate smooths across pre- and post-gentrification conditions.

The margins of error also differ. Spielman and colleagues (2014) found that ACS margins of error at the tract level can be substantial, sometimes exceeding 20-30% of the estimate itself for small populations. When comparing tracts, always check whether differences exceed the margins of error before drawing conclusions.
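
That check can be scripted. Below is a minimal sketch using the standard ACS approximation (published margins of error are at the 90% confidence level, and the MOE of a difference is the square root of the sum of the squared MOEs); the dollar figures are illustrative, not real tract values:

```python
import math

def moe_of_difference(moe_a, moe_b):
    """Approximate 90% margin of error for the difference of two ACS estimates."""
    return math.sqrt(moe_a**2 + moe_b**2)

def difference_is_significant(est_a, moe_a, est_b, moe_b):
    """True if the difference between two estimates exceeds its combined MOE."""
    return abs(est_a - est_b) > moe_of_difference(moe_a, moe_b)

# Tract A: median income $52,000 +/- $8,000; Tract B: $60,000 +/- $9,000
print(difference_is_significant(52000, 8000, 60000, 9000))
# False: the $8,000 difference is smaller than the combined MOE of ~$12,042
```

Here the apparent $8,000 income gap between the two tracts is smaller than the uncertainty of the comparison, so it would be a mistake to describe one tract as richer than the other.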

Census data is organized hierarchically by geography. The nation contains states, states contain counties, counties contain census tracts, and tracts contain block groups. Each level nests perfectly within the level above it. Census tracts are designed to contain roughly 4,000 people and are the most common unit for neighborhood-level analysis. They’re small enough to capture local variation but large enough to have reasonably reliable ACS estimates.

Setting Up Census API Access

To use the Census API, you need a free API key. Visit https://api.census.gov/data/key_signup.html and request one. You’ll receive the key by email within a few minutes.

In Python, the census package provides a clean interface to the API. Install it along with us, a helper package for state codes:

Python
!pip install census us

Then set up your connection:

Python
from census import Census
import us

# Replace with your actual API key
api_key = 'YOUR_API_KEY_HERE'
c = Census(api_key)

Understanding Census Variable Codes

Census data is organized into thousands of variables, each with a unique code. Finding the right code is often the hardest part of working with Census data. The codes follow a pattern: a prefix indicating the table, followed by numbers indicating the specific variable.

For ACS data, common prefixes include B01001 (age and sex), B19013 (median household income), B17001 (poverty status), and C27001 (health insurance coverage). The letter prefix indicates the table type: B tables (Base) provide the most detailed breakdowns, C tables (Collapsed) combine categories for smaller margins of error, and S tables (Subject) present pre-calculated percentages and summary statistics. For most research purposes, B tables offer the granularity you need, but C tables are useful when sample sizes are small.

The full variable code includes an underscore and a suffix, like B19013_001E, where E indicates an estimate (as opposed to M for margin of error).

The Census Bureau provides a complete list of variables at https://api.census.gov/data/2023/acs/acs5/variables.html (change the year in the URL as needed). This page is searchable and includes descriptions of what each variable measures. If it loads too slowly, you can download and use this Excel sheet.
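
You can also search the variable list programmatically. The sketch below filters variables by a keyword in their label or concept; the sample dictionary is abridged for illustration (real labels are longer), and in practice you would fetch the full machine-readable list once:

```python
def search_variables(variables, keyword):
    """Return variable codes whose label or concept mentions the keyword."""
    keyword = keyword.lower()
    return {
        code: meta.get("label", "")
        for code, meta in variables.items()
        if keyword in meta.get("label", "").lower()
        or keyword in meta.get("concept", "").lower()
    }

# In practice, fetch the full list once (a few MB of JSON):
#   import requests
#   url = "https://api.census.gov/data/2023/acs/acs5/variables.json"
#   variables = requests.get(url).json()["variables"]

# Abridged sample for illustration:
sample = {
    "B19013_001E": {"label": "Estimate!!Median household income",
                    "concept": "MEDIAN HOUSEHOLD INCOME"},
    "B17001_002E": {"label": "Estimate!!Total:!!Income below poverty level",
                    "concept": "POVERTY STATUS"},
}
print(search_variables(sample, "poverty"))
# {'B17001_002E': 'Estimate!!Total:!!Income below poverty level'}
```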

Here are some variables commonly used in social science research:

  • Total population (B01001_001E)
  • Median household income (B19013_001E)
  • Population below poverty level (B17001_002E)
  • No health insurance coverage (table C27001)
  • Renter-occupied housing units (B25003_003E)

Requesting Data for Specific Geographies

Let’s request median household income for all census tracts in Washtenaw County, Michigan. Each geography requires specific codes: Michigan’s state FIPS code is 26, and Washtenaw County’s code is 161.

Python
from census import Census

api_key = 'YOUR_API_KEY_HERE'
c = Census(api_key, year=2022)  # pin the ACS vintage; defaults to the latest otherwise

# Request median household income for Washtenaw County census tracts
# Using 2022 ACS 5-year estimates
washtenaw_income = c.acs5.state_county_tract(
    fields=['NAME', 'B19013_001E'],
    state_fips='26',
    county_fips='161',
    tract='*'  # asterisk means "all tracts"
)

# Convert to DataFrame
import pandas as pd
df = pd.DataFrame(washtenaw_income)
# Rename by key rather than by position; 'state', 'county', and 'tract' come back as-is
df = df.rename(columns={'NAME': 'name', 'B19013_001E': 'median_income'})
print(df.head())

The state_county_tract method requests data at the tract level. The fields parameter specifies which variables to retrieve. The asterisk in tract='*' means “give me all tracts in this county.”

The result is a list of dictionaries that we convert to a pandas DataFrame. Each row represents one census tract, with columns for the tract name, the requested variable, and geographic identifiers.

Building a GEOID for Merging with Shapefiles

Census shapefiles identify each geographic unit with a GEOID, a concatenation of state, county, and tract codes. To merge our Census data with a shapefile, we need to construct matching GEOIDs:

Python
# Create GEOID by concatenating state + county + tract
df['GEOID'] = df['state'] + df['county'] + df['tract']
print(df[['name', 'median_income', 'GEOID']].head())

The GEOID for a tract in Washtenaw County looks like “26161400100”, where 26 is Michigan, 161 is Washtenaw County, and 400100 is the tract number.
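
Going the other direction is sometimes useful, for instance when a shapefile gives you only the combined GEOID. A small helper that exploits the fixed 2 + 3 + 6 digit layout of tract GEOIDs:

```python
def parse_tract_geoid(geoid):
    """Split an 11-digit census tract GEOID into state, county, and tract codes."""
    assert len(geoid) == 11, "tract GEOIDs are 11 characters"
    return {"state": geoid[:2], "county": geoid[2:5], "tract": geoid[5:]}

print(parse_tract_geoid("26161400100"))
# {'state': '26', 'county': '161', 'tract': '400100'}
```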

Merging Census Data with Shapefiles

Now we can combine our Census data with geographic boundaries:

Python
import geopandas as gpd

# Download tract boundaries for Michigan
tracts_url = "https://www2.census.gov/geo/tiger/TIGER2023/TRACT/tl_2023_26_tract.zip"
michigan_tracts = gpd.read_file(tracts_url)

# Filter to Washtenaw County
washtenaw_tracts = michigan_tracts[michigan_tracts['COUNTYFP'] == '161'].copy()

# Merge Census data with shapefile
# The shapefile uses 'GEOID' as the column name
washtenaw_merged = washtenaw_tracts.merge(df, on='GEOID', how='left')

print(washtenaw_merged[['GEOID', 'median_income']].head())

The merge function joins the two datasets on the GEOID column. We use how='left' to keep all tracts from the shapefile even if some don’t have matching Census data.
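
It is worth verifying that the merge actually matched. pandas’ indicator=True flags rows that found no partner; here is a sketch with toy stand-ins for the shapefile and Census tables:

```python
import pandas as pd

# Toy stand-ins for the shapefile and the Census pull
shapes = pd.DataFrame({"GEOID": ["26161400100", "26161400200", "26161400300"]})
census = pd.DataFrame({"GEOID": ["26161400100", "26161400200"],
                       "median_income": [52000, 61000]})

# indicator=True adds a '_merge' column recording each row's match status
merged = shapes.merge(census, on="GEOID", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} tract(s) had no Census match: {unmatched['GEOID'].tolist()}")
# 1 tract(s) had no Census match: ['26161400300']
```

A handful of unmatched tracts usually points to mismatched vintages (e.g. 2023 boundaries against pre-2020 tract codes) or GEOIDs that lost their leading zeros.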

Creating a Community Data Profile

Let’s build a more complete community profile by requesting multiple variables:

Python
from census import Census
import pandas as pd
import geopandas as gpd

api_key = 'YOUR_API_KEY_HERE'
c = Census(api_key)

# Request multiple variables for Detroit metro counties
# Wayne=163, Oakland=125, Macomb=099, Washtenaw=161
metro_data = []

for county_fips in ['163', '125', '099', '161']:
    data = c.acs5.state_county_tract(
        fields=[
            'NAME',
            'B01001_001E',  # Total population
            'B19013_001E',  # Median household income
            'B17001_002E',  # Population below poverty
            'B06007_002E',  # Speak only English
            'B06007_003E',  # Speak Spanish
        ],
        state_fips='26',
        county_fips=county_fips,
        tract='*'
    )
    metro_data.extend(data)

# Convert to DataFrame; rename by key rather than by position
df = pd.DataFrame(metro_data)
df = df.rename(columns={
    'NAME': 'name',
    'B01001_001E': 'total_pop',
    'B19013_001E': 'median_income',
    'B17001_002E': 'poverty_pop',
    'B06007_002E': 'english_only',
    'B06007_003E': 'spanish_speakers',
})

# Create GEOID
df['GEOID'] = df['state'] + df['county'] + df['tract']

# Convert to numeric
for col in ['total_pop', 'median_income', 'poverty_pop', 'english_only', 'spanish_speakers']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Calculate derived measures
# (Approximation: the poverty and language tables have their own universe
#  totals, B17001_001E and B06007_001E, which are the strictly correct denominators)
df['poverty_rate'] = (df['poverty_pop'] / df['total_pop'] * 100).round(1)
df['non_english'] = df['total_pop'] - df['english_only']
df['non_english_rate'] = (df['non_english'] / df['total_pop'] * 100).round(1)
df['spanish_rate'] = (df['spanish_speakers'] / df['total_pop'] * 100).round(1)

print(df[['name', 'total_pop', 'median_income', 'poverty_rate', 'non_english_rate', 'spanish_rate']].head(10))

This code loops through four counties in the Detroit metro area, requesting population, income, poverty, and language data for each tract. We then calculate derived measures like the poverty rate and the percentage of residents who speak a language other than English at home, a common proxy for limited English proficiency (LEP).

Visualizing Census Data on Maps

Now we can create choropleth maps showing how these variables are distributed:

Python
import geopandas as gpd
import matplotlib.pyplot as plt
import contextily as cx

# Load and filter tract boundaries
tracts_url = "https://www2.census.gov/geo/tiger/TIGER2023/TRACT/tl_2023_26_tract.zip"
michigan_tracts = gpd.read_file(tracts_url)

metro_counties = ['163', '125', '099', '161']
metro_tracts = michigan_tracts[michigan_tracts['COUNTYFP'].isin(metro_counties)].copy()

# Merge with Census data
metro_merged = metro_tracts.merge(df, on='GEOID', how='left')

# Handle missing and invalid values
# Census uses negative values (like -666666666) to indicate missing data
metro_merged['median_income'] = pd.to_numeric(metro_merged['median_income'], errors='coerce')
metro_merged.loc[metro_merged['median_income'] < 0, 'median_income'] = None

# Convert to Web Mercator for basemap
metro_merged = metro_merged.to_crs(epsg=3857)

# Create map of median household income
fig, ax = plt.subplots(figsize=(12, 12))

metro_merged.plot(
    column='median_income',
    ax=ax,
    legend=True,
    legend_kwds={'label': 'Median Household Income ($)', 'shrink': 0.6},
    cmap='Blues',
    edgecolor='white',
    linewidth=0.2,
    missing_kwds={'color': 'lightgray', 'label': 'No data'}
)

cx.add_basemap(ax, source=cx.providers.CartoDB.PositronNoLabels, alpha=0.5)

ax.set_title('Median Household Income by Census Tract\nDetroit Metro Area (2022 ACS 5-Year Estimates)', fontsize=14)
ax.set_axis_off()
plt.tight_layout()
plt.show()

A few notes on the code above. The Census Bureau uses special codes like -666666666 to indicate that data is unavailable for a tract, often because the sample size is too small for a reliable estimate. We convert these to None so they display as gray (“No data”) rather than throwing off the color scale. The missing_kwds parameter in GeoPandas controls how these missing values appear on the map.

The column parameter tells GeoPandas which variable to use for coloring the map. The cmap parameter specifies the color scheme. The missing_kwds parameter handles tracts where data is missing.
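
One caveat: a continuous color ramp can hide where the meaningful cutoffs fall. GeoPandas’ plot also accepts a scheme parameter (e.g. scheme='quantiles', which requires the mapclassify package) to bin values into discrete classes. The same idea in plain pandas, using qcut to bucket tracts into income quartiles with toy values for illustration:

```python
import pandas as pd

# Toy median-income values for eight tracts
incomes = pd.Series([31000, 42000, 55000, 68000, 90000, 120000, 47000, 76000])

# Quantile bins put the same number of tracts in each color class
quartile = pd.qcut(incomes, 4, labels=["Q1 (lowest)", "Q2", "Q3", "Q4 (highest)"])
print(quartile.value_counts().sort_index())
```

Quantile classes are robust to the skew typical of income data, where a few very wealthy tracts would otherwise compress everyone else into the bottom of the color scale.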

Collecting Data Across Multiple Years

The examples above request data for a single year. But research often requires tracking change over time: how has poverty shifted? Are language access gaps widening or narrowing? The Census API lets you request historical ACS data by changing the year in the request URL, and a simple Python for loop automates the rest.

Here’s how to collect state-level SNAP (food stamps) data across multiple years using the requests library directly:

Python
import requests
import pandas as pd

API_KEY = 'YOUR_API_KEY_HERE'
YEARS = range(2013, 2023)  # ACS 5-year available from 2009
VARIABLES = "NAME,B22001_002E,B01001_001E"  # households receiving SNAP, total population

all_data = []

for year in YEARS:
    url = f"https://api.census.gov/data/{year}/acs/acs5"
    params = {
        "get": VARIABLES,
        "for": "state:*",
        "key": API_KEY
    }
    
    response = requests.get(url, params=params)
    
    if response.status_code == 200:
        data = response.json()
        df = pd.DataFrame(data[1:], columns=data[0])
        df["year"] = year
        all_data.append(df)
        print(f"{year} done")
    else:
        print(f"{year} failed: {response.status_code}")

# Combine all years
result = pd.concat(all_data, ignore_index=True)

# Clean up column names and types
result = result.rename(columns={
    "NAME": "state_name",
    "state": "state_fips",
    "B22001_002E": "snap_recipients",
    "B01001_001E": "total_pop"
})
result[["snap_recipients", "total_pop"]] = result[["snap_recipients", "total_pop"]].apply(pd.to_numeric)
result["pct_snap"] = (result["snap_recipients"] / result["total_pop"] * 100).round(2)

# Reorder columns
result = result[["year", "state_fips", "state_name", "total_pop", "snap_recipients", "pct_snap"]]

print(result.head(10))

Output:

   year state_fips      state_name  total_pop  snap_recipients  pct_snap
0  2013         01         Alabama    4799918           856692     17.85
1  2013         02          Alaska     720936            72633     10.07
2  2013         04         Arizona    6479703           997711     15.40
3  2013         05        Arkansas    2933369           457716     15.60
4  2013         06      California   37966138          4378304     11.53
5  2013         08        Colorado    5183562           479875      9.26
6  2013         09     Connecticut    3583561           392792     10.96
7  2013         10        Delaware     916229           126829     13.84
8  2013         11  Washington DC     633736            99470     15.70
9  2013         12         Florida   19259543          3290845     17.09

This approach uses the requests library directly instead of the census package, which gives you more control over the API URL structure. The pattern https://api.census.gov/data/{year}/acs/acs5 lets you swap in any year.

A note on year availability:

| Dataset | Years Available | Notes |
|---|---|---|
| ACS 5-Year | 2009-present | Most complete coverage for small areas |
| ACS 1-Year | 2005-present | Only areas with 65,000+ population |
| Decennial Census | 2000, 2010, 2020 | Complete count, limited variables |

With time series data, you can create trend visualizations, identify communities where conditions are improving or worsening, and provide temporal context for current disparities.
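
For example, pivoting the long table into one column per state sets up a quick trend plot. A sketch using a toy frame shaped like the `result` table above (the numbers are illustrative, not real estimates):

```python
import pandas as pd

# Toy frame shaped like the `result` table (illustrative values)
toy = pd.DataFrame({
    "year": [2013, 2014, 2013, 2014],
    "state_name": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "pct_snap": [17.85, 17.60, 10.07, 10.30],
})

# One row per year, one column per state
trend = toy.pivot(index="year", columns="state_name", values="pct_snap")
print(trend)

# trend.plot(ylabel="% receiving SNAP") would then draw one line per state
```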

Overlaying Facility Locations on Demographic Data

The real power of combining Census data with facility locations is the ability to identify mismatches between where services are and where they’re needed. This approach operationalizes what researchers call “spatial accessibility” (Guagliardo, 2004). Studies of mental health service distribution have consistently found that facilities tend to cluster in higher-income areas, even when need is greater in lower-income communities (Cook et al., 2017).

Python
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import contextily as cx

# Load facility data
facilities = pd.read_csv('MI_MH_facilities_detroit_metro.csv')
facilities_gdf = gpd.GeoDataFrame(
    facilities,
    geometry=gpd.points_from_xy(facilities.longitude, facilities.latitude),
    crs="EPSG:4326"
).to_crs(epsg=3857)

# Filter to facilities with Spanish services
spanish_facilities = facilities_gdf[facilities_gdf['spanish'] == 1]

# Create map showing Spanish-speaking population with Spanish-service facilities
fig, ax = plt.subplots(figsize=(12, 12))

# Layer 1: Tracts colored by number of Spanish speakers at home
metro_merged.plot(
    column='spanish_speakers',
    ax=ax,
    legend=True,
    legend_kwds={'label': 'Spanish Speakers at Home', 'shrink': 0.6},
    cmap='Oranges',
    edgecolor='white',
    linewidth=0.2,
    missing_kwds={'color': 'lightgray'}
)

# Layer 2: Facilities with Spanish services
spanish_facilities.plot(
    ax=ax,
    color='#2563eb',
    markersize=100,
    edgecolor='white',
    linewidth=1.5,
    zorder=5,
    label='Facilities with Spanish services'
)

cx.add_basemap(ax, source=cx.providers.CartoDB.PositronNoLabels, alpha=0.4)

ax.set_title('Spanish Speakers at Home and Mental Health Facilities\nwith Spanish Language Services', fontsize=14)
ax.legend(loc='lower right', frameon=True, facecolor='white')
ax.set_axis_off()
plt.tight_layout()
plt.show()

This map overlays facilities offering Spanish-language services on top of census tracts shaded by their Spanish-speaking population. If blue points cluster in lightly-shaded areas while heavily-shaded areas have few points, we have evidence of a service mismatch.

Handling Common Issues

  1. Missing data. Some census tracts have missing values, often because the population is too small for reliable estimates or the tract contains only group quarters (prisons, nursing homes). Use the missing_kwds parameter in GeoPandas to handle these gracefully.
  2. Margins of error. ACS estimates come with uncertainty. For any variable ending in E (estimate), there’s a corresponding variable ending in M (margin of error). When differences between areas are small, check whether they exceed the margins of error before drawing conclusions.
  3. Variable availability. Not all variables are available at all geographic levels. Some detailed tables are only available at the county or state level. The API will return an error if you request an unavailable variable.
  4. Rate limits. The Census API has rate limits. If you’re making many requests, add delays between calls or request multiple variables in a single call rather than making separate calls for each variable.
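
On point 4: rather than one call per variable, batch variables into as few calls as possible. The API caps the number of variables per request (commonly cited as 50), so a small chunking helper keeps long lists legal; the 48-per-chunk size below is a cautious assumption, not an official constant:

```python
def chunked(variables, size=48):
    """Split a long variable list into API-sized chunks."""
    for i in range(0, len(variables), size):
        yield variables[i:i + size]

codes = [f"B01001_{i:03d}E" for i in range(1, 50)]  # the 49 cells of the age/sex table
print([len(chunk) for chunk in chunked(codes)])  # [48, 1]
```

Each chunk then becomes one API call (joined with commas in the get parameter), after which the per-chunk DataFrames can be merged on GEOID.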

Resources

The Census Bureau’s API documentation (https://www.census.gov/data/developers/data-sets.html) lists all available datasets and their variables.

The census package documentation (https://github.com/datamade/census) provides examples for different types of geographic queries.

For finding variable codes, the Census Reporter website (https://censusreporter.org/) offers a more user-friendly interface than the official variable list.

Social Explorer (https://www.socialexplorer.com/) provides an alternative interface for exploring Census data, with some features available for free.

January 9, 2026