[Python] GIS (1): Loading Geographic Data using geopandas

In this post, we will learn how to load and visualize spatial data in Python 🗺️

Setting Up Your Environment

We’ll use several Python libraries for geographic data. In Google Colab, you can install them by running:

Python
!pip install geopandas geopy

GeoPandas extends the pandas library you already know to handle geographic data. It lets you read spatial files, manipulate geometries, and create maps using familiar DataFrame operations. Geopy provides geocoding capabilities to convert addresses to coordinates.

Let’s import the libraries we need:

Python
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

Loading Geographic Data

Let’s start with the simplest case: a CSV file containing coordinates. Imagine we have a file listing community mental health centers with their locations:

Python
# Create sample data - mental health centers in Washtenaw County
data = {
    'center_name': ['Washtenaw Community Health', 'Downtown Ann Arbor Clinic', 
                    'Ypsilanti Family Services', 'Saline Counseling Center'],
    'services': ['outpatient', 'crisis', 'outpatient', 'intensive'],
    'languages': ['English, Spanish', 'English', 'English, Arabic', 'English, Spanish'],
    'latitude': [42.28, 42.28, 42.24, 42.17],
    'longitude': [-83.74, -83.75, -83.61, -83.78]
}

df = pd.DataFrame(data)
print(df)

This gives us a regular pandas DataFrame. To convert it to a GeoDataFrame that understands the coordinates are locations, we use:

Python
# Convert to GeoDataFrame
gdf = gpd.GeoDataFrame(
    df, 
    geometry=gpd.points_from_xy(df.longitude, df.latitude),
    crs="EPSG:4326"
)
print(gdf)

Notice that points_from_xy takes longitude first, then latitude. This trips people up constantly. The function uses (x, y) order, and on a standard map, x is horizontal (longitude) and y is vertical (latitude).

The crs="EPSG:4326" part tells GeoPandas these are standard GPS coordinates. You’ll see a new column called geometry that contains Point objects representing each location.

Now let’s load a shapefile. The Census Bureau provides boundary files for various geographies. Here’s how to load boundary data:

Python
# Download county boundaries from the Census Bureau
# This URL points to Michigan counties; adjust the state FIPS code as needed
url = "https://www2.census.gov/geo/tiger/TIGER2023/COUNTY/tl_2023_us_county.zip"
counties = gpd.read_file(url)

# Filter to Michigan (STATEFP == '26')
michigan_counties = counties[counties['STATEFP'] == '26']
print(michigan_counties.head())

With GeoDataFrames, all the pandas operations you know still work. You can filter, group, merge, and compute new columns just like with regular DataFrames.

Creating Your First Map

GeoPandas includes a .plot() method that creates maps directly from GeoDataFrames:

Python
# Simple plot of our service locations
fig, ax = plt.subplots(figsize=(10, 8))
gdf.plot(ax=ax, color='red', markersize=100)
ax.set_title('Mental Health Centers in Washtenaw County')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.show()

This creates a basic scatter plot of our points.

Not very informative on its own because there’s no context. Points floating in empty space don’t tell us much. We need a base layer of map to show where these points fall.

Thinking in Layers

Maps are built by stacking layers on top of each other, like transparencies on an overhead projector (if you remember those) or layers in Photoshop. Each layer contains one type of geographic information, and you control which layers appear and in what order.

A typical map might have these layers from bottom to top: a base layer showing land and water, a boundary layer showing county or tract outlines, a polygon layer showing areas shaded by some variable, and a point layer showing specific locations. The order matters because layers on top can obscure layers beneath them.

In Python with GeoPandas, you control layering by plotting to the same axes object. The ax parameter is the key. Let’s build a complete example using real data from the Substance Abuse and Mental Health Services Administration (SAMHSA) National Directory of Mental Health Treatment Facilities. We are going to use the subset of the data on Detroit Metropolitan area, which is already geocoded and processed for Spanish & Arabic language staff availability variables. You can download the data here:

Python
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import contextily as cx

# Load mental health facility data from SAMHSA National Directory
# This dataset contains facilities in the Detroit metropolitan area
facilities = pd.read_csv('MI_MH_facilities_detroit_metro.csv')

# Convert to GeoDataFrame
facilities_gdf = gpd.GeoDataFrame(
    facilities,
    geometry=gpd.points_from_xy(facilities.longitude, facilities.latitude),
    crs="EPSG:4326"
)

# Download census tract boundaries for Michigan
tracts_url = "https://www2.census.gov/geo/tiger/TIGER2023/TRACT/tl_2023_26_tract.zip"
michigan_tracts = gpd.read_file(tracts_url)

# Filter to Detroit metro counties (Wayne=163, Oakland=125, Macomb=099, Washtenaw=161)
metro_counties = ['163', '125', '099', '161']
metro_tracts = michigan_tracts[michigan_tracts['COUNTYFP'].isin(metro_counties)].copy()

# Convert to Web Mercator projection (required for contextily basemap)
metro_tracts = metro_tracts.to_crs(epsg=3857)
facilities_gdf = facilities_gdf.to_crs(epsg=3857)

# Build the layered map
fig, ax = plt.subplots(figsize=(12, 12))

# Set map extent based on tract bounds
minx, miny, maxx, maxy = metro_tracts.total_bounds
padding = 2000
ax.set_xlim(minx - padding, maxx + padding)
ax.set_ylim(miny - padding, maxy + padding)

# Layer 1: Basemap tiles
cx.add_basemap(ax, source=cx.providers.CartoDB.Positron)

# Layer 2: County boundaries (dissolve tracts into counties)
metro_tracts.dissolve(by='COUNTYFP').plot(ax=ax, facecolor='none', edgecolor='#333333', linewidth=1.5)

# Layer 3: Facility locations as points
facilities_gdf.plot(ax=ax, color='darkblue', markersize=50, zorder=5, edgecolor='white', linewidth=1)

ax.set_title('Mental Health Facilities in Detroit Metro Area\n(Source: SAMHSA National Directory)', fontsize=14)
ax.set_axis_off()
plt.tight_layout()
plt.show()

Every .plot() call that uses the same ax adds to the same map. If you forget the ax=ax parameter, Python creates a

new figure for each layer, which isn’t what you want.

The contextily library adds real map tiles (roads, water, parks) as a background layer. The key step is converting your data to Web Mercator projection (EPSG:3857) before adding the basemap, because that’s the projection web map tiles use.

The alpha parameter controls transparency, which is useful when you want to see through upper layers to what’s beneath. The zorder parameter can explicitly control which layer appears on top when the automatic ordering doesn’t work the way you need.

Visualizing Differences

Let’s create a more useful visualization that shows which centers offer services in different languages:

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import contextily as cx

# Load mental health facility data from SAMHSA National Directory
facilities = pd.read_csv('MI_MH_facilities_detroit_metro.csv')

# Convert to GeoDataFrame
facilities_gdf = gpd.GeoDataFrame(
    facilities,
    geometry=gpd.points_from_xy(facilities.longitude, facilities.latitude),
    crs="EPSG:4326"
)

# Convert to Web Mercator for basemap
facilities_gdf = facilities_gdf.to_crs(epsg=3857)

# Create language service categories
def categorize_language(row):
    if row['spanish'] == 1 and row['arabic'] == 1:
        return 'Both Spanish & Arabic'
    elif row['spanish'] == 1:
        return 'Spanish only'
    elif row['arabic'] == 1:
        return 'Arabic only'
    else:
        return 'No language services'

facilities_gdf['language_category'] = facilities_gdf.apply(categorize_language, axis=1)

# Create a map showing language services
fig, ax = plt.subplots(figsize=(12, 12))

# Get bounds and set extent
minx, miny, maxx, maxy = facilities_gdf.total_bounds
ax.set_xlim(minx - 5000, maxx + 5000)
ax.set_ylim(miny - 5000, maxy + 5000)

# Layer 1: Basemap
cx.add_basemap(ax, source=cx.providers.CartoDB.Positron)

# Define colors for each category
colors = {
    'No language services': 'gray',
    'Spanish only': '#2563eb',      # blue
    'Arabic only': '#16a34a',        # green
    'Both Spanish & Arabic': '#dc2626'  # red
}

# Plot each category
for category, color in colors.items():
    subset = facilities_gdf[facilities_gdf['language_category'] == category]
    if len(subset) > 0:
        size = 60 if category == 'No language services' else 100
        subset.plot(ax=ax, color=color, markersize=size, label=category,
                   edgecolor='white', linewidth=1, zorder=5 if category == 'No language services' else 6)

ax.set_title('Mental Health Facilities by Language Services\n(Source: SAMHSA National Directory)', fontsize=14)
ax.legend(loc='lower right', frameon=True, facecolor='white', framealpha=0.9)
ax.set_axis_off()
plt.tight_layout()
plt.show()

This kind of visualization can immediately show gaps and we can identify a potential service desert.

Common Mistakes and How to Avoid Them

  1. Coordinates in the wrong order. If your points show up in the middle of the ocean or on the wrong continent, check whether you swapped latitude and longitude. Remember: points_from_xy(longitude, latitude) with longitude first.
  2. Missing CRS information. If you see warnings about CRS mismatch or your points don’t overlay correctly on boundaries, check that all your GeoDataFrames use the same coordinate reference system. You can check with gdf.crs and convert with gdf.to_crs("EPSG:4326").
  3. File path issues with shapefiles. When loading a shapefile, point to the .shp file, but make sure all the companion files (.dbf, .shx, etc.) are in the same folder. If GeoPandas can’t find them, you’ll get errors.
  4. Memory issues with large files. Detailed boundary files (like census block groups for an entire state) can be large. Start with smaller geographic areas while developing your code, then scale up once everything works. You can also use Google Colab for larger RAM availability.

Resources

The Census Bureau’s TIGER/Line Shapefiles (https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html) are the primary source for U.S. boundary files at various geographic levels.

GeoPandas documentation (https://geopandas.org/en/stable/) provides detailed guides for all the spatial operations we’ll use in this course.

  • January 8, 2026