[Stata] Importing SAS (XPT & SAS7BDAT) Files into Stata with Value Labels using Python

US government-provided datasets, such as the Behavioral Risk Factor Surveillance System (BRFSS) and Youth Risk Behavior Surveillance (YRBS), are distributed in SAS formats (XPT and SAS7BDAT). For Stata users unfamiliar with SAS, importing these datasets while preserving value labels can be challenging. This guide demonstrates how to efficiently import SAS files into Stata and automatically convert value labels using Python.

Step 1: Download the SAS Data Files

First, download the SAS datasets from the respective government agency websites. These datasets are typically available in:

  • XPT format (SAS Transport File, commonly used for public data distribution)
  • SAS7BDAT format (SAS proprietary format for datasets)

Ensure you also download the accompanying documentation, which usually includes metadata about variable names, labels, and coded values.

Step 2: Convert XPT File to Stata DTA file

Stata’s built-in functions (import sasxport5 and import sas) allow you to import SAS datasets.

Stata
import sasxport5 β€œdataset.xpt”, clear 
save "dataset.dta", replace // save it in Stata dta format

For some reason, import sasxport5 the command in Stata is so slow. I prefer to use a Python pyreadstat package to do this work faster (- my computer is quite high-RAM, but more than 10 times faster from my experience). I use Google Colab to run this kind of code.

Python
!pip install pyreadstat
import pyreadstat

# Define file paths
xpt_file = "LLCP2021.XPT"
dta_file = "LLCP2021.dta"

# Read the XPT file 
df, meta = pyreadstat.read_xport(xpt_file, encoding="iso-8859-1")

# Rename variables that start with an underscore
df.columns = [col.lstrip("_") if col.startswith("_") else col for col in df.columns]

# Save to Stata format
pyreadstat.write_dta(df, dta_file)

print(f"File converted successfully: {dta_file}")

Step 3: Extract Value Labels and Generate a Stata Do-File

One more step – you want to assign value labels but converted Stata dta file doesn’t have them. Further, you can’t open the provided SAS program file since you don’t have SAS (it’s quite heavy and expensive to install just for this reason). Fortunately, you can convert SAS format file to Stata do file to do this automatically.

For example, the following codes convert SAS program for BRFSS data to Stata do file. I have added several regex (regular expressions) rules considering that Stata doesn’t allow _ or number for the first letter of variables.

Python
import re
import chardet

# Paths to your files
sas_file = "FORMAT22.sas"
stata_do_file = "stata_labels_format22.do"

# Function to ensure valid Stata variable name
def make_valid_stata_name(name):
    return name.lstrip("_")

# Detect file encoding
with open(sas_file, "rb") as f:
    raw_data = f.read()  
    detected_encoding = chardet.detect(raw_data)["encoding"]
    print(f"Detected Encoding: {detected_encoding}")

# Read file content
with open(sas_file, "r", encoding=detected_encoding) as f:
    lines = f.readlines()

# Extract all VALUE formats with restructured names
label_define_commands = []
current_format = None
current_values = []
processed_formats = {}  # Maps original format name to restructured format name

for line in lines:
    line = line.strip()
    
    # Detect format definitions (VALUE)
    match_value_start = re.match(r'^value\s+([^\s]+)', line, re.IGNORECASE)
    if match_value_start:
        if current_format and current_values:
            original_format = current_format.upper()
            digits = re.findall(r'\d+', original_format)
            alpha_part = re.sub(r'\d+', '', original_format)
            restructured_format = make_valid_stata_name(alpha_part + ''.join(digits))
            processed_formats[original_format] = restructured_format
            label_define_commands.append(f'label define {restructured_format} {" ".join(current_values)}')
        
        current_format = match_value_start.group(1)
        current_values = []
        continue
    
    # Detect mappings within VALUE formats
    code_label_pair = re.match(r'^(?!\.)(\d+|\w+)\s*=\s*"(.+?)"', line)
    if code_label_pair and current_format:
        code, label = code_label_pair.group(1), code_label_pair.group(2)
        current_values.append(f'{code} "{label}"')

# Process the last format
if current_format and current_values:
    original_format = current_format.upper()
    digits = re.findall(r'\d+', original_format)
    alpha_part = re.sub(r'\d+', '', original_format)
    restructured_format = make_valid_stata_name(alpha_part + ''.join(digits))
    processed_formats[original_format] = restructured_format
    label_define_commands.append(f'label define {restructured_format} {" ".join(current_values)}')

# Replace label define _varname with label define varname
label_define_commands = [re.sub(r'label define _(\w+)', r'label define \1', cmd) for cmd in label_define_commands]

# Write to the Stata do file
with open(stata_do_file, "w", encoding="utf-8") as f:
    f.write("// Stata do file for labels generated from SAS formats\n\n")
    
    # Write label definitions with restructured names
    f.write("// Value label definitions (restructured to match variable naming pattern)\n")
    f.write("// Original SAS format names have been modified to move numbers to the end\n")
    f.write("\n".join(label_define_commands))
    f.write("\n\n")
    
    # Write mapping information for reference
    f.write("// Format name restructuring (for reference only):\n")
    for orig, restr in processed_formats.items():
        f.write(f"// Original: {orig} β†’ Restructured: {restr}\n")
    f.write("\n\n")

# Print the generated Stata do file content
with open(stata_do_file, "r", encoding="utf-8") as f:
    stata_code_content = f.read()
    print(stata_code_content)

print(f"\u2705 Successfully created '{stata_do_file}' with labels restructured to match variable naming pattern.")

Here is an example of a generated do file from Python programs πŸ™‚

Step 4: Assign value labels using do file

After generating the .do file, apply these codes to your generated Stata dta format file in Step 2. Then you have the Stata datasets with assigned value labels πŸ™‚

Stata
// Auto-label assignment program
capture program drop auto_label_vars
program define auto_label_vars

    * Retrieve the list of all variables in the current dataset
    quietly ds
    local allvars = r(varlist)

    * Retrieve the list of all defined labels
    quietly label dir
    local labellist = r(names)

    * Apply labeling to each variable
    foreach var of local allvars {
        * Check if the variable exists in the defined label list
        local found : list var in labellist

        if `found' {
            label values `var' `var'
            display as result "βœ“ Labeled variable `var'"
        }
    }

end

* Execute the program on the currently active dataset
auto_label_vars
  • March 3, 2025