[Stata] Importing SAS (XPT & SAS7BDAT) Files into Stata with Value Labels using Python
US government-provided datasets, such as the Behavioral Risk Factor Surveillance System (BRFSS) and Youth Risk Behavior Surveillance (YRBS), are distributed in SAS formats (XPT and SAS7BDAT). For Stata users unfamiliar with SAS, importing these datasets while preserving value labels can be challenging. This guide demonstrates how to efficiently import SAS files into Stata and automatically convert value labels using Python.
Step 1: Download the SAS Data Files
First, download the SAS datasets from the respective government agency websites. These datasets are typically available in:
- XPT format (SAS Transport File, commonly used for public data distribution)
- SAS7BDAT format (SAS proprietary format for datasets)
Ensure you also download the accompanying documentation, which usually includes metadata about variable names, labels, and coded values.
Step 2: Convert XPT File to Stata DTA file
Stataβs built-in functions (import sasxport5
and import sas
) allow you to import SAS datasets.
import sasxport5 βdataset.xptβ, clear
save "dataset.dta", replace // save it in Stata dta format
For some reason, import sasxport5
the command in Stata is so slow. I prefer to use a Python pyreadstat
package to do this work faster (- my computer is quite high-RAM, but more than 10 times faster from my experience). I use Google Colab to run this kind of code.
!pip install pyreadstat
import pyreadstat
# Define file paths
xpt_file = "LLCP2021.XPT"
dta_file = "LLCP2021.dta"
# Read the XPT file
df, meta = pyreadstat.read_xport(xpt_file, encoding="iso-8859-1")
# Rename variables that start with an underscore
df.columns = [col.lstrip("_") if col.startswith("_") else col for col in df.columns]
# Save to Stata format
pyreadstat.write_dta(df, dta_file)
print(f"File converted successfully: {dta_file}")
Step 3: Extract Value Labels and Generate a Stata Do-File
One more step – you want to assign value labels but converted Stata dta file doesn’t have them. Further, you can’t open the provided SAS program file since you don’t have SAS (it’s quite heavy and expensive to install just for this reason). Fortunately, you can convert SAS format file to Stata do file to do this automatically.
For example, the following codes convert SAS program for BRFSS data to Stata do file. I have added several regex (regular expressions) rules considering that Stata doesn’t allow _ or number for the first letter of variables.
import re
import chardet
# Paths to your files
sas_file = "FORMAT22.sas"
stata_do_file = "stata_labels_format22.do"
# Function to ensure valid Stata variable name
def make_valid_stata_name(name):
return name.lstrip("_")
# Detect file encoding
with open(sas_file, "rb") as f:
raw_data = f.read()
detected_encoding = chardet.detect(raw_data)["encoding"]
print(f"Detected Encoding: {detected_encoding}")
# Read file content
with open(sas_file, "r", encoding=detected_encoding) as f:
lines = f.readlines()
# Extract all VALUE formats with restructured names
label_define_commands = []
current_format = None
current_values = []
processed_formats = {} # Maps original format name to restructured format name
for line in lines:
line = line.strip()
# Detect format definitions (VALUE)
match_value_start = re.match(r'^value\s+([^\s]+)', line, re.IGNORECASE)
if match_value_start:
if current_format and current_values:
original_format = current_format.upper()
digits = re.findall(r'\d+', original_format)
alpha_part = re.sub(r'\d+', '', original_format)
restructured_format = make_valid_stata_name(alpha_part + ''.join(digits))
processed_formats[original_format] = restructured_format
label_define_commands.append(f'label define {restructured_format} {" ".join(current_values)}')
current_format = match_value_start.group(1)
current_values = []
continue
# Detect mappings within VALUE formats
code_label_pair = re.match(r'^(?!\.)(\d+|\w+)\s*=\s*"(.+?)"', line)
if code_label_pair and current_format:
code, label = code_label_pair.group(1), code_label_pair.group(2)
current_values.append(f'{code} "{label}"')
# Process the last format
if current_format and current_values:
original_format = current_format.upper()
digits = re.findall(r'\d+', original_format)
alpha_part = re.sub(r'\d+', '', original_format)
restructured_format = make_valid_stata_name(alpha_part + ''.join(digits))
processed_formats[original_format] = restructured_format
label_define_commands.append(f'label define {restructured_format} {" ".join(current_values)}')
# Replace label define _varname with label define varname
label_define_commands = [re.sub(r'label define _(\w+)', r'label define \1', cmd) for cmd in label_define_commands]
# Write to the Stata do file
with open(stata_do_file, "w", encoding="utf-8") as f:
f.write("// Stata do file for labels generated from SAS formats\n\n")
# Write label definitions with restructured names
f.write("// Value label definitions (restructured to match variable naming pattern)\n")
f.write("// Original SAS format names have been modified to move numbers to the end\n")
f.write("\n".join(label_define_commands))
f.write("\n\n")
# Write mapping information for reference
f.write("// Format name restructuring (for reference only):\n")
for orig, restr in processed_formats.items():
f.write(f"// Original: {orig} β Restructured: {restr}\n")
f.write("\n\n")
# Print the generated Stata do file content
with open(stata_do_file, "r", encoding="utf-8") as f:
stata_code_content = f.read()
print(stata_code_content)
print(f"\u2705 Successfully created '{stata_do_file}' with labels restructured to match variable naming pattern.")
Here is an example of a generated do file from Python programs π
Step 4: Assign value labels using do file
After generating the .do
file, apply these codes to your generated Stata dta format file in Step 2. Then you have the Stata datasets with assigned value labels π
// Auto-label assignment program
capture program drop auto_label_vars
program define auto_label_vars
* Retrieve the list of all variables in the current dataset
quietly ds
local allvars = r(varlist)
* Retrieve the list of all defined labels
quietly label dir
local labellist = r(names)
* Apply labeling to each variable
foreach var of local allvars {
* Check if the variable exists in the defined label list
local found : list var in labellist
if `found' {
label values `var' `var'
display as result "β Labeled variable `var'"
}
}
end
* Execute the program on the currently active dataset
auto_label_vars