[Stata] Instrumental Variables Regression: ivregress, ivreg2
What are Instrumental Variables?
Instrumental variable regression is a statistical method used when you suspect that your predictor is endogenous, that is, correlated with unobserved factors that also affect the outcome. It’s like having a sneaky confounder that you can’t measure directly, but you know it’s there, messing with your results. So, you bring in an instrumental variable, a kind of secret agent, to help you uncover the true effect of your variable of interest.
Imagine you’re studying the effect of a new counseling program (treatment) on reducing stress levels (outcome) among social workers. However, you suspect that those who choose to participate might already be more motivated or less stressed, which could bias your results.
Step 1: Find Your Instrument
You need an instrumental variable that is related to the likelihood of participating in the program but affects stress levels only through participation. Let’s say you find that social workers who live closer to the counseling center are more likely to participate. Proximity to the center becomes your instrumental variable.
Step 2: First Stage Regression
You first run a regression with the instrumental variable (proximity) predicting the treatment (participation in the program). This gives you the predicted values of treatment. Because they vary only with the instrument, they are unrelated to the unmeasured confounder (motivation or initial stress levels), provided the instrument is valid.
Step 3: Second Stage Regression
Next, you use these predicted values from the first stage as your ‘clean’ treatment variable to predict the outcome (stress levels). This second regression tells you the effect of the counseling program on stress levels, without the bias introduced by the unmeasured confounder. In practice, let the software run both stages together, because the standard errors from a manually run second stage are incorrect.
Suppose you have data on social workers’ stress levels and their participation in the counseling program. You also know how far each social worker lives from the center.
- First Stage: You find that living closer to the center significantly predicts higher participation.
- Second Stage: Using the predicted participation from the first stage, you find that participating in the counseling program leads to lower stress levels.
By using the instrumental variable of proximity, you’ve managed to isolate the effect of the counseling program on stress levels, accounting for the potential bias of self-selection into the program.
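The three steps above can be sketched on simulated data. This is a minimal illustration, not an analysis of real data: every variable name (motivation, distance, participate, stress) and every coefficient is an assumption made up for the example, and distance stands in for proximity (a smaller distance means closer to the center).

```stata
* Simulated illustration; all variable names and coefficients are hypothetical
clear
set seed 12345
set obs 5000

* Unobserved confounder
generate motivation = rnormal()
* Instrument: distance to the counseling center in km
generate distance = runiform(0, 20)
* Motivated workers who live nearby are more likely to participate
generate participate = (0.8*motivation - 0.15*distance + rnormal() > -1.5)
* The true program effect on stress is -2; motivation also lowers stress
generate stress = 10 - 2*participate - 1.5*motivation + rnormal()

* Naive OLS: biased because motivation is omitted
regress stress participate

* Manual two stages (point estimate only; these standard errors are wrong)
regress participate distance
predict participate_hat, xb
regress stress participate_hat

* Both stages in one step, with correct standard errors
ivregress 2sls stress (participate = distance)
```

The coefficient on participate_hat in the manual second stage should equal the ivregress coefficient on participate; only the standard errors differ, which is why the one-step command is preferred.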
Stata Commands for Instrumental Variables
- ivregress: Stata’s built-in command for instrumental variables regression, with 2SLS, LIML, and GMM estimators.
- ivreg2: As a user-created command, ivreg2 extends the functionality of ivregress. It provides additional features, such as testing for endogeneity, weak instruments, and overidentification.
- ivprobit: Probit model for a binary outcome with continuous endogenous covariates
- ivpoisson: Poisson model for a count outcome with endogenous covariates
- xtivreg: Instrumental variables and two-stage least squares for panel-data models
- XTOVERID: Stata module to calculate tests of overidentifying restrictions after xtreg, xtivreg, xtivreg2, xthtaylor
- ivreghdfe: Extended instrumental variable regressions with multiple levels of fixed effects
- ivreg2h: Stata module to perform instrumental variables estimation using heteroskedasticity-based instruments
- imperfectiv: Stata module to estimate bounds with “Imperfect Instrumental Variables” (Nevo and Rosen, 2012)
- ivmediate: Stata module to perform causal mediation analysis in instrumental-variables regressions
- SPATIAL_HAC_IV: Stata module to estimate an instrumental variable regression, adjusting standard errors for spatial correlation, heteroskedasticity, and autocorrelation
- PARIV: Stata module to perform nearly-collinear robust instrumental-variables regression
- CQIV: Stata module to perform censored quantile instrumental variables regression
Tests
- IVHETTEST: Stata module to perform Pagan-Hall and related heteroskedasticity tests after IV
- TESTEX: Stata module for a statistical test of the exclusion restriction of an instrumental variable (IV)
- WEAKIV: Stata module to perform weak-instrument-robust tests and confidence intervals for instrumental-variable (IV) estimation of linear, probit and tobit models
- IVENDOG: Stata module to calculate Durbin-Wu-Hausman endogeneity test after ivreg
- IVTREATREG: Stata module to estimate binary treatment models with idiosyncratic average effect
- UNDERID: Stata module producing postestimation tests of under- and over-identification after linear IV estimation
- IVDESC: Stata module to profile compliers and non-compliers for instrumental variable analysis
Instrumental Variable Regression with Stata: ivreg2
You can load the nlswork.dta dataset from the Stata Press website using the webuse command:
webuse nlswork, clear
Step 1. Specify your regression model
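For example, let’s say we want to estimate the effect of years of education on wages, treating education as endogenous and instrumenting it with msp.
- outcome: wages (ln_wage)
- predictor: years of education (grade)
- instrumental variable: msp, which in nlswork is an indicator for being married with a spouse present, not mother’s education. It is unlikely to satisfy the exclusion restriction, so treat this example as a demonstration of syntax and output rather than a credible estimate.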
ssc install ranktest
ssc install ivreg2
ivreg2 ln_wage (grade = msp)
- ln_wage (dependent variable):
- The coefficient for grade is approximately 0.2313, and it is statistically significant (p-value < 0.001). This suggests that grade has a positive effect on ln_wage.
- Identification Tests:
- The Anderson canonical correlation LM statistic tests for underidentification. The p-value is 0.0001, indicating that the model is not underidentified.
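- The Cragg-Donald Wald F statistic tests for weak identification. ivreg2 does not print a p-value for it; instead, compare the statistic with the Stock-Yogo critical values listed beneath it. A statistic well above those values (or above the common rule of thumb of 10) suggests that the instrument is not weak.
- Since the equation is exactly identified (one instrument for one endogenous regressor), there are no overidentifying restrictions to test. ivreg2 therefore shows a Sargan statistic of 0.000 with a note that the equation is exactly identified; this is not a p-value.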
- Instrumentation:
- grade is instrumented by msp.
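The same model can also be fit with the built-in ivregress command. The sketch below assumes the nlswork data and reuses the variables from the ivreg2 example; estat firststage is the ivregress postestimation command that reports first-stage fit statistics.

```stata
* Built-in alternative to the ivreg2 call above
webuse nlswork, clear
ivregress 2sls ln_wage (grade = msp)

* First-stage R-squared and F statistic for the excluded instrument
estat firststage
```

The coefficient on grade should match the ivreg2 estimate, since both commands apply 2SLS to the same specification.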
The basic command for ivreg2 is as follows. Please replace y with your dependent variable, x1 with your endogenous regressor, z1 with your instrument, and x2 with other control variables.
ivreg2 y (x1 = z1) x2
Some key options:
- robust: robust standard errors
- cluster(varname): clustered standard errors
- first: report first-stage regression estimates
- savefirst: save first-stage estimates
- ffirst: report only the first-stage summary and diagnostic statistics, without the full first-stage regression output
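As a brief sketch of how these options combine, the calls below reuse the nlswork example. Clustering on idcode is an assumption about the appropriate error structure, made here because nlswork contains repeated observations on the same women.

```stata
webuse nlswork, clear

* Heteroskedasticity-robust standard errors plus the full first-stage regression
ivreg2 ln_wage (grade = msp), robust first

* Standard errors clustered on the person identifier, first-stage summary only
ivreg2 ln_wage (grade = msp), cluster(idcode) ffirst
```

Options go after a comma at the end of the command, and several can be listed together.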
Step 2. Adding Control Variables
To add control variables, simply include them after the dependent variable, outside the parentheses. ivreg2 treats them as exogenous and includes them in both stages. For instance, to control for experience and tenure:
ivreg2 ln_wage ttl_exp tenure (grade = msp)
Step 3. Testing Endogeneity: ivendog command
The Wu-Hausman F test and the Durbin-Wu-Hausman chi-sq test are used to test for endogeneity of a regressor (in this case, the variable grade). You can perform these tests by running the ivendog command (developed by Baum et al. 2007) immediately after the ivreg2 command.
ssc install ivendog
ivendog
- Wu-Hausman F Test:
- Null Hypothesis (H0): The regressor (in this case, grade) is exogenous (i.e., not correlated with the error term).
- The test statistic is 8.77348, and the associated p-value is 0.00306.
- Since the p-value is less than 0.05, we reject the null hypothesis.
- Interpretation: There is evidence to suggest that grade is endogenous (correlated with the error term) in the regression model.
- Durbin-Wu-Hausman Chi-Square Test:
- This test is another way to assess endogeneity.
- Null Hypothesis (H0): The regressor (again, grade) is exogenous.
- The test statistic is 8.77170, and the associated p-value is also 0.00306.
- Similar to the Wu-Hausman F test, the p-value is less than 0.05, leading us to reject the null hypothesis.
- Conclusion: The evidence supports the idea that grade is endogenous.
In summary, both tests indicate that grade is likely endogenous in your regression model. This means that there may be omitted variables or other issues affecting the relationship between grade and the dependent variable. Keep in mind that both tests compare the OLS and IV estimates, so their conclusions are only as reliable as the instrument itself. Researchers often address endogeneity by using instrumental variables or other econometric techniques.
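Two other routes to an endogeneity test are sketched below, assuming the same nlswork specification with controls: the endog() option of ivreg2, and the estat endogenous postestimation command for ivregress. The output is not reproduced here.

```stata
webuse nlswork, clear

* endog() tests whether the listed regressor can be treated as exogenous
ivreg2 ln_wage ttl_exp tenure (grade = msp), endog(grade)

* Built-in alternative: Durbin and Wu-Hausman tests after ivregress
ivregress 2sls ln_wage ttl_exp tenure (grade = msp)
estat endogenous
```

In both cases the null hypothesis is that grade is exogenous, so a small p-value points toward endogeneity.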
Step 4. Testing heteroskedasticity
The Pagan-Hall general test statistic is used to test for heteroskedasticity in the context of instrumental variables (IV) estimation. You can perform this test by running the ivhettest command (developed by Schaffer 2023) immediately after the ivreg2 command.
ssc install ivhettest, replace
ivhettest
- Null Hypothesis (H0): The disturbance (error term) is homoskedastic (i.e., the variance of the error term is constant across observations).
Since the p-value is 0.5596, which is much greater than 0.05, we fail to reject the null hypothesis. This means that there is no statistical evidence of heteroskedasticity in the model; the data are consistent with homoskedastic errors, although failing to reject does not prove that the assumption holds.
In simpler terms, the test indicates that the variance of the error terms in your IV regression model is consistent across different levels of the instrumental variables, and there’s no need to adjust for heteroskedasticity based on this test.
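Had the test rejected homoskedasticity, a common response would be to re-estimate with robust standard errors or with two-step efficient GMM. The sketch below assumes the same nlswork specification with controls; it is one reasonable option rather than the only remedy.

```stata
webuse nlswork, clear

* Same 2SLS coefficients, heteroskedasticity-robust standard errors
ivreg2 ln_wage ttl_exp tenure (grade = msp), robust

* Two-step efficient GMM with a heteroskedasticity-robust weighting matrix
ivreg2 ln_wage ttl_exp tenure (grade = msp), gmm2s robust
```

Because this model is exactly identified, the GMM and 2SLS coefficients coincide; the efficiency gain from GMM appears only when there are more instruments than endogenous regressors.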