[Stata] How to Check Assumptions for Regressions (vif, regcheck)
Before we can trust the results of a regression model, we need to make sure that some basic assumptions are met. These assumptions are:
- Linearity: The relationship between the dependent variable and each independent variable is linear.
- Independence: Observations are independent of each other.
- Homoscedasticity: The residuals have constant variance across different levels of the independent variables.
- Normality: For any fixed values of the independent variables, the dependent variable is normally distributed.
If these assumptions are violated, the regression coefficients, standard errors, and hypothesis tests may be biased or inaccurate. Therefore, it is important to check these assumptions before drawing any conclusions from a regression model.
In this blog post, I will show you how to check these assumptions in Stata, using the NHANES II example dataset. The dependent variable is hlthstat, a measure of health status ranging from 1 (excellent) to 5 (poor). The independent variables are age, sex, race, region, and houssiz. I will use a linear regression model to examine how these variables affect health status.
webuse nhanes2
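If you want a quick feel for these variables before modeling, Stata's built-in describe, summarize, and tab1 commands are enough (an optional sketch, not part of the main workflow):
* optional: inspect the variables used in the model
describe hlthstat age sex race region houssiz
summarize hlthstat age houssiz
tab1 sex race region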
Step 1. Running the Regression Model
The first step is to run the regression model using the reg command in Stata. Here is the syntax and output:
reg hlthstat age i.sex i.race i.region houssiz
The output shows the regression coefficients, standard errors, t-statistics, p-values, and confidence intervals for each independent variable, as well as overall model statistics such as R-squared and the F-test.
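If you prefer to check the assumptions one by one, Stata's built-in postestimation commands can test several of them individually (regcheck in Step 2 automates this kind of checking). A minimal sketch, assuming the regression above is the most recent estimation (the residual variable name r is arbitrary):
predict r, residuals   // store the residuals for the checks below
rvfplot                // residual-vs-fitted plot: curvature or fanning suggests problems
estat hettest          // Breusch-Pagan test for heteroskedasticity
estat ovtest           // Ramsey RESET test for functional-form (linearity) problems
qnorm r                // normal quantile plot to assess normality of the residuals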
Step 2. Checking Assumptions with regcheck
ssc install regcheck
reg hlthstat age i.sex i.race i.region houssiz
regcheck
The user-created command regcheck runs all of these assumption tests at once in a single command! In this example, you can see from the output that assumptions 1), 3), and 5) in regcheck's numbered list are not met.
In this kind of situation, you can consider robust standard errors or another model (other than OLS regression) to address the problems. The simplest option is to rerun the regression with the vce(robust) option to obtain robust standard errors.
reg hlthstat age i.sex i.race i.region houssiz, vce(robust)
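Since hlthstat is an ordinal 1-to-5 scale, another natural alternative to OLS, if the assumptions fail badly, is a model designed for ordered outcomes. A minimal sketch using Stata's built-in ologit (my illustration, not a prescription):
* alternative: ordered logistic regression for the ordinal outcome
ologit hlthstat age i.sex i.race i.region houssiz, vce(robust)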
Tip. Checking Multicollinearity with the vif Command
Multicollinearity occurs when independent variables in a regression model are highly correlated. This correlation can cause problems when interpreting the results because it becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable independently. The coefficient estimates can swing wildly based on which other independent variables are in the model, and the coefficients become very sensitive to small changes in the model. As a result, there is a greater probability that we will incorrectly conclude that a variable is not statistically significant.
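As a quick first hint, you can look at the pairwise correlation between the continuous predictors in our example (a minimal sketch; note that low pairwise correlations do not rule out multicollinearity among several variables jointly):
* pairwise correlation of the continuous predictors
correlate age houssiz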
VIF stands for variance inflation factor, a measure of how much the variance of a regression coefficient is inflated by multicollinearity, that is, by correlated explanatory variables that do not provide unique or independent information.
If you want to conduct the VIF test alone in Stata, you can just use the vif command right after your regression command, as shown below.
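For example, with the model from Step 1 (estat vif is the modern spelling of the older vif command; both should work after regress):
reg hlthstat age i.sex i.race i.region houssiz
estat vif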
The following rules of thumb have been proposed for VIF thresholds that signal serious multicollinearity problems:
- Johnston, R., Jones, K., & Manley, D. (2018): VIF > 2.5 is problematic
- Sheather, S. (2009): VIF > 5 is problematic
- Vittinghoff, E., Glidden, D. V., Shiboski, S. C., & McCulloch, C. E. (2006): VIF > 10 is problematic
Even though there is debate over the usefulness of the VIF score (e.g., this article), I think it is helpful for assessing whether a regression model is overfitted. If a variable has a VIF above 10, or the average VIF is above 5, I would consider reducing the number of variables to avoid the overfitting issue.
For your information, a moderation analysis (with interaction terms) will show very high VIFs, since interaction terms are correlated with their main effects by nature. You can see in this paper (McClelland et al., 2017) that such multicollinearity is not a problem for moderation analysis in regressions.
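If you want to see this effect yourself, here is a minimal sketch with a hand-made, purely illustrative interaction between age and houssiz; the VIFs will be inflated even though the model itself is fine:
* hypothetical moderation model: an interaction term inflates VIF by construction
generate agehous = age*houssiz   // interaction of the two continuous predictors
reg hlthstat age houssiz agehous
estat vif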