[Stata] Regression Diagnostics: Assessing normality of residuals
Regression diagnostics are crucial for validating the assumptions underlying linear regression models. One of these assumptions is the normality of residuals. It matters mainly for inference: when residuals are far from normal, the p-values and confidence intervals from t and F tests can be unreliable, especially in small samples. This blog post walks through a graphical approach to assessing the normality of residuals using the NHANES2 dataset in Stata, guided by the methods outlined in UCLA’s IDRE Statistics tutorials.
Before diving into the specifics of normality testing, it’s essential to understand the role of regression diagnostics. They help identify violations of linear regression assumptions, including linearity, independence, homoscedasticity, and normality of residuals. Among these, the normality assumption states that the residuals (the differences between observed and predicted values) are normally distributed around zero.
Step 1: Setting Up the Analysis with the NHANES2 Dataset
The National Health and Nutrition Examination Survey (NHANES) dataset provides a comprehensive basis for various statistical analyses. To start with regression diagnostics, load the NHANES2 dataset in Stata using the webuse command. This dataset includes various health-related variables suitable for regression analysis.
webuse nhanes2
Step 2: Estimating Residuals
After fitting a regression model to your data, the next step involves estimating residuals. Residuals are the differences between the observed values and those predicted by the model. In Stata, you can calculate residuals using the predict command following a regression analysis.
regress dependent_variable independent_variables
predict r, resid
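As a concrete illustration of the template above, here is a minimal sketch using variables that ship with nhanes2. The choice of outcome (bpsystol) and predictors (age, bmi, female) is an assumption made for this example, not the model from the original analysis, and the sketch ignores the survey design for simplicity.

```stata
* Illustrative model only: outcome and predictors are assumptions
webuse nhanes2, clear

* Fit an ordinary least squares regression
regress bpsystol age bmi female

* Store the residuals (observed minus fitted values) in a new variable r
predict r, residuals

* Quick numeric check: skewness near 0 and kurtosis near 3 are
* what a normal distribution would give
summarize r, detail
```

The skewness and kurtosis reported by summarize give a first numeric hint before looking at any plots.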
Step 3: Assessing Normality of Residuals
Normal Probability Plot (Pnorm)
The pnorm command in Stata generates a normal probability plot, which is a graphical tool for assessing if residuals follow a normal distribution. Deviations from a straight line in the plot indicate deviations from normality.
pnorm r
Normal Quantile Plot (Qnorm)
The qnorm command produces a normal quantile (or Q-Q) plot, offering another perspective on the distribution of residuals. This plot is particularly useful for identifying deviations in the tails of the distribution.
qnorm r
Kernel Density Plot with Normal Overlay (Kdensity)
The kdensity command, along with the normal option, overlays a normal density curve on the kernel density estimate of the residuals. This provides a visual comparison between the empirical distribution of residuals and a normal distribution.
kdensity r, normal
If your outcome variable is not continuous (for example, a binary outcome), you can compare the kernel density of the residuals from a linear probability model (LPM) with that from a logit model. In this comparison, the logit model’s residual density (right figure) is much closer to a normal distribution than the LPM’s (left figure).
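One way such a comparison could be set up is sketched below. The binary outcome (highbp), the predictors, and the use of Pearson residuals for the logit model are all assumptions for illustration; the residual type used for the original figures is not stated.

```stata
* Illustrative comparison: outcome, predictors, and residual type are assumptions
webuse nhanes2, clear

* Linear probability model: OLS on a 0/1 outcome
regress highbp age bmi female
predict r_lpm, residuals

* Logit model; after logit, predict's residuals option returns Pearson residuals
logit highbp age bmi female
predict r_logit, residuals

* Kernel density of each set of residuals with a normal overlay
kdensity r_lpm, normal title("LPM residuals") name(kd_lpm, replace)
kdensity r_logit, normal title("Logit Pearson residuals") name(kd_logit, replace)

* Show the two plots side by side
graph combine kd_lpm kd_logit
```

Deviance residuals (predict with the deviance option) would be another reasonable choice for the logit model, and the picture may differ depending on which is used.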
Step 4: Interpreting the Results
When assessing the plots generated by pnorm, qnorm, and kdensity, you’re looking for how closely the residuals follow a normal distribution. In the normal probability and quantile plots, a linear pattern suggests normality, while deviations suggest otherwise. The kernel density plot should closely mirror the overlaid normal curve for the residuals to be considered normally distributed.
- Pnorm Plot: Look for a straight diagonal line. Deviations, especially near the center, suggest non-normality.
- Qnorm Plot: Focus on the linearity of points. Curvature or substantial deviations in the tails indicate non-normality.
- Kdensity Plot: The more the empirical distribution mimics the normal curve, the more likely the residuals are normally distributed.
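To read the three plots against each other, it can help to put them in a single figure. The sketch below assumes the same illustrative model as before (bpsystol on age, bmi, and female, which is not necessarily the model behind the original post); the graph names are arbitrary.

```stata
* Illustrative model only: outcome and predictors are assumptions
webuse nhanes2, clear
regress bpsystol age bmi female
predict r, residuals

* Draw each diagnostic plot and keep it in memory under a name
pnorm r, title("pnorm") name(g_pnorm, replace)
qnorm r, title("qnorm") name(g_qnorm, replace)
kdensity r, normal title("kdensity") name(g_kd, replace)

* Combine the three named graphs into one row
graph combine g_pnorm g_qnorm g_kd, rows(1)
```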
Conclusion
Assessing the normality of residuals is a fundamental step in regression diagnostics, ensuring the validity of regression analysis. By using Stata’s pnorm, qnorm, and kdensity commands, researchers can visually inspect the distribution of residuals. While these graphical methods provide intuitive insights, remember that they are subjective. For a more objective assessment, complement these visual diagnostics with statistical tests for normality, such as the Shapiro-Wilk test. Through careful diagnostics, researchers can validate their regression models, leading to more reliable and accurate conclusions.
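A minimal sketch of such tests follows, again on an illustrative model that is an assumption of this example. Stata's documentation describes swilk as intended for a limited range of sample sizes (up to 2,000 observations, as far as I am aware), and nhanes2 has more than 10,000, so the sketch runs sktest on the full sample and swilk on a random subsample. The seed and subsample size are arbitrary.

```stata
* Illustrative model only: outcome and predictors are assumptions
webuse nhanes2, clear
regress bpsystol age bmi female
predict r, residuals

* Skewness and kurtosis test for normality on the full sample
sktest r

* Shapiro-Wilk test on a random subsample of 2,000 observations
set seed 12345
sample 2000, count
swilk r
```

With samples this large, formal tests tend to reject normality even for small departures, so the p-values are best read alongside the plots rather than in place of them.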