[Stata] Bivariate Linear Regression and Plotting (reg)

Running a bivariate regression can help us understand the relationship between two variables. In this tutorial, we’ll use the nhanes2 dataset available through Stata’s webuse command to demonstrate how to run a bivariate regression and interpret the output, followed by creating a two-way plot to visualize the relationship.

Stata
webuse nhanes2

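We’ll examine the relationship between BMI (bmi) and age (age). To run a bivariate regression, use the following command, where yvar is the dependent variable, xvar is the independent variable, and reg is an abbreviation of regress: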

Stata
reg yvar xvar
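
For the variables used in this tutorial, the call might look like the following sketch. It assumes an internet connection so that webuse can fetch the data; the summarize step is an optional check added here for illustration and is not part of the original workflow.

```stata
* Load the example data (the clear option replaces any data in memory)
webuse nhanes2, clear

* Quick look at the two variables before modelling
summarize bmi age

* Regress BMI (dependent variable) on age (independent variable)
reg bmi age
```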

The bivariate regression is based on the formula below, where Y is the dependent variable, X is the independent variable, a is the slope coefficient, and b is the constant (intercept).

Y = aX+b

Running the command produces an output table. In this example, I estimated the relationship between age (independent variable) and BMI (dependent variable). The regression results can be presented as a formula:

bmi = 0.049*age + 23.212

Below is a simplified breakdown of the key statistics in the output:

1. Number of observations (Number of obs = 10,351): This tells you how many data points were included in the regression analysis.

2. F-statistic (F(1, 10349) = 312.45): The F-statistic tests the overall significance of the model. A high F-statistic (312.45 in this case) together with a low p-value for the F-test (Prob > F = 0.0000) suggests that the model is statistically significant. With a single predictor, this test is equivalent to the t-test on the age coefficient.

3. R-squared (R-squared = 0.0293): R-squared represents the proportion of the variance in the dependent variable (BMI) that is predictable from the independent variable (age). Here, it’s 2.93%, indicating a very small portion of the variance in BMI is explained by age.

4. Adjusted R-squared (Adj R-squared = 0.0292): This is a modified version of R-squared that has been adjusted for the number of predictors in the model. It’s similar to R-squared but takes into account the complexity of the model.

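5. Root Mean Square Error (Root MSE = 4.8426): The Root MSE is the square root of the residual sum of squares divided by the residual degrees of freedom (10,349 here), which is roughly the typical size of the difference between predicted and actual values. It is expressed in the units of the dependent variable and provides a measure of the model’s prediction accuracy.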

6. Coefficient for age (age = .0488762): This coefficient indicates that a one-year increase in age is associated with an average increase in BMI of approximately 0.049. Because age is the only predictor in a bivariate regression, no other factors are held constant.

7. Standard Error for age (Std. err. = .0027651): The standard error measures the precision of the age coefficient by estimating how much the coefficient would vary if the same regression were run on different samples from the same population.

8. t-statistic for age (t = 17.68): The t-statistic tests the null hypothesis that the coefficient for age is zero (no relationship). It is the coefficient divided by its standard error. A large absolute value (17.68 here) and a small p-value (P>|t| = 0.000) indicate a statistically significant association between age and BMI.

9. 95% Confidence Interval for age ([95% conf. interval] = .0434561 to .0542963): This interval gives a range of plausible values for the population coefficient. If the analysis were repeated on many samples, about 95% of intervals constructed this way would contain the true value. Here, the interval runs from about 0.043 to 0.054.

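10. Constant term (_cons = 23.21209): The constant term is the y-intercept of the regression line, that is, the predicted BMI when age is zero. Because the sample contains only adults, this value is an extrapolation and should not be interpreted literally.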
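
The statistics in the list above are also saved by Stata after estimation, so they can be retrieved and recomputed rather than read off the table. The following is a minimal sketch, assuming nhanes2 is in memory. The display labels and the scalar name tcrit are illustrative choices, not part of the original tutorial.

```stata
* Re-estimate quietly so the stored results refer to this model
quietly reg bmi age

* Model-level statistics stored in e()
display "Number of obs  = " e(N)
display "F(1, " e(df_r) ") = " %9.2f e(F)
display "Prob > F       = " %6.4f Ftail(e(df_m), e(df_r), e(F))
display "R-squared      = " %6.4f e(r2)
display "Adj R-squared  = " %6.4f e(r2_a)
display "Root MSE       = " %6.4f e(rmse)

* Coefficient-level statistics for age
display "Coefficient    = " _b[age]
display "Std. err.      = " _se[age]
display "t              = " %6.2f _b[age]/_se[age]

* 95% confidence interval: estimate +/- critical t value * standard error
scalar tcrit = invttail(e(df_r), 0.025)
display "95% CI lower   = " _b[age] - tcrit*_se[age]
display "95% CI upper   = " _b[age] + tcrit*_se[age]

* Intercept
display "Constant       = " _b[_cons]
```

Recomputing the t-statistic and the confidence interval by hand in this way is a useful check on how the columns of the regression table relate to one another.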

Plotting: Scatter Plot with Fitted Lines

You can draw a scatter plot with a fitted line in Stata using the twoway command.

Stata
twoway (scatter yvar xvar) (lfit yvar xvar)
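
For the BMI and age example, the plot command might look like the sketch below. The marker size, axis titles, and legend labels are illustrative choices rather than requirements. The /// line continuation works in a do-file; in the Command window, type everything on one line instead.

```stata
* Scatter of BMI against age with the fitted regression line on top
twoway (scatter bmi age, msize(tiny)) ///
       (lfit bmi age), ///
       ytitle("BMI") xtitle("Age (years)") ///
       legend(order(1 "Observed" 2 "Fitted line"))
```

With more than 10,000 observations the points overlap heavily, which is why small markers are suggested here.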

The user-written command scatterfit produces a more polished figure and has an option to display regression parameters on the plot, as follows. Because it is not part of official Stata, it must be installed before use.

Stata
scatterfit bmi age, binned regparameters(coef se pval sig adjr2 nobs) 

Calculating Predicted Value with lincom

By using lincom in Stata, you can test specific hypotheses, compare coefficients, and generate predicted values or other linear combinations of coefficients, leveraging the results of your regression analysis to gain further insights into your data.

Suppose you want to calculate the predicted BMI for a 50-year-old individual. You can use lincom to obtain the linear prediction like this:

Stata
lincom _b[_cons] + _b[xvar]*50

This command computes the sum of the constant term and the product of the coefficient for age and 50 (the age of the individual). The result shows a predicted BMI of 25.66 at age 50. Replacing 50 with 80 gives a predicted BMI of 27.12 for an 80-year-old.

lincom will provide an output with a coefficient for the linear combination, a standard error, a z or t-value (depending on your model), a p-value, and a confidence interval. These statistics can be interpreted similarly to those from the original regression output.
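
Applied to the BMI and age model, the predictions discussed above could be obtained as in the following sketch, which assumes nhanes2 is in memory. The margins call is an alternative added here for comparison and is not part of the original tutorial.

```stata
* Re-estimate so the regression results are the active estimates
quietly reg bmi age

* Predicted BMI at age 50 and at age 80 as linear combinations
lincom _cons + age*50
lincom _cons + age*80

* Cross-check by hand using the stored coefficients
display _b[_cons] + _b[age]*50

* Alternative: margins reports predictions at chosen ages in one table
margins, at(age=(50 80))
```

The lincom results should reproduce the predicted values reported above. margins is convenient when predictions at several ages are needed at once.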

  • September 30, 2023