[Stata] Regressions with interaction effects (categorical x categorical) and (continuous x categorical) and plotting interaction

In this post, I will show you how to run regressions with interaction effects using Stata, and how to plot the interaction effects using the margins and marginsplot commands. Interaction effects are useful when you want to examine how the effect of one variable depends on the level of another variable. For example, you may want to know how the effect of gender (categorical) on self-rated health varies by race (categorical) or how the effect of age (continuous) on self-rated health differs by diabetes status (categorical).

Data

I will use the nhanes2 dataset (loaded with webuse nhanes2), which contains health and nutrition information on a sample of US adults. The data has 10,335 observations and 15 variables. The variables of interest are:

  • hlthstat: a measure of self-reported health status, ranging from 1 (excellent) to 5 (poor)
  • sex: a binary variable indicating the sex of the respondent (1 = male, 2 = female)
  • race: a categorical variable indicating the race of the respondent (1 = white, 2 = black, 3 = other)
  • age: the age of the respondent in years
  • diabetes: a dummy variable indicating whether the respondent has diabetes (0 = no, 1 = yes)
  • region: a categorical variable indicating the region of residence of the respondent (1 = NE, 2 = MW, 3 = S, 4 = W)
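Before fitting anything, it helps to confirm how each variable is actually coded, because the reference categories in the regressions below depend on it. The following is a minimal sketch that assumes an internet connection (webuse downloads the file) and uses only built-in commands:

```stata
* Load the example data shipped with Stata (needs internet access)
webuse nhanes2, clear

* Storage type and labels of the variables used in this post
describe hlthstat sex race age diabetes region

* Coding and value labels of the categorical predictors
codebook sex race diabetes region, compact

* Distribution of the outcome, including missing values
tabulate hlthstat, missing

* Range of the continuous predictor
summarize age
```

The output of codebook and summarize is what tells you which category Stata will treat as the base level and which age values are sensible to plot later.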

Regression with interaction effects

I will run two regressions with interaction effects: one with a categorical x categorical interaction (sex x race), and one with a continuous x categorical interaction (age x diabetes). The dependent variable is hlthstat, and the other variables are included as controls. The syntax for creating interaction terms in Stata is to use the ## operator, which creates both the main effects and the interaction effect. For example, sex##race creates sex, race, and sex x race.
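To make the ## shorthand concrete, the sketch below fits the same model twice, once with ## and once with the main effects and the single-# interaction written out. The stored-estimate names full and longhand are arbitrary labels chosen for this illustration.

```stata
webuse nhanes2, clear

* Full factorial: main effects plus the interaction
reg hlthstat i.sex##i.race
estimates store full

* Same model written term by term (# adds only the interaction)
reg hlthstat i.sex i.race i.sex#i.race
estimates store longhand

* Side-by-side coefficients and standard errors
estimates table full longhand, b(%9.4f) se
```

The two columns of the table should be identical, which shows that ## is only a convenience and does not change the model.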

Categorical by Categorical: Gender x Race

The first regression for categorical (gender) by categorical (race) interaction is:

Stata
reg hlthstat i.sex##i.race age i.diabetes i.region

The interpretation of the coefficients is as follows:

  • The main effect of sex is the female-male difference in hlthstat among whites (the reference category of race), not an overall sex effect. The coefficient of -0.045 means that white females score 0.045 units lower on hlthstat than white males, holding other variables constant. Because hlthstat runs from 1 (excellent) to 5 (poor), a lower score means better self-rated health. This effect is significant at the 5% level.
  • The main effect of race is the difference in hlthstat between each race and white (the reference category) among males (the reference category of sex). The coefficient of -0.400 means that black males score 0.400 units lower than white males, holding other variables constant (p < 0.001, which Stata displays as 0.000). The coefficient of -0.102 means that males of other races score 0.102 units lower than white males, holding other variables constant. This effect is not significant at the 5% level.
  • The interaction effect of sex x race is the difference in the effect of sex on hlthstat between each race and white (the reference category). The coefficient of -0.136 means that the effect of being female on hlthstat is lower by 0.136 units for blacks than for whites, holding other variables constant. This effect is marginally significant at the 5% level. The coefficient of 0.045 means that the effect of being female on hlthstat is higher by 0.045 units for others than for whites, holding other variables constant. This effect is not significant at the 5% level.
  • The coefficient of age is the change in hlthstat for a one-year increase in age, holding other variables constant. The coefficient of -0.024 means that hlthstat decreases by 0.024 units for each year of age (p < 0.001).
  • The coefficient of diabetes is the difference in hlthstat between diabetics and non-diabetics, holding other variables constant. The coefficient of -0.769 means that diabetics score 0.769 units lower than non-diabetics (p < 0.001).
  • The coefficients of region are the differences in hlthstat between each region and NE (the reference category), holding other variables constant. The coefficients of -0.099, -0.321, and -0.224 mean that MW, S, and W score lower than NE by 0.099, 0.321, and 0.224 units, respectively. These effects are significant at the 5% level for MW (p of about 0.02) and at p < 0.001 for S and W.
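Because the sex coefficient applies only to the reference race, it is often easier to ask Stata for the female-male difference within each race group directly. The sketch below is one way to do that with built-in commands; it is an illustration added here, not part of the original analysis.

```stata
webuse nhanes2, clear
reg hlthstat i.sex##i.race age i.diabetes i.region

* Joint test that all sex x race interaction coefficients are zero
testparm i.sex#i.race

* Female-male difference in predicted hlthstat within each race group
margins race, dydx(sex)
```

The margins table has one row per race: the row for whites reproduces the sex coefficient, and the other rows equal that coefficient plus the matching interaction term.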

Plotting interactions

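After fitting the model, margins computes the predicted hlthstat for each sex-by-race combination, and marginsplot graphs those predictions.

Stata
margins sex#race
marginsplot

Alternatively, the community-contributed interactplot command, installed from SSC, can be used:

Stata
ssc install interactplot
interactplot, byplot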
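If the default plot puts the wrong variable on the horizontal axis, marginsplot lets you choose it. The sketch below places race on the x-axis so that each sex becomes its own line; the titles are only suggestions.

```stata
webuse nhanes2, clear
reg hlthstat i.sex##i.race age i.diabetes i.region

* Predicted hlthstat for every sex-by-race cell
margins sex#race

* Race on the x-axis, one line per sex, with descriptive titles
marginsplot, xdimension(race) ///
    ytitle("Predicted hlthstat (1 = excellent, 5 = poor)") ///
    title("Predicted self-rated health by race and sex")
```

Lines that are not parallel across race are the visual counterpart of a sex x race interaction.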

Continuous by categorical

Next, I will run a regression with a continuous (age) by categorical (diabetes status) interaction. The dependent variable is hlthstat, and the other variables are included as controls. As before, the ## operator creates both the main effects and the interaction effect; the c. prefix marks age as continuous and the i. prefix marks diabetes as categorical. For example, c.age##i.diabetes creates age, diabetes, and age x diabetes.

Stata
reg hlthstat c.age##i.diabetes i.region i.sex i.race

The interpretation of the coefficients is as follows:

  • The main effect of age is the change in hlthstat for a one-year increase in age among non-diabetics (the reference category), that is, the slope of the regression line for non-diabetics. The coefficient of -0.0246 means that hlthstat decreases by 0.0246 units for each year of age in that group, holding other variables constant (p < 0.001, which Stata displays as 0.000).
  • The main effect of diabetes is the difference in hlthstat between diabetics and non-diabetics when age equals 0, that is, the vertical shift of the regression line for diabetics at the intercept. The coefficient of -1.593 means that diabetics score 1.593 units lower than non-diabetics at age 0, holding other variables constant (p < 0.001). Because the sample consists of adults, this value is an extrapolation; the gap at realistic ages also depends on the interaction term.
  • The interaction effect of age x diabetes is the difference in the slope of the regression line between diabetics and non-diabetics, holding other variables constant. The coefficient of 0.014 means that the slope of age is 0.014 units higher (less negative) for diabetics than for non-diabetics. This effect is significant at the 5% level (p of about 0.02).
  • The coefficients of region are the differences in hlthstat between each region and NE (the reference category), holding other variables constant. The coefficients of -0.098, -0.321, and -0.223 mean that MW, S, and W score lower than NE by 0.098, 0.321, and 0.223 units, respectively. These effects are significant at the 5% level for MW (p of about 0.02) and at p < 0.001 for S and W.
  • The coefficient of sex is the difference in hlthstat between females and males (the reference category), holding other variables constant. The coefficient of -0.057 means that females score 0.057 units lower than males. This effect is significant only at the 10% level (p of about 0.08).
  • The coefficients of race are the differences in hlthstat between each race and white (the reference category), holding other variables constant. The coefficient of -0.472 means that blacks score 0.472 units lower than whites (p < 0.001). The coefficient of -0.083 means that others score 0.083 units lower than whites. This effect is not significant at the 5% level.
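The way the three coefficients combine can be written out explicitly. The derivation below uses only the coefficients reported above and treats the remaining covariates as a generic "controls" term; the symbols β0 to β3 and the age value a are notation introduced here for illustration.

```latex
\begin{align*}
\widehat{\text{hlthstat}} &= \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{diabetes}
  + \beta_3\,(\text{age}\times\text{diabetes}) + \text{controls} \\
\text{slope of age, non-diabetics} &= \beta_1 = -0.0246 \\
\text{slope of age, diabetics} &= \beta_1 + \beta_3 = -0.0246 + 0.014 = -0.0106 \\
\text{diabetic gap at age } a &= \beta_2 + \beta_3\,a = -1.593 + 0.014\,a \\
\text{gap at } a = 20 &= -1.593 + 0.280 = -1.313 \\
\text{gap at } a = 70 &= -1.593 + 0.980 = -0.613
\end{align*}
```

So the diabetes coefficient alone describes the gap only at age 0, while at ages 20 and 70 the implied gap is about -1.31 and -0.61, respectively.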

To see how the association between age and hlthstat differs by diabetes status, you can also run the regression separately for each subgroup and compare the age coefficients.

Stata
bys diabetes: reg hlthstat c.age i.region i.sex i.race

Prefixing the command with bys diabetes: makes Stata run the regression separately for each subgroup (note the number of observations in each regression table). The age coefficient is -0.025 for the non-diabetes group, compared with -0.011 for the diabetes group. Combined with the interaction term from the pooled model, you can report: “The association between age and self-rated health differs by diabetes status (β=0.014, p<0.05). Specifically, the age coefficient is larger in magnitude in the non-diabetes group (β=-0.025, p<0.001) than in the diabetes group (β=-0.011, p<0.05).”

For your information, the command “bys diabetes: reg hlthstat c.age i.region i.sex i.race” is equivalent to the following two commands. You can use whichever approach you prefer.

Stata
reg hlthstat c.age i.region i.sex i.race if diabetes==0
reg hlthstat c.age i.region i.sex i.race if diabetes==1

Plotting interactions

To visualize the interaction effect between age and diabetes, I will use the margins and marginsplot commands in Stata. The margins command calculates the predicted values of hlthstat at chosen values of age and diabetes; by default it averages these predictions over the observed values of the other covariates (add the atmeans option to hold them at their means instead). The marginsplot command plots these predicted values with confidence intervals.

First, you need to identify the minimum and maximum (range) of the variables in the interaction effect.

Stata
codebook diabetes
codebook age

Then, list the interaction variables in the at() option to get a prediction at each combination of their values. Running marginsplot right after margins graphs that output. For the categorical predictor, I recommend passing it to the by() option so that each category gets its own panel, which is much easier to interpret.

Stata
* age in nhanes2 runs from 20 to 74, so the grid stops at 70 to avoid extrapolating
margins, at(diabetes=(0 1) age=(20(10)70))
marginsplot, by(diabetes)
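For comparison, a common alternative is to name the categorical variable before the comma, which draws both groups as separate lines in a single panel. The sketch below also shows the atmeans variant; the graph names avg and atmeans are arbitrary, and the age grid assumes the observed range is roughly 20 to 74, which summarize age should confirm.

```stata
webuse nhanes2, clear
reg hlthstat c.age##i.diabetes i.region i.sex i.race

* Default: predictions averaged over the observed values of the controls
margins diabetes, at(age=(20(10)70))
marginsplot, name(avg, replace)

* atmeans: controls held at their sample means instead
margins diabetes, at(age=(20(10)70)) atmeans
marginsplot, name(atmeans, replace)
```

In a linear regression like this one the two sets of predictions coincide, so the graphs look the same; the distinction starts to matter with nonlinear models such as logit.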

The interpretation of the plot is as follows:

  • The plot shows the predicted values of hlthstat for different values of age and diabetes, with 95% confidence intervals.
  • With by(diabetes), non-diabetics and diabetics are drawn in separate panels. In the single-panel version without by(diabetes), the blue line represents the non-diabetics and the red line represents the diabetics.
  • The plot shows that hlthstat decreases with age for both groups, but the slope is steeper for the non-diabetics than for the diabetics. This means that the association between age and hlthstat is weaker for the diabetics, as indicated by the positive coefficient of age x diabetes in the regression.
  • The plot also shows that hlthstat is lower for the diabetics than for the non-diabetics at any given age, but the gap between the two groups narrows as age increases. The lower level reflects the negative coefficient of diabetes, and the narrowing gap reflects the positive coefficient of age x diabetes.
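The narrowing gap described above can also be estimated and plotted directly, rather than read off two lines. The following sketch asks margins for the diabetic versus non-diabetic difference at each age; the age grid again assumes an observed range of roughly 20 to 74.

```stata
webuse nhanes2, clear
reg hlthstat c.age##i.diabetes i.region i.sex i.race

* Difference in predicted hlthstat (diabetics minus non-diabetics) by age
margins, dydx(diabetes) at(age=(20(10)70))

* Plot the difference with a reference line at zero
marginsplot, yline(0)
```

Where a confidence interval excludes zero, the two groups differ significantly at that age.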

(Ref) marginsplot without by(diabetes)

Reference

  • October 9, 2023