[Stata] Two-way ANOVA (anova and margins)
In this post, I will show you how to perform a two-way analysis of variance (ANOVA) in Stata using the nhanes2 dataset. A two-way ANOVA compares the mean of a continuous outcome across groups defined by two categorical factors. For example, you can use a two-way ANOVA to test whether sex and race (the independent variables) interact in their effect on blood pressure (the dependent variable).
The nhanes2 dataset contains data from the second National Health and Nutrition Examination Survey (NHANES II), which was conducted from 1976 to 1980 in the United States. The dataset has 10,351 observations and 15 variables, including demographic, health, and nutrition variables. You can load the dataset from the Stata website by typing:
webuse nhanes2
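Before modeling, it can help to look at the three variables used in the rest of this post. The following is a minimal sketch using standard Stata commands; the choice of summaries is mine, not part of the original analysis.

```stata
* Load the dataset, replacing anything currently in memory
webuse nhanes2, clear

* Storage type and labels of the outcome and the two factors
describe bpsystol sex race

* Distribution of the outcome
summarize bpsystol

* Cell counts for each sex-by-race combination
tabulate sex race

* Raw cell means of bpsystol (no standard deviations or frequencies)
tabulate sex race, summarize(bpsystol) nostandard nofreq
```

Unequal cell counts in the cross-tabulation indicate an unbalanced design, which is worth knowing before reading the ANOVA table.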
Suppose we are interested in testing whether there is an interaction between sex and race on systolic blood pressure (bpsystol). Systolic blood pressure is the pressure in the arteries when the heart beats. The variable sex is binary, coded as 1 for male and 2 for female. The variable race is categorical, coded as 1 for white, 2 for black, and 3 for other.
The two-way ANOVA model
To perform a two-way ANOVA, we need to specify a linear model that includes the main effects of sex and race, as well as their interaction. The model can be written as:
bpsystol = β_0 + β_1 sex + β_2 race + β_3 (sex × race) + ϵ
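where ϵ is the error term, assumed to follow a normal distribution with mean zero and constant variance. Because race has three levels, β_2 and β_3 each stand for a set of coefficients on indicator variables rather than a single number. To fit the two-way ANOVA model in Stata, we can use the anova command. The syntax is: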
anova depvar factor1 factor2 factor1#factor2
where depvar is the dependent variable, factor1 and factor2 are the two factors, and factor1#factor2 is the interaction term. For our example, the command is:
anova bpsystol sex race sex#race
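The same model can also be written with Stata's factorial operator ##, which expands to both main effects plus their interaction. A minimal sketch, reloading the data so that it runs on its own:

```stata
webuse nhanes2, clear

* sex##race is shorthand for: sex race sex#race
anova bpsystol sex##race
```

Both forms fit the same model and give the same output, so the choice between them is a matter of taste.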
The output consists of two parts:
- The header shows the number of observations, the root mean squared error (Root MSE), the R-squared, and the adjusted R-squared.
- The analysis of variance table displays the sources of variation (model, sex, race, sex#race, residual, and total), their partial sums of squares (Partial SS), their degrees of freedom (df), their mean squares (MS), their F-statistics (F), and their p-values (Prob > F).
The anova command itself does not print estimated marginal means (also known as cell means or predicted means) or contrasts among them. These come from postestimation commands such as margins, contrast, and pwcompare.
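Estimated marginal means and comparisons among them can be requested explicitly after fitting the model. The sketch below assumes the model from this post; the Bonferroni adjustment is my own illustrative choice, not something the original analysis specifies.

```stata
webuse nhanes2, clear
anova bpsystol sex race sex#race

* Predicted mean of bpsystol for each sex, each race, and each sex-by-race cell
margins sex race sex#race

* Pairwise comparisons of the race marginal means, reporting differences,
* standard errors, t-statistics, and Bonferroni-adjusted p-values
pwcompare race, effects mcompare(bonferroni)
```

Neither command replaces the stored anova results, so further postestimation commands can follow without refitting.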
The two-way ANOVA results
To interpret the results of the two-way ANOVA, we need to look at the p-values of the F-statistics for each source of variation. The null hypotheses for each source are:
- H0: There is no main effect of sex on bpsystol. (p=0.0247)
- H0: There is no main effect of race on bpsystol. (p=0.0001)
- H0: There is no interaction effect between sex and race on bpsystol. (p=0.0113)
The alternative hypotheses are the negations of the null hypotheses. To test these hypotheses, we use a significance level of 0.05. Therefore, we reject a null hypothesis if its p-value is less than 0.05.
From the output, we can see that:
- The p-value for sex is 0.0247, which is less than 0.05. This means that we reject the null hypothesis that there is no main effect of sex on bpsystol. We conclude that there is a significant difference in bpsystol between males and females.
- The p-value for race is 0.0001, which is less than 0.05. This means that we reject the null hypothesis that there is no main effect of race on bpsystol. We conclude that there is a significant difference in bpsystol among different races.
- The p-value for sex#race is 0.0113, which is less than 0.05. This means that we reject the null hypothesis that there is no interaction effect between sex and race on bpsystol. We conclude that there is a significant interaction between sex and race on bpsystol.
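When the interaction is significant, a common follow-up is to test simple effects, that is, the effect of one factor within each level of the other. A hedged sketch using the contrast postestimation command with its @ operator (this step is an addition and is not part of the original analysis):

```stata
webuse nhanes2, clear
anova bpsystol sex race sex#race

* Simple effects: test the effect of sex separately within each level of race
contrast sex@race

* Simple effects the other way: the effect of race within each sex
contrast race@sex
```

Each command reports one test per level of the conditioning factor plus a joint test, which shows where the sex or race differences are concentrated.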
The interaction effect means that the effect of sex on bpsystol depends on the level of race, and vice versa. To visualize the interaction effect, we can plot the estimated marginal means for each combination of sex and race. We can use the margins and marginsplot commands after the anova command to do this. The syntax is:
margins factor1#factor2
marginsplot, by(factor1 factor2)
The output of the command is shown below:
The plot shows the estimated marginal means of bpsystol for each combination of sex and race, along with their 95% confidence intervals. These results indicate that the effect of sex on bpsystol is not the same for every race, and that the effect of race on bpsystol is not the same for both sexes. This is what we mean by an interaction effect.
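For the example in this post, one possible way to draw such a plot is sketched below. It is not necessarily the exact command behind the plot described above; placing race on the x-axis through the xdimension() option is my own choice of layout.

```stata
webuse nhanes2, clear
anova bpsystol sex race sex#race

* Estimated marginal mean of bpsystol for each sex-by-race cell
margins sex#race

* Put race on the x-axis and draw one line per sex;
* non-parallel lines suggest an interaction
marginsplot, xdimension(race)
```

Swapping to xdimension(sex) gives one line per race instead, which can make the race differences within each sex easier to read.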