[Stata] Multinomial Logistic Regression: mlogit, mlogtest
Multinomial logistic regression is a method for modeling categorical outcomes with more than two levels. It allows us to estimate the probability of each outcome as a function of some predictor variables, and to test hypotheses about the effects of these variables.
In this blog post, I will use the nhanes2 dataset from Stata, which contains data on health and nutrition of a sample of US adults. I will use the variable region as the dependent variable, which has four categories: Northeast, Midwest, South, and West. I will use the variables age, race, sex, and rural as the predictors.
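Before fitting anything, it helps to look at the outcome and the predictors. The lines below are a minimal sketch, assuming an internet connection (webuse downloads the example data) and that the variables are named exactly as in this post:

```stata
* Load the NHANES II example data from Stata's web repository
webuse nhanes2, clear

* Distribution of the outcome: the most frequent category becomes
* the default base outcome in mlogit
tabulate region

* Quick look at the predictors
summarize age
tabulate race
tabulate sex
tabulate rural
```

The tabulate region output is worth checking first, because it tells you which region Stata will pick as the reference category if you do not set one yourself.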
Step 1: Use the mlogit command to regress your multicategory dependent variable on your predictors
The mlogit command in Stata fits a multinomial logistic regression model, also known as a polytomous logit model. The syntax is:
mlogit depvar indepvars, baseoutcome(#)
where depvar is the categorical outcome variable, indepvars are the predictor variables, and anything after the comma is an option. One option is rrr, which tells Stata to report exponentiated coefficients (relative risk ratios) instead of log odds. Another is baseoutcome(#), which specifies the value of depvar to use as the base (reference) category. By default, Stata uses the most frequent category as the base.
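For reference, the model that mlogit fits can be written as follows. This is the standard multinomial logit form; the notation (x for the vector of predictors, beta_j for the coefficients of outcome j, b for the base outcome) is my own shorthand and does not appear in Stata's output.

```latex
\ln\frac{\Pr(y = j \mid \mathbf{x})}{\Pr(y = b \mid \mathbf{x})}
  = \mathbf{x}\boldsymbol{\beta}_j ,
\qquad
\Pr(y = j \mid \mathbf{x})
  = \frac{\exp(\mathbf{x}\boldsymbol{\beta}_j)}{\sum_{k}\exp(\mathbf{x}\boldsymbol{\beta}_k)} ,
\qquad
\boldsymbol{\beta}_b = \mathbf{0}
```

Each non-base outcome gets its own set of coefficients, and exponentiating a coefficient gives the relative risk ratio that the rrr option reports.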
Without the rrr option, each coefficient is the change in the log odds of being in a given outcome category rather than the reference category for a one-unit increase in the predictor, holding all other variables constant. Let's interpret the results for the Northeast (NE) region on the log-odds scale. Please note, however, that raw coefficients (log odds) are rarely interpreted in papers. In practice we report and interpret the exponentiated coefficients instead (odds ratios, which mlogit labels relative risk ratios).
- Age: The coefficient of -0.0008355 for age suggests that each additional year of age slightly decreases the log odds of the outcome, but this is not statistically significant (p = 0.629), indicating that age may not have a meaningful impact on the outcome in this model.
- Race:
  - Black: With a coefficient of -1.869474, being Black significantly decreases the log odds of the outcome compared to the reference race group (p < 0.001). This indicates a strong negative association between being Black and the likelihood of the outcome.
  - Other: A coefficient of -0.7457402 suggests that being in the “Other” race category decreases the log odds of the outcome compared to the reference race group, and this effect is statistically significant (p = 0.047).
- Sex (Female): The coefficient of -0.1072987 for females indicates that being female slightly decreases the log odds of the outcome compared to males (the likely reference group), but this result is only marginally significant (p = 0.071).
- Rural (Rural): A coefficient of -1.12706 for rural residents suggests that living in a rural area significantly decreases the log odds of the outcome compared to non-rural residents (p < 0.001).
To get the relative risk ratio (RRR), the command is:
mlogit region age i.race i.sex i.rural, rrr
This tells Stata to use region as the dependent variable, to include age as a continuous predictor, and to include race, sex, and rural as categorical predictors. The i. prefix marks a variable as a factor variable, so Stata creates an indicator (dummy) variable for each level except the reference level. The rrr option tells Stata to report the relative risk ratios.
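The options described above can be combined. The sketch below assumes that region is coded 1 for the Northeast and that race has a level 2; check both with tabulate (using the nolabel option) before relying on it.

```stata
webuse nhanes2, clear

* Fit on the log-odds scale, choosing the base outcome explicitly.
* Assumption: 1 is the code for the Northeast in region.
mlogit region age i.race i.sex i.rural, baseoutcome(1)

* Replay the same fit as relative risk ratios without re-estimating
mlogit, rrr

* ib#. changes the reference level of a factor variable
* (assumption: race has a level coded 2)
mlogit region age ib2.race i.sex i.rural, rrr baseoutcome(1)
```

Typing mlogit, rrr after a fit only changes how the results are displayed, so the log-odds and RRR tables always come from the same model.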
The output shows the relative risk ratios for each outcome category compared to the base category, which by default is the most frequent outcome. A relative risk ratio is the factor by which the ratio of the probability of a given category to the probability of the base category is multiplied for a one-unit increase in the predictor, holding other variables constant.
- The output also shows the standard errors, z-statistics, p-values, and 95% confidence intervals for each relative risk ratio. These are used to test the significance of the effects of the predictor variables.
- The header of the output shows the log likelihood, the LR chi-square statistic, its p-value, and the pseudo R-squared for the overall model fit. The log likelihood is the value of the log likelihood function at the estimated coefficients.
- The LR chi-square statistic is a likelihood-ratio test of the joint significance of all the coefficients in the model, excluding the intercepts. (Stata reports a Wald chi-square instead when robust or clustered standard errors are requested.)
- The p-value is the probability of obtaining a chi-square statistic as large as or larger than the observed one under the null hypothesis that all of these coefficients are zero.
- The pseudo R-squared (McFadden's) compares the log likelihood of the fitted model with that of a model with no predictors. It ranges from 0 to 1, with higher values indicating better fit.
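The overall fit statistics in the header can be reproduced from the results that mlogit stores. This is a minimal sketch, assuming the default (non-robust) variance estimator, under which the chi-square in the header is a likelihood-ratio statistic:

```stata
webuse nhanes2, clear
quietly mlogit region age i.race i.sex i.rural

* e(ll) is the fitted log likelihood, e(ll_0) the constant-only one
display "Log likelihood (fitted)        = " e(ll)
display "Log likelihood (constant only) = " e(ll_0)

* Chi-square in the header: twice the gain in log likelihood
display "LR chi2   = 2*(ll - ll_0) = " 2*(e(ll) - e(ll_0))

* McFadden's pseudo R-squared: one minus the ratio of log likelihoods
display "Pseudo R2 = 1 - ll/ll_0   = " 1 - e(ll)/e(ll_0)
```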
Interpretation of RRR
- Age: The RRR of 0.9991648 for age suggests that each additional year of age slightly decreases the relative risk of the outcome (RRR < 1), but this is not statistically significant (p = 0.629), indicating that age may not have a meaningful impact on the outcome in this model.
- Race:
  - Black: With an RRR of 0.1542048, being Black decreases the relative risk of the outcome by about 84.6% compared to the reference race group (RRR < 1, p < 0.001), holding other variables constant.
  - Other: An RRR of 0.474383 suggests that being in the “Other” race category decreases the relative risk of the outcome by about 52.6% compared to the reference race group, and this is statistically significant (p = 0.047).
- Sex (Female): The RRR of 0.8982573 for females indicates that being female decreases the relative risk of the outcome by about 10.2% compared to males (the likely reference group), but this result is only marginally significant (p = 0.071).
- Rural (Rural): An RRR of 0.3239843 for rural residents suggests that living in a rural area decreases the relative risk of the outcome by about 67.6% compared to non-rural residents, with high statistical significance (p < 0.001).
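The RRRs and percentages above are just transformations of the log-odds coefficients: RRR = exp(b), and the percent change is 100 * (RRR - 1). A small sketch using two of the coefficients quoted in this post (the scalar names are my own):

```stata
* Log-odds coefficients for the Northeast equation, copied from the text
scalar b_black = -1.869474
scalar b_rural = -1.12706

* Exponentiate to get the relative risk ratios
scalar rrr_black = exp(b_black)
scalar rrr_rural = exp(b_rural)

* Percent change in the relative risk: 100 * (RRR - 1)
display "RRR, Black: " %9.7f rrr_black "   percent change: " %6.1f 100*(rrr_black - 1)
display "RRR, rural: " %9.7f rrr_rural "   percent change: " %6.1f 100*(rrr_rural - 1)
```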
Interpretation of Relative Risk Ratio vs. Odds Ratio
Relative Risk (RR): Relative Risk, also known as Risk Ratio, compares the probability of an event occurring in two different groups. It’s a measure of how much the risk of the event (like developing a disease) in the exposed group compares to the risk in the unexposed group.
- Example: Imagine a study looking at the risk of developing lung cancer among smokers compared to non-smokers. If the risk of lung cancer in smokers is 20% and in non-smokers is 5%, the RR would be: 0.20/0.05 = 4.
- This means that smokers have 4 times the risk of developing lung cancer compared to non-smokers.
- Risk Ratio (RR) is generally more intuitive because it describes the relative risk directly. For example, an RR of 2 means that the exposed group is twice as likely to experience the event compared to the unexposed group. This direct interpretation of risk makes RR particularly appealing in public health and clinical research, where clear communication of risk is essential.
- Cohort Studies and Randomized Controlled Trials (RCTs) often report RRs because these studies follow groups over time and directly measure the incidence of outcomes, allowing for a direct calculation of risk.
Odds Ratio (OR): Odds Ratio, on the other hand, compares the odds of an event occurring in one group to the odds of it occurring in another group. It’s often used in case-control studies where RR cannot be directly calculated. Odds are calculated as the probability of the event occurring divided by the probability of the event not occurring.
- Example: Using the same scenario as above, let’s calculate the odds of developing lung cancer for smokers and non-smokers. If the odds in smokers are 1 to 4 (meaning for every one smoker who develops lung cancer, four do not), and in non-smokers the odds are 1 to 19, the OR would be: OR = (1/4) / (1/19) = 19/4 = 4.75
- This means that the odds of developing lung cancer are 4.75 times as high in smokers as in non-smokers.
- Odds Ratio (OR), while useful, can be less intuitive because it compares the odds of an event occurring rather than the direct risk. The OR can significantly overestimate the risk, especially when the outcome is common, which might lead to misinterpretation among non-specialists.
- Case-Control Studies typically use ORs because these studies start with the outcome and look backward to assess exposure. In such designs, the true risk or incidence rate is not directly measurable, making ORs a practical alternative.
When the outcome of interest is rare, ORs can approximate RRs closely, making the distinction less critical. However, as the outcome becomes more common, the OR increasingly overestimates the risk compared to the RR, making RRs more desirable for accurately conveying risk in these scenarios.
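The lung cancer example can be checked with a few lines of arithmetic. The first pair of risks comes from the example above; the second, much rarer pair is a hypothetical I added to show the OR moving toward the RR. All scalar names are my own.

```stata
* Risks from the example: 20% among smokers, 5% among non-smokers
scalar p_exposed   = 0.20
scalar p_unexposed = 0.05

scalar riskratio = p_exposed / p_unexposed
scalar oddsratio = (p_exposed / (1 - p_exposed)) / (p_unexposed / (1 - p_unexposed))
display "Common outcome: RR = " riskratio "   OR = " oddsratio

* Hypothetical rare outcome with the same risk ratio
scalar q_exposed   = 0.002
scalar q_unexposed = 0.0005

scalar riskratio_rare = q_exposed / q_unexposed
scalar oddsratio_rare = (q_exposed / (1 - q_exposed)) / (q_unexposed / (1 - q_unexposed))
display "Rare outcome:   RR = " riskratio_rare "   OR = " oddsratio_rare
```

With the common outcome the OR is noticeably larger than the RR, while with the rare outcome the two are nearly identical.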
Step 2. Computing Marginal Effects: mchange command
Then, you can use the margins or mchange command to present the marginal effects of your predictors of interest. The mchange command (user-written, part of the SPost13 package) calculates the change in the predicted probability of each outcome for a change in one or more explanatory variables, holding other variables constant.
mchange varname
The output shows the average marginal effects for all outcome categories of the multinomial logistic regression.
- For a 1 unit increase in age (in an example model whose outcome has four categories: Don’t know, App, Website, Both app and website):
  - Don’t know: The predicted probability increases by 0.003 (p-value = 0.000), indicating a statistically significant increase.
  - App: The predicted probability decreases by 0.001 (p-value = 0.000), indicating a statistically significant decrease.
  - Website: The change is 0.000 (p-value = 0.867), suggesting no significant change in the predicted probability.
  - Both app and website: The predicted probability decreases by 0.002 (p-value = 0.000), indicating a statistically significant decrease.
- Pr(y|base): These are the average predicted probabilities of each outcome category before any change in age. By default, mchange averages the predictions over all observations in the estimation sample, with the predictors held at their observed values. With the atmeans option, the predictions are instead computed at the means of the predictors. These predictions give us a baseline against which we can compare the effects of changes in predictor variables.
  - Don’t know: 37.4%
  - App: 10.9%
  - Website: 32.4%
  - Both app and website: 19.4%
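For the nhanes2 model used in the rest of this post, the two commands mentioned in this step could be run as sketched below. I am assuming that SPost13 is installed (mchange is not built in) and that 1 is a valid value of region; margins is shown for a single outcome and would need to be repeated for the others.

```stata
webuse nhanes2, clear
quietly mlogit region age i.race i.sex i.rural

* Built-in: average marginal effect of age on Pr(region == 1)
margins, dydx(age) predict(outcome(1))

* SPost13: change in every outcome probability for a one-unit
* increase in age, averaged over the estimation sample
mchange age, amount(one)

* Same change, evaluated at the means of the predictors instead
mchange age, amount(one) atmeans
```

Because the outcome probabilities must sum to one, the changes reported across the four regions should cancel out, which is a handy sanity check on the output.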
Step 3. Computing odds ratios: listcoef command
Although the mlogit command in Stata does not support the or option, we can compute the factor change in the odds for a unit increase in each variable using the user-written listcoef command.
listcoef, help
- b = raw coefficient
- z = z-score for test of b=0
- P>|z| = p-value for z-test
- e^b = exp(b) = factor change in odds for unit increase in X (odds ratio)
- e^bStdX = exp(b*SD of X) = change in odds for SD increase in X
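As a sketch of how this might look for the nhanes2 model, assuming SPost13 is installed and that restricting the output to one variable keeps the table manageable:

```stata
webuse nhanes2, clear
quietly mlogit region age i.race i.sex i.rural

* Factor change in the odds for age, for every pair of outcomes,
* with the column legend printed underneath
listcoef age, help

* The same contrasts expressed as percent change in the odds
listcoef age, percent
```

Unlike mlogit, rrr, which only compares each outcome with the base category, listcoef lists the contrasts between all pairs of outcomes.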
Step 4. Model Testing: mlogtest command
To test whether each independent variable contributes significantly to the model, we can conduct LR and Wald tests. After running mlogit, use the user-written mlogtest command to run these tests for each predictor.
search mlogtest // need to install the package first
mlogtest, wald // Wald tests
mlogtest, lr // LR tests
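If you cannot install mlogtest, the same kind of test can be run for one variable at a time with built-in commands. This is a sketch for age only, not a replacement for the full mlogtest table; the names full and noage are my own.

```stata
webuse nhanes2, clear

* Wald test: age has no effect in any of the outcome equations
quietly mlogit region age i.race i.sex i.rural
test age
estimates store full

* LR test of the same hypothesis: refit without age on the same sample
quietly mlogit region i.race i.sex i.rural if !missing(age)
estimates store noage
lrtest full noage
```

Both tests have three degrees of freedom here, one for each non-base region.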
Testing IIA Assumption
IIA is a critical assumption when using multinomial logit models. It states that the odds of preferring one choice over another do not depend on the presence or absence of other “irrelevant” alternatives. In simpler terms, the relative preferences between options remain consistent, regardless of other choices available.
- Suppose we’re studying people’s preferences for living in different regions: Northeast (NE), Midwest (MW), South (S), and West (W). Each person selects one region to live in. The IIA assumption implies that if someone prefers the Midwest (MW) over the Northeast (NE), their preference should remain the same even if we introduce a new option (say, the South or West).
- If IIA is violated, it means that the introduction of new alternatives affects people’s preferences. For instance, if adding the South (S) as an option suddenly makes more people prefer the Midwest (MW) over the Northeast (NE), then IIA is violated.
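One way to see where IIA comes from: in the multinomial logit model, the odds between any two outcomes j and k depend only on the coefficients of those two outcomes. A sketch in my own notation (x for the predictors, beta_j for the coefficients of outcome j):

```latex
\frac{\Pr(y = j \mid \mathbf{x})}{\Pr(y = k \mid \mathbf{x})}
  = \frac{\exp(\mathbf{x}\boldsymbol{\beta}_j)}{\exp(\mathbf{x}\boldsymbol{\beta}_k)}
  = \exp\!\big(\mathbf{x}(\boldsymbol{\beta}_j - \boldsymbol{\beta}_k)\big)
```

No other alternative appears on the right-hand side, so the model implies that adding or removing a third option leaves these odds unchanged.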
We can test the IIA assumption using the mlogtest command. The Small-Hsiao test splits the sample at random, so first set a random seed with the set seed command to make the results reproducible. Then run the mlogtest, hausman and mlogtest, smhsiao commands.
set seed 153456
mlogtest, hausman
mlogtest, smhsiao
mlogtest, iia // runs all IIA tests at once
According to the results in Small-Hsiao Tests of IIA Assumption, all regions (NE, MW, S, W) have p-values well above 0.05, indicating no evidence against the IIA assumption. This means the odds of choosing one outcome over another are independent of the presence of other alternatives.
Regarding Hausman Tests of IIA Assumption results, the negative chi-square values for NE and S indicate that the model does not meet asymptotic assumptions for these cases. For MW and W, the chi-square values are not significant (p-values of 1.000 and 0.988, respectively), suggesting no evidence against the IIA assumption.
- If your P > Chi2 values are significant (p < .05), you are violating the IIA assumption.
  - However, depending on your sample size, there is a debate on the utility/relevance of the IIA assumption test. You can see this post and cite it:
    - Allison, P. (2012). How relevant is the independence of irrelevant alternatives? Statistical Horizons.
  - You can also consider another model, such as the “mixed logit” model (MXL), that relaxes the IIA assumption. The user-written mixlogit command allows you to implement it. Please see this study as an example and this article for more information on the mixed logit model.
- If your P > Chi2 values are NOT significant (p > .05), you are NOT violating the assumption and can move forward with your multinomial logit.
  - If some of your test statistics are negative, this is also taken as evidence that the IIA assumption has not been violated (Hausman and McFadden, 1984, p. 1226).
Tip. Troubleshooting the mlogtest command
It seems that mlogtest, hausman and mlogtest, smhsiao do not work when the mlogit command includes an if condition. In other words, if you run mlogit with an if condition, these tests return an invalid syntax error. I guess this is a bug in the package, so make sure not to use an if condition when you plan to run mlogtest, hausman or mlogtest, smhsiao. Instead, you can drop the excluded observations before running mlogit.
Further, if the smhsiao test returns an error such as “basecategory not found,” you can solve the problem by creating a new outcome variable with egen newvar = group(varname).
egen outcome= group(depvar)
mlogit outcome independentvars
mlogtest, smhsiao
Please find this statalist post regarding this error.
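For the if-condition problem described in this tip, one way to drop the sample temporarily is to wrap the analysis in preserve and restore. This is a sketch: the age cutoff is a made-up subsample for illustration, and mlogtest requires SPost13.

```stata
webuse nhanes2, clear

preserve
keep if age >= 40        // hypothetical subsample, for illustration only
mlogit region age i.race i.sex i.rural
set seed 153456
mlogtest, hausman
mlogtest, smhsiao
restore                  // bring back the full dataset
```

This way mlogit itself runs without an if condition, and the full data are back in memory once the tests are done.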
Step 5. Model Fit Statistics: fitstat command
We can use the fitstat command to examine the overall model fit.
fitstat
- Log-likelihoods and Chi-square provide a basis for comparing models, with the chi-square test indicating the model is significantly better than an intercept-only model.
- R-squared values (McFadden, Cox-Snell/ML, etc.) offer insight into how much the predictors improve the fit, which appears to be relatively low here (McFadden’s R2 is 0.034). Note that pseudo R-squared values compare the fitted model with an intercept-only model and are not the proportion of variance explained, as R-squared is in linear regression.
- Information Criteria (AIC, BIC) help in model selection across different models, where lower values generally indicate a better fit after accounting for the complexity of the model. They cannot be interpreted for a single model in isolation.
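Since the information criteria only make sense in comparison, here is a sketch that compares two candidate models with built-in commands. The choice of dropping age is arbitrary, and the names full and noage are my own.

```stata
webuse nhanes2, clear

quietly mlogit region age i.race i.sex i.rural
estimates store full

* Same sample, but without age
quietly mlogit region i.race i.sex i.rural if !missing(age)
estimates store noage

* AIC and BIC side by side; lower values are preferred
estimates stats full noage
```

The fitstat command reports a longer list of measures, but the AIC and BIC columns here are enough for a quick comparison.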
Reference
Multinomial Logistic Regression | Stata Data Analysis Examples (ucla.edu)
multinom_st.pdf (washington.edu)
Interpreting multinomial logistic regression in Stata – BAILEY DEBARMORE
Thanks for your post.
Is there an alternative test when the IIA assumption is violated?
Could gologit2 be an alternative?
Thanks
You can either cite a study that questions the relevance of the IIA test (for example, in small samples) or consider using mixlogit for the mixed logit model. I updated the post with more information on it! gologit2 is an alternative for the case when you violate the parallel-lines (proportional odds) assumption for ologit.