[Stata] Multinomial Logistic Regression: mlogit, mlogtest

Multinomial logistic regression is a method for modeling categorical outcomes with more than two levels. It allows us to estimate the probability of each outcome as a function of some predictor variables, and to test hypotheses about the effects of these variables.

In this blog post, I will use the nhanes2 dataset from Stata, which contains health and nutrition data on a sample of US adults. The dependent variable is region, which has four categories: Northeast, Midwest, South, and West. The predictors are age, race, sex, and rural.
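
The dataset ships with Stata and can be loaded directly over the web:

Stata
webuse nhanes2, clear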

Step 1. Regressing your multicategory dependent variable on your predictors: mlogit command

The mlogit command in Stata fits a multinomial logistic regression model, also known as a polytomous logit model. The syntax is:

Stata
mlogit depvar indepvars [, options]

where depvar is the categorical outcome variable, indepvars are the predictor variables, and options modify how the model is fit or reported. One option is rrr, which tells Stata to report the coefficients as relative risk ratios instead of log odds. Another is baseoutcome(#), which specifies the value of depvar to use as the base (reference) category; by default, Stata chooses the most frequent category.
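
For example (a minimal sketch, assuming region is coded 1 = NE, 2 = MW, 3 = S, 4 = W):

Stata
mlogit region age i.race i.sex i.rural                 // default base: most frequent category
mlogit region age i.race i.sex i.rural, baseoutcome(1) // force the Northeast to be the base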

Without the rrr option, each coefficient represents the change in the log odds of being in that outcome category rather than the reference category for a one-unit increase in the predictor, holding all other variables constant. Let’s interpret the specific results for the Northeast (NE) equation on the log-odds scale. Note, however, that papers rarely interpret raw coefficients/log odds, so in practice we need to stick with interpreting the ratio measures (RRRs/ORs).

  • Age: The coefficient of -0.0008355 for age suggests that as age increases by one year, the log odds of the outcome slightly decrease, but this is not statistically significant (p = 0.629), indicating that age may not have a meaningful impact on the outcome in this model.
  • Race:
    • Black: With a coefficient of -1.869474, being Black significantly decreases the log odds of the outcome compared to the reference race group, with a very significant p-value (p < 0.001). This indicates a strong negative association between being Black and the likelihood of the outcome.
    • Other: A coefficient of -0.7457402 suggests that being in the “Other” race category decreases the log odds of the outcome compared to the reference race group, with this effect being statistically significant (p = 0.047).
  • Sex (Female): The coefficient of -0.1072987 for females indicates that being female slightly decreases the log odds of the outcome compared to males (the likely reference group), though this result is marginally significant (p = 0.071).
  • Rural: A coefficient of -1.12706 for rural residents suggests that living in a rural area significantly decreases the log odds of the outcome compared to non-rural residents, with a very significant p-value (p < 0.001).

To get the relative risk ratio (RRR), the command is:

Stata
mlogit region age i.race i.sex i.rural, rrr

This tells Stata to use region as the dependent variable, with age as a continuous predictor and race, sex, and rural as categorical predictors. The i. prefix marks the categorical variables as factor variables, so Stata creates dummy variables for each level. The rrr option tells Stata to report relative risk ratios.

The output shows the relative risk ratios for each outcome category compared to the base category (by default, the most frequent one). The relative risk ratio is the ratio of the probability of choosing a given category over the probability of choosing the base category, and it shows how that ratio changes for a one-unit increase in the predictor variable, holding other variables constant.

  • The output also shows the standard error, z-statistic, p-value, and 95% confidence interval for each relative risk ratio; these are used to test the significance of each predictor’s effect.
  • The header reports the log likelihood, the Wald chi-square statistic, its p-value, and the pseudo R-squared for overall model fit. The log likelihood is the value of the log-likelihood function at the estimated coefficients.
    • The Wald chi-square statistic is a test of the joint significance of all the coefficients in the model, excluding the intercepts.
    • The p-value is the probability of obtaining a Wald chi-square statistic as large or larger than the observed one, under the null hypothesis that all the coefficients are zero.
    • The pseudo R-squared is a measure of how well the model fits the data, compared to a model with no predictors. It ranges from 0 to 1, with higher values indicating better fit.

Interpretation of RRR

  • Age: The RRR of 0.9991648 for age suggests that as age increases by one year, the risk of the outcome slightly decreases (RRR < 1), but this is not statistically significant (p = 0.629), indicating that age may not have a meaningful impact on the outcome in this model.
  • Race:
    • Black: With an RRR of 0.1542048, being Black significantly decreases the relative risk of the outcome by about 84.6% compared to the reference race group (RRR < 1, p < 0.001), holding other variables constant.
    • Other: An RRR of 0.474383 suggests that being in the “Other” race category decreases the risk of the outcome by about 52.6% compared to the reference race group, and this is statistically significant (p = 0.047).
  • Sex (Female): The RRR of 0.8982573 for females indicates that being female decreases the risk of the outcome by about 10.2% compared to males (the likely reference group), though this result is marginally significant (p = 0.071).
  • Rural: An RRR of 0.3239843 for rural residents suggests that living in a rural area decreases the relative risk of the outcome by about 67.6% compared to non-rural residents, with high statistical significance (p < 0.001).
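
As a quick check, each RRR is simply the exponentiated log-odds coefficient from Step 1:

Stata
display exp(-1.869474) // ≈ .1542048, the RRR reported for Black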

Interpretation of Relative Risk Ratio vs. Odds Ratio

Relative Risk (RR): Relative Risk, also known as Risk Ratio, compares the probability of an event occurring in two different groups. It’s a measure of how much the risk of the event (like developing a disease) in the exposed group compares to the risk in the unexposed group.

  • Example: Imagine a study looking at the risk of developing lung cancer among smokers compared to non-smokers. If the risk of lung cancer in smokers is 20% and in non-smokers is 5%, the RR would be: 0.20/0.05 = 4.
  • This means that smokers have 4 times the risk of developing lung cancer compared to non-smokers.
  • Risk Ratio (RR) is generally more intuitive because it describes the relative risk directly. For example, an RR of 2 means that the exposed group is twice as likely to experience the event compared to the unexposed group. This direct interpretation of risk makes RR particularly appealing in public health and clinical research, where clear communication of risk is essential.
  • Cohort Studies and Randomized Controlled Trials (RCTs) often report RRs because these studies follow groups over time and directly measure the incidence of outcomes, allowing for a direct calculation of risk.

Odds Ratio (OR): Odds Ratio, on the other hand, compares the odds of an event occurring in one group to the odds of it occurring in another group. It’s often used in case-control studies where RR cannot be directly calculated. Odds are calculated as the probability of the event occurring divided by the probability of the event not occurring.

  • Example: Using the same scenario as above, let’s calculate the odds of developing lung cancer for smokers and non-smokers. If the odds in smokers are 1 to 4 (meaning for every one smoker who develops lung cancer, four do not), and in non-smokers the odds are 1 to 19, the OR would be: OR = (1/4) / (1/19) = 19/4 = 4.75.
  • This suggests that the odds of developing lung cancer are 4.75 times higher in smokers compared to non-smokers.
  • Odds Ratio (OR), while useful, can be less intuitive because it compares the odds of an event occurring rather than the direct risk. The OR can significantly overestimate the risk, especially when the outcome is common, which might lead to misinterpretation among non-specialists.
  • Case-Control Studies typically use ORs because these studies start with the outcome and look backward to assess exposure. In such designs, the true risk or incidence rate is not directly measurable, making ORs a practical alternative.

When the outcome of interest is rare, ORs can approximate RRs closely, making the distinction less critical. However, as the outcome becomes more common, the OR increasingly overestimates the risk compared to the RR, making RRs more desirable for accurately conveying risk in these scenarios.
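
Since odds are p/(1-p), both measures from the example above can be verified with a line of arithmetic:

Stata
display "RR = " 0.20/0.05               // 4
display "OR = " (0.20/0.80)/(0.05/0.95) // 4.75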

Step 2. Computing Marginal Effects: mchange command

Then, you can use the margins or mchange command to present the marginal effects of your predictors of interest (a margins-based alternative is sketched after the interpretation below). The user-written mchange command, part of Long and Freese’s SPost13 package, calculates the marginal change in the predicted probability of each outcome category for a change in one or more explanatory variables, holding other variables constant.

Stata
mchange varname

The output shows the average marginal effects for every outcome category in the multinomial logistic regression. (Note that the example output interpreted below comes from a different model than the region example, with a four-category outcome: Don’t know, App, Website, and Both app and website.)

  • For a 1 unit increase in age:
    • Don’t know: The predicted probability increases by 0.003 (p-value = 0.000), indicating a statistically significant increase.
    • App: The predicted probability decreases by 0.001 (p-value = 0.000), indicating a statistically significant decrease.
    • Website: The change is 0.000 (p-value = 0.867), suggesting no significant change in the predicted probability.
    • Both app and website: The predicted probability decreases by 0.002 (p-value = 0.000), indicating a statistically significant decrease.
  • Pr(y|base): These are the base probabilities for each outcome category without considering the change in age. They represent the model’s predictions for the average individual in the dataset. Average predictions are essentially the predicted probabilities of each outcome category when the predictor variables are set to their average values (or baseline levels in categorical cases). These predictions give us a baseline scenario against which we can compare the effects of changes in predictor variables.
    • Don’t know: 37.4%
    • App: 10.9%
    • Website: 32.4%
    • Both app and website: 19.4%
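
If you prefer built-in commands, margins can produce comparable average marginal effects one outcome category at a time (a sketch; the outcome numbers depend on how your dependent variable is coded):

Stata
quietly mlogit region age i.race i.sex i.rural
margins, dydx(age) predict(outcome(1)) // AME of age on Pr(outcome 1)
margins, dydx(age) predict(outcome(2)) // AME of age on Pr(outcome 2)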

Step 3. Computing Odds Ratios: listcoef command

Although the mlogit command does not support an or option in its output, we can compute the factor change in the odds for a unit increase in each variable using the user-written listcoef command.

Stata
listcoef, help
  • b = raw coefficient
  • z = z-score for test of b=0
  • P>|z| = p-value for z-test
  • e^b = exp(b) = factor change in odds for unit increase in X (odds ratio)
  • e^bStdX = exp(b*SD of X) = change in odds for SD increase in X
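
listcoef also accepts a percent option, which reports the percent change in the odds rather than the factor change (per the SPost documentation):

Stata
listcoef, percent help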

Step 4. Model Testing: mlogtest command

To test whether each independent variable contributes significantly to the model, we can conduct likelihood-ratio (LR) and Wald tests. After fitting the model with mlogit, use the user-written mlogtest command to run these tests on the predictors of interest.

Stata
search mlogtest // install the package first
mlogtest        // Wald tests (default)
mlogtest, lr    // likelihood-ratio tests

Testing the IIA Assumption

The independence of irrelevant alternatives (IIA) is a critical assumption of multinomial logit models. It states that the odds of preferring one choice over another do not depend on the presence or absence of other “irrelevant” alternatives. In simpler terms, the relative preference between any two options remains the same regardless of what other choices are available.

  • Suppose we’re studying people’s preferences for living in different regions: Northeast (NE), Midwest (MW), South (S), and West (W). Each person selects one region to live in. The IIA assumption implies that if someone prefers the Midwest (MW) over the Northeast (NE), their preference should remain the same even if we introduce a new option (say, the South or West).
  • If IIA is violated, it means that the introduction of new alternatives affects people’s preferences. For instance, if adding the South (S) as an option suddenly makes more people prefer the Midwest (MW) over the Northeast (NE), then IIA is violated.

We can test the IIA assumption using the mlogtest command. Because the Small-Hsiao test randomly splits the sample, first set a random seed with the set seed command so the results are reproducible; with the same seed, you can then run the mlogtest, hausman and mlogtest, smhsiao commands.

Stata
set seed 153456
mlogtest, hausman
mlogtest, smhsiao
mlogtest, iia // runs all IIA tests at once

According to the results in Small-Hsiao Tests of IIA Assumption, all regions (NE, MW, S, W) have p-values well above 0.05, indicating no evidence against the IIA assumption. This means the odds of choosing one outcome over another are independent of the presence of other alternatives.

Regarding Hausman Tests of IIA Assumption results, the negative chi-square values for NE and S indicate that the model does not meet asymptotic assumptions for these cases. For MW and W, the chi-square values are not significant (p-values of 1.000 and 0.988, respectively), suggesting no evidence against the IIA assumption.

  • If your P > chi2 values are significant (p < .05), you are violating the IIA assumption.
    • However, depending on your sample size, there is debate about the utility/relevance of IIA tests. You can see the following post and cite it:
      • Allison, P. (2012). How relevant is the independence of irrelevant alternatives? Statistical Horizons.
    • You can also consider another model, such as the mixed logit model (MXL), which relaxes the IIA assumption; the user-written mixlogit command implements it. Please see this study as an example and this article for more information on the mixed logit model.
  • If your P > chi2 values are NOT significant (p > .05), you are NOT violating the assumption and can move forward with your multinomial logit.

Tip: Troubleshooting the mlogtest command

The mlogtest, hausman and mlogtest, smhsiao commands do not appear to work when the preceding mlogit includes an if condition; in that case they return an invalid syntax error. This seems to be a bug in the package, so make sure not to use an if condition when you plan to run these tests. Instead, drop the unwanted observations before running mlogit.
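
For example (a sketch; the restriction shown is hypothetical):

Stata
preserve
keep if sex == 1                 // hypothetical restriction, instead of: mlogit ... if sex == 1
mlogit region age i.race i.rural
mlogtest, smhsiao
restore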

Further, if the smhsiao test returns an error such as “basecategory not found,” you can solve the problem by creating a new outcome variable with the egen newvar = group(varname) command.

Stata
egen outcome = group(depvar)
mlogit outcome independentvars
mlogtest, smhsiao

See this Statalist post for more on this error.

Step 5. Model Fit Statistics: fitstat command

We can use the user-written fitstat command (also part of SPost) to examine the overall model fit.

Stata
fitstat
  • Log-likelihoods and Chi-square provide a basis for comparing models, with the chi-square test indicating the model is significantly better than an intercept-only model.
  • R-squared values (McFadden, Cox-Snell/ML, etc.) offer insight into the model’s explanatory power, which appears to be relatively low (McFadden’s R2 is 0.034 – 3.4%), suggesting that the model explains only a small portion of the variance in the outcome.
  • Information criteria (AIC, BIC) help with model selection, where lower values generally indicate a better fit after accounting for model complexity. They are not meaningful in isolation, so you need at least two models to compare them.
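
To compare competing specifications, Stata’s built-in estat ic and lrtest commands offer a quick cross-check (a sketch; the reduced model is illustrative):

Stata
quietly mlogit region age i.race i.sex i.rural
estimates store full
estat ic            // AIC/BIC for the full model
quietly mlogit region age i.race
estimates store reduced
estat ic            // AIC/BIC for the reduced model
lrtest reduced full // LR test of the nested models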

Reference

Multinomial Logistic Regression | Stata Data Analysis Examples (ucla.edu)

Mlogit1.pdf (nd.edu)

Mlogit2.pdf (nd.edu)

multinom_st.pdf (washington.edu)

A Hands-on Tutorial – Logit, Ordered Logit, and Multinomial Logit Models in Stata – Research Guides at Princeton University

Interpreting multinomial logistic regression in Stata – Bailey DeBarmore
