[Stata] Ordinal Logistic Regression: ologit, omodel, oparallel, gologit2
Ordinal logistic regression (ordered logistic regression) is a type of regression analysis used when the outcome variable is categorical and ordered, such as low, medium, and high. A key assumption of the model, the proportional odds assumption, is explained and tested later in this post.
In this blog post, I will demonstrate how to perform ordinal logistic regression in Stata using the webuse nhanes2 data set. The outcome variable is hlthstat, which measures the perceived health status of the respondents. The predictor variables are age, race, sex, body mass index (bmi), and high blood pressure (highbp).
Data
The webuse nhanes2 data set contains information on 10,351 adults from the second National Health and Nutrition Examination Survey (NHANES II) conducted in 1976-1980. The variables of interest are:
- hlthstat: perceived health status, coded as 1 = excellent, 2 = very good, 3 = good, 4 = fair, and 5 = poor.
- age: age in years.
- race: race, coded as 1 = white, 2 = black, and 3 = other.
- sex: sex, coded as 1 = male and 2 = female.
- bmi: body mass index, calculated as weight in kilograms divided by height in meters squared.
- highbp: high blood pressure, coded as 1 = yes and 0 = no.
To load the data in Stata, use the following command:
webuse nhanes2
Step 1. Descriptive Statistics
To get some descriptive statistics of the outcome and predictor variables, use the following commands:
fre hlthstat race sex highbp
sum bmi, detail
For this analysis, I want higher scores to indicate better health status, so I recode the variable (and set the code 8 to missing):
labrec hlthstat (1=5)(2=4)(4=2)(5=1)(8=.)
The labrec command recodes values together with their value labels 😊
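labrec is a user-written command, not part of official Stata. If it is not installed, the same reversal can be sketched with built-in commands, rebuilding the value labels by hand (the label text below is an assumption based on the coding listed earlier):

```stata
* reverse-code hlthstat so higher = better, then rebuild the value labels
recode hlthstat (1=5) (2=4) (4=2) (5=1) (8=.)
label define hlth5 1 "poor" 2 "fair" 3 "good" 4 "very good" 5 "excellent", replace
label values hlthstat hlth5
tab hlthstat   // check the new coding
```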
Step 2. Ordinal Logistic Regression Model
To fit an ordinal logistic regression model with hlthstat as the outcome variable and age, race, sex, bmi, and highbp as the predictor variables, use the following command. The or option tells Stata to report odds ratios instead of coefficients; I recommend always using it, since the raw coefficients are not easily interpretable.
ologit hlthstat age i.race i.sex bmi highbp, or
The output shows the odds ratios for each predictor variable, along with their standard errors, z-scores, p-values, and 95% confidence intervals. The odds ratios can be interpreted as the change in the odds of being in a higher category of hlthstat for a one-unit increase in the predictor variable, holding all other variables constant.
- For example, the odds ratio for age is 0.96, which means that for every one-year increase in age, the odds of being in a higher category of hlthstat (i.e., better health status) decrease by 4%, holding all other variables constant.
- The odds ratio for race is 0.43 for black and 0.77 for other, which means that compared to white respondents, black respondents have 57% lower odds and other respondents have 23% lower odds of being in a higher category of hlthstat, holding all other variables constant.
Instead of a constant term as in other regression models, the output also shows the cut points for the ordinal logistic regression model: the thresholds that separate the categories of the outcome variable. In ordered logistic regression, we model the probability of the outcome falling into the different ordered categories, and these cut points can be used to compute the probability of a case falling into a particular interval of the dependent variable.
In the output, there are four cut points: /cut1, /cut2, /cut3, and /cut4. These are essentially the values at which the latent variable is divided into the observable categories of hlthstat. The interpretation is somewhat abstract because the latent variable itself is not directly observable. However, you can think of these cut points in the following way:
- /cut1: A value of the latent variable below -5.541712 corresponds to the lowest category of health status (e.g., poor health).
- /cut2: A value between -5.541712 and -4.055599 corresponds to the next higher category (e.g., fair health).
- /cut3: A value between -4.055599 and -2.624712 corresponds to a further higher category (e.g., good health).
- /cut4: A value between -2.624712 and -1.364634 corresponds to the next category (e.g., very good health).
- Above /cut4: A value above -1.364634 corresponds to the highest category (e.g., excellent health).
In practice, you wouldn’t use these cut points directly to categorize individual observations. Instead, they are part of the model’s internals, helping to understand how the probability of being in a particular category of health status changes with the independent variables. The model calculates the probability of an individual falling into each category of hlthstat based on their characteristics (age, race, sex, BMI, high blood pressure). The cut points then help determine the most likely category based on these probabilities.
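To make this concrete, the category probabilities can be computed by hand from the cut points with the logistic CDF: Pr(y = k) = invlogit(cut_k − xb) − invlogit(cut_(k−1) − xb). A quick sketch, using the cut points from the output above and a hypothetical linear-predictor value of −3.5:

```stata
* category probabilities implied by the cut points (xb is a made-up example value)
local xb = -3.5
display "Pr(poor)      = " invlogit(-5.541712 - `xb')
display "Pr(fair)      = " invlogit(-4.055599 - `xb') - invlogit(-5.541712 - `xb')
display "Pr(good)      = " invlogit(-2.624712 - `xb') - invlogit(-4.055599 - `xb')
display "Pr(very good) = " invlogit(-1.364634 - `xb') - invlogit(-2.624712 - `xb')
display "Pr(excellent) = " 1 - invlogit(-1.364634 - `xb')
```

The five probabilities sum to 1, which is a handy check when reproducing predicted probabilities manually.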
Step 3. Model Diagnostics
To test the hypothesis that any two of the predictor variables are simultaneously equal to 0, we can use the lrtest command, which performs a likelihood ratio test. For example, to test the hypothesis that race and sex have no effect on hlthstat, we can use the following commands:
ologit hlthstat bmi highbp, or
estimates store reduced
ologit hlthstat age i.race i.sex bmi highbp, or
estimates store full
lrtest reduced full
The output shows the following results:
The output shows that we can reject the null hypothesis that demographic characteristics (age, race and sex) have no effect on hlthstat, and conclude that they are significant predictors in the model.
To obtain the predicted probabilities for each category of hlthstat, we need to first identify the number of categories in the outcome variable, which is one more than the number of cut points. In this case, there are 5 categories and 4 cut points. We can then create new variables for each category of the predicted probabilities using the predict command with the pr option. For example, to create a variable for the predicted probability of being in the first category (poor), we can use the following command:
predict pr1, pr outcome(1)
We can repeat this command for the other categories, changing the outcome option accordingly. To graph the predicted probabilities for each category, we can use the dotplot command with the over option. For example, to graph the predicted probability of being in the first category by race, we can use the following command:
dotplot pr1, over(race)
The output shows the following graph:
The graph shows that the predicted probability of being in the first category (poor) is lower for white respondents than for black or other respondents, holding all other variables constant. You can also double-check if this claim is true by running a one-way ANOVA with a predicted probability variable.
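The one-way ANOVA mentioned above can be run directly on the predicted-probability variable; the tabulate option adds a table of group means and standard deviations:

```stata
* compare mean predicted Pr(poor) across race groups
oneway pr1 race, tabulate
```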
According to the table, you can see that the mean predicted probability of being in the poor category is the lowest among Whites, while Black respondents have the highest probability with the highest standard deviation.
You can also predict the probability of all categories in our dependent variables – with 5 levels from poor to excellent. You can run the following code as an example (the number of pr would be changed according to the number of categories in your case):
predict pr1 pr2 pr3 pr4 pr5, pr
As shown above, it calculates the predicted probability for each category for all observations in the dataset, based on the independent variables in the model. You can also visualize the distribution of these predicted probabilities using dotplot.
dotplot pr1 pr2 pr3 pr4 pr5
Checking the model assumption: Proportional odds assumption
The proportional odds assumption is a key assumption for ordinal logistic regression. The proportional odds assumption means that the effect of the predictor variables (such as age, race, sex, bmi, and highbp) on the outcome variable (such as perceived health status) is the same (or constant) across all the categories of the outcome variable. Because of this assumption, there is only one coefficient/odds ratio for one predictor (compared to multinomial logit – which has multiple odds ratios across the pairs).
For example, suppose we want to know how age affects the perceived health status of the respondents. We can use ordinal logistic regression to estimate the odds ratio of age, which tells us how much the odds of being in a higher category of perceived health status (such as very high versus high, or high versus medium, or medium versus low) change for every one-year increase in age, holding all other variables constant. The proportional odds assumption means that this odds ratio of age is the same for all the comparisons of the categories of perceived health status. In other words, the effect of age on perceived health status is proportional across all the categories.
To illustrate this with some hypothetical numbers, suppose the odds ratio of age is 0.9, which means that for every one-year increase in age, the odds of being in a higher category of perceived health status decrease by 10%, holding all other variables constant. The proportional odds assumption means that this odds ratio of 0.9 applies to all the comparisons of the categories of perceived health status, such as:
- The odds of being in the excellent category versus being in the very good, good, fair, or poor category
- The odds of being in the very good category versus being in the good, fair, or poor category
- The odds of being in the good category versus being in the fair or poor category
- The odds of being in the fair category versus being in the poor category
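One informal way to see what the assumption requires (a rough sketch, not a formal test) is to dichotomize the outcome at each cut point, fit a separate binary logit at each split, and compare the odds ratios for a predictor such as age; under proportional odds they should be roughly equal:

```stata
* fit a binary logit at each of the four cut points and compare the OR for age
forvalues k = 1/4 {
    gen byte above`k' = hlthstat > `k' if !missing(hlthstat)
    quietly logit above`k' age i.race i.sex bmi highbp
    display "cut at `k': OR for age = " %6.3f exp(_b[age])
    drop above`k'
}
```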
The proportional odds assumption can be tested using various methods, such as the Brant test, the likelihood ratio test, or the score test (all of these can be conducted with the oparallel command in Stata).
Approach 1: Brant test of proportionality of odds
To check whether the model meets the parallel odds assumption, we can use the brant command, which performs a Brant test of proportionality of odds:
brant, detail
The output shows the Brant test statistic for each predictor variable, along with its p-value. The null hypothesis of the Brant test is that the odds ratios are equal across the categories of the outcome variable, i.e., that the parallel odds assumption holds.
- If the p-value is less than 0.05, we reject the null hypothesis and conclude that the parallel odds assumption is violated for that predictor variable. In the output, age and 2.sex have p-values below 0.05, so the parallel odds assumption does not hold for those predictors.
🔥Troubleshooting with Brant test in Stata
Error type 1: “not all independent variables can be retained in binary logits. brant test cannot be computed”
The sample size per category causes the error: https://www.statalist.org/forums/forum/general-stata-discussion/general/1384604-assumption-of-proportional-odds-brant-in-an-ordered-logistic-regression.
One suggested solution is to reduce the number of categories by grouping. You could consider collapsing sparse categories, taking their distributions into account.
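For example (a hypothetical regrouping; where to cut depends on your own distribution), the five-category outcome could be collapsed to three before re-running the model and the Brant test:

```stata
* collapse sparse outcome categories into three broader groups
recode hlthstat (1 2 = 1 "poor/fair") (3 = 2 "good") (4 5 = 3 "very good/excellent"), gen(hlth3)
ologit hlth3 age i.race i.sex bmi highbp, or
brant, detail
```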
Error type 2: “operator invalid”
This error could be resolved by reinstalling the package for the brant test. You can use the following code.
net install spost13_ado.pkg, from(https://jslsoc.sitehost.iu.edu/stata) force
Approach 2: omodel command
Another way to test the parallel odds assumption is to use the omodel command, which returns the same estimates as the ologit command but also produces an LR test of whether the coefficients are equal across categories (the proportional odds assumption).
ssc install omodel
// NOTE: omodel does not accept factor-variable (i.) notation, so create dummies first
tab race, gen(race)
tab sex, gen(sex)
omodel logit hlthstat age bmi highbp race2 race3 sex2
// race1 and sex1 are omitted as the base categories
The output shows the following results:
The omodel command also performs a likelihood ratio test of the proportional odds assumption, that is, whether the coefficients are equal across categories. In this example, the significant p-value (p < .001) suggests that the proportional odds assumption is violated: the relationship between the predictors and the outcome is not consistent across all pairs of outcome groups. The LR test compares the fitted ordered model with a less restrictive model that allows the coefficients to differ across categories; a significant difference indicates that the odds ratios are not constant across categories.
Approach 3: oparallel command
Another command that can be used to test the parallel odds assumption is the oparallel command (the ic option additionally reports information criteria):
ssc install oparallel
oparallel, ic
oparallel is a post-estimation command testing the parallel regression assumption in an ordered logit model. By default, it performs five tests: a likelihood ratio test, a score test, a Wald test, a Wolfe-Gould test, and a Brant test.
These tests compare an ordered logit model with the fully generalized ordered logit model, which relaxes the parallel regression assumption on all explanatory variables.
- If the p-values (Prob > chi2) are significant (p < .05), you are violating the proportional odds assumption.
- If the p-values are not significant (p > .05), you are not violating the assumption and can move forward with your ordered logit model.
What if the proportional odds assumption is violated?
If the proportional odds assumption is violated, it means that the effect of the predictor variables on the outcome variable is not the same across all the categories of the outcome variable. In our example, the Brant test and the omodel LR test above indicate that the assumption is violated for at least some predictors. In that situation, we could try some of the following strategies:
- Use the gologit2 command to fit a partial proportional odds model, which allows some of the predictor variables to have different odds ratios across the categories of the outcome variable while keeping the others constant.
- Use the mlogit command to fit a multinomial logistic regression model, which does not assume any order in the outcome variable, and compare the results with the ordinal logistic regression model.
- Use the oglm command to fit an alternative link function, such as the probit or the complementary log-log, which may fit the data better than the logit link function.
ssc install gologit2
gologit2 hlthstat age i.race i.sex bmi highbp, or
// gologit2 returns the generalized ordered logit estimates
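gologit2 also offers an autofit option that, per its help file, tests each predictor and relaxes the parallel-lines constraint only for those that violate it, producing a partial proportional odds model rather than a fully generalized one:

```stata
* relax the parallel-lines constraint only where the tests reject it
gologit2 hlthstat age i.race i.sex bmi highbp, autofit or
```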
Step 4. Computing Marginal Effects
To compute the marginal effects of the predictor variables on the outcome variable, we can use the mchange command, which calculates the change in the predicted probabilities for a discrete change in a predictor variable, holding all other variables constant. For example, to compute the marginal effect of highbp on hlthstat, we can use the following commands:
mchange
mchange highbp // you can specify the variable
The output shows the following results:
A useful feature of the mchange command is that it can report the change, start, and end values together with their p-values, using the following options:
mchange, stats(change p start end)
The “From” and “To” values in the output provide additional context for the changes in predicted probabilities (Pr(y)) for different health status categories based on race and sex. They represent the estimated probabilities of being in each health status category before (From) and after (To) the specified change (e.g., Black vs White, Female vs Male).
Race:
- Black vs White:
- Being Black (compared to White) is associated with a higher probability of reporting Poor (+0.069) and Fair (+0.088) health statuses and a lower probability of reporting Very good (-0.064) and Excellent (-0.115) health statuses. This indicates that Black individuals have a statistically significantly worse reported health status compared to White individuals.
- For Black individuals compared to White, the “From” probabilities indicate the estimated baseline probabilities for White individuals across the health status categories. The “To” probabilities show how these probabilities change for Black individuals. For instance, the probability of reporting Poor health status increases from 0.063 to 0.133 when comparing Black to White individuals, indicating a substantial increase in the likelihood of reporting Poor health for Black individuals.
- The p-values are 0.000 for all comparisons, indicating that these differences are statistically significant (p < 0.0001).
- Other vs White:
- Being of Other race (compared to White) is associated with a slightly higher probability of reporting Poor (+0.016) and Fair (+0.026) health statuses and a lower probability of reporting Very good (-0.016) and Excellent (-0.041) health statuses, suggesting somewhat worse health outcomes than Whites, though to a lesser extent than the Black vs. White comparison.
- Most of these differences are statistically significant, with p-values below 0.05.
- Other vs Black:
- Comparing Other race to Black shows a decrease in the probability of reporting Poor (-0.053) and Fair (-0.062) health statuses and an increase in the probability of reporting Very good (+0.048) and Excellent (+0.075) health statuses. This suggests that individuals of Other races report better health outcomes than Black individuals, with statistically significant differences.
Sex:
- Female vs Male:
- Being Female (compared to Male) is associated with a slightly higher probability of reporting Poor (+0.008) and Fair (+0.013) health statuses and a lower probability of reporting Excellent (-0.021) health status. This suggests that females have a marginally worse reported health status compared to males.
- These differences are statistically significant across all health status categories, indicated by p-values of 0.000.
Step 5. Model Fit Statistics
To conduct a formal global test of the model fit, we can use the fitstat command, which reports various goodness-of-fit statistics for the ordinal logistic regression model:
fitstat
The output shows various measures of fit, such as the log-likelihood, the likelihood ratio chi-square, the pseudo R-squared, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC).
- Log-likelihoods and Chi-square provide a basis for comparing models, with the chi-square test indicating the model is significantly better than an intercept-only model.
- R-squared values (McFadden, Cox-Snell/ML, etc.) offer insight into the model’s explanatory power, which appears to be relatively low (McFadden’s R2 is 0.056 – 5.6%), suggesting that the model explains only a small portion of the variance in the outcome.
- Information criteria (AIC, BIC) help with model selection across different models; lower values generally indicate a better fit after accounting for model complexity. Because they are comparative measures, they are not meaningful for a single model in isolation.
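Since AIC and BIC are comparative measures, one simple workflow (a sketch, using Stata's built-in estat ic alongside the user-written fitstat) is to fit a reduced and a full model and compare:

```stata
* compare information criteria across two nested models
quietly ologit hlthstat bmi highbp
estat ic   // AIC/BIC for the reduced model
quietly ologit hlthstat age i.race i.sex bmi highbp
estat ic   // lower AIC/BIC indicates the preferred model
```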
References
www3.nd.edu/~rwilliam/stats3/Ologit01.pdf
Ordered Logistic Regression | Stata Data Analysis Examples (ucla.edu)
ncrm.ac.uk/resources/online/ordinal_logistic_regression/downloads/cpuworkshop.pdf
https://jslsoc.sitehost.iu.edu/stata/cdaguide14/cdaicpsr2014%20labguide%202014-05-30@1color.pdf