Probability, Odds, Log-Odds, and Log-Likelihood in Logistic Regression

Logistic regression is commonly used to model a binary outcome (e.g. whether an individual accesses mental health services: Yes or No). Key concepts in logistic regression include probability, odds, log-odds (logit), odds ratio, and log-likelihood. Below, we explain each concept with definitions, formulas, and examples, and show how they relate to each other.

Probability in Logistic Regression

Probability is the chance or likelihood of an event happening. In logistic regression, we predict the probability that Y=1 (e.g. the individual accessed services) given certain predictors X. This predicted probability is often denoted p(X) = P(Y=1 \mid X=x), and it ranges from 0 to 1.

Logistic regression uses the logistic function to keep predicted probabilities between 0 and 1, since probabilities must lie in this range by definition. Here’s why this is necessary. If we used OLS regression for a binary outcome (e.g., whether someone accesses mental health services: Yes = 1, No = 0), the predicted values could fall below 0 or above 1. Consider the linear model: p_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots

This equation has no constraints on its output—it can produce any value from negative infinity to positive infinity. However, probabilities must be between 0 and 1. If we allow predictions like -0.5 or 1.3, they don’t make sense as probabilities.

The Logistic Function Maps Any Value to [0,1]

To fix this, logistic regression applies the logistic (sigmoid) function, which squashes any real number into the range [0,1]: p_i = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots)]}

This transformation has three key properties:

  • If the linear predictor (\beta_0 + \beta_1 x_{i1} + \dots) is very large (e.g., 10), then \exp[-(\text{linear predictor})] is very small, so p_i approaches 1.
  • If the linear predictor is very negative (e.g., -10), then \exp[-(\text{linear predictor})] is very large, so p_i approaches 0.
  • If the linear predictor is 0, the function simplifies to 0.5: p = \frac{1}{1 + e^0} = \frac{1}{2}

Thus, the logistic function constrains predictions between 0 and 1, ensuring valid probability values.
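
As a quick illustration (a minimal Python sketch, not from the original text), here is the logistic function applied to a large, a very negative, and a zero linear predictor:

import math

def logistic(eta):
    # Map any real-valued linear predictor eta to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-eta))

for eta in (10, -10, 0):
    print(f"linear predictor = {eta:>4}: p = {logistic(eta):.6f}")

# Output (approximately):
#   linear predictor =   10: p = 0.999955   (approaches 1)
#   linear predictor =  -10: p = 0.000045   (approaches 0)
#   linear predictor =    0: p = 0.500000   (exactly one half)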

The logistic function is S-shaped, meaning:

  • Predictions smoothly transition from near 0 to near 1.
  • Small changes in predictors have a small effect on probability when p is close to 0 or 1 but a large effect when p is near 0.5.
  • This means extreme predictor values simply push the predicted probability toward 0 or 1, rather than producing values outside the valid range, so logistic regression behaves sensibly even for extreme predictor values.

Log-Odds (Logit)

Log-odds is the logarithm of the ratio of the probability of an event occurring to the probability of it not occurring: \text{log-odds} = \ln \left( \frac{p}{1 - p} \right) . Logistic regression models the log-odds as a linear function of the predictors: \ln \left( \frac{p}{1 - p} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots . Since log-odds can be any real number (from -\infty to \infty ), this linear formulation is mathematically convenient. The logistic function is the inverse of the logit transformation, allowing us to convert the linear predictor back into a valid probability.

If the log-odds is 0, then the probability is 0.5 (50%). A positive log-odds means the probability is greater than 50%, while a negative log-odds means the probability is less than 50%.
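
As an illustration (a small Python sketch, not part of the original text), the conversions between probability, odds, and log-odds can be written directly from the formulas above:

import math

def odds(p):
    return p / (1 - p)                 # odds = p / (1 - p)

def log_odds(p):
    return math.log(p / (1 - p))       # logit = ln(p / (1 - p))

def prob_from_log_odds(z):
    return 1 / (1 + math.exp(-z))      # inverse logit (logistic function)

for p in (0.25, 0.50, 0.75):
    z = log_odds(p)
    print(p, round(odds(p), 3), round(z, 3), round(prob_from_log_odds(z), 3))

# Approximate output:
#   0.25  0.333  -1.099  0.25   (negative log-odds => probability below 50%)
#   0.5   1.0     0.0    0.5    (log-odds 0 => probability exactly 50%)
#   0.75  3.0     1.099  0.75   (positive log-odds => probability above 50%)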

The odds ratio helps us understand how odds change when a predictor variable increases by one unit. It is derived by exponentiating the log-odds coefficient. The formula is \text{odds ratio} = e^{\beta} .

Since logistic regression models log-odds linearly, the coefficients from a logit model are on the log-odds scale, which is difficult to interpret directly. A coefficient of 0.7, for example, has no intuitive meaning in log-odds. However, by exponentiating it we get the odds ratio e^{0.7} \approx 2.01 , which means that for a one-unit increase in the predictor, the odds are multiplied by about 2.01. If the coefficient is negative, such as \beta = -0.7 , then the odds ratio is e^{-0.7} \approx 0.5 , indicating that the odds decrease to about 50% of their original value.

We use the odds ratio because it provides a more interpretable way to express the effect of a predictor. Coefficients describe additive changes on the log-odds scale, whereas the odds ratio expresses the same effect as a multiplicative change in the odds, which makes it easier to communicate how much a predictor increases or decreases the likelihood of an event.

Since the logistic model is linear in log-odds, exponentiating both sides gives us the odds ratio: \frac{p}{1-p} = e^{\beta_0 + \beta_1 x_1 + \dots}

  • In logistic regression, each coefficient ( \beta) represents the change in log-odds per unit change in the predictor.
  • Exponentiating a coefficient (e^\beta) gives the odds ratio, meaning the multiplicative change in odds for a one-unit increase in the predictor.
(e^x calculator: https://www.omnicalculator.com/math/e-power-x)
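
As a quick check of the arithmetic above (a Python sketch; the coefficient values are just the illustrative 0.7 and -0.7 used earlier):

import math

beta_positive = 0.7    # hypothetical coefficient on the log-odds scale
beta_negative = -0.7

print(round(math.exp(beta_positive), 2))   # 2.01: odds multiplied by about 2 per one-unit increase
print(round(math.exp(beta_negative), 2))   # 0.5: odds roughly halved per one-unit increase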

Why do we use odds ratios instead of probability?

Before understanding why we use odds ratios, let’s break down how probability, odds, and odds ratio relate:

  • Probability ( p ): The chance of an event happening.
    p = \frac{\text{favorable outcomes}}{\text{total outcomes}}
    • Example: If 60% of insured individuals access mental health services, p = 0.6.
  • Odds: The ratio of the probability of an event happening to the probability of it not happening.
    \text{Odds} = \frac{p}{1-p}
    • Example: If p = 0.6, then \text{Odds} = \frac{0.6}{1 - 0.6} = \frac{0.6}{0.4} = 1.5 .
    • This means the event is 1.5 times as likely to happen as not to happen.
  • Odds Ratio (OR): The ratio of two odds (e.g., the odds of the event happening in Group A vs. Group B).
    \text{OR} = \frac{\text{Odds in Group A}}{\text{Odds in Group B}}
    • Example: If insured individuals have odds of 1.5 and uninsured individuals have odds of 0.5, the odds ratio is: OR = \frac{1.5}{0.5} = 3
    • This means insured individuals have 3 times the odds of accessing mental health services compared to uninsured individuals.
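
These three quantities can be computed directly. Here is a small Python sketch using the numbers from the bullets above (p = 0.6 for the insured group; the uninsured probability of 1/3 is an assumed value chosen so that the uninsured odds equal 0.5, as in the example):

p_insured = 0.6          # probability for the insured group (from the example)
p_uninsured = 1 / 3      # assumed probability giving uninsured odds of 0.5

odds_insured = p_insured / (1 - p_insured)          # 0.6 / 0.4 = 1.5
odds_uninsured = p_uninsured / (1 - p_uninsured)    # (1/3) / (2/3) = 0.5
odds_ratio = odds_insured / odds_uninsured          # 1.5 / 0.5 = 3.0

print(round(odds_insured, 2), round(odds_uninsured, 2), round(odds_ratio, 2))
# 1.5 0.5 3.0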

(A) Probability Changes Non-Linearly, While Odds Ratios Are Constant

One key reason for using odds ratios is that probability does not change at a constant rate across different baseline probabilities.

Example: Interpreting Coefficients in Probability vs. Odds

Suppose we find that having insurance increases the probability of accessing services by 20 percentage points. This sounds simple, but it depends on the baseline probability:

Baseline Probability (p) Without Insurance    Probability With Insurance (p + 0.2)
0.10 (10%)                                    0.30 (30%)
0.40 (40%)                                    0.60 (60%)
0.70 (70%)                                    0.90 (90%)
  • The absolute increase is always 0.2, but the relative increase varies significantly depending on where you start.
  • The impact of a predictor is not consistent in probability terms.

Odds ratios, however, remain consistent regardless of baseline probability. If the odds ratio is 3, it means the odds of accessing services are tripled no matter what the baseline odds were.
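
A short Python sketch (not from the original text) makes this concrete: applying the same odds ratio of 3 to the three baseline probabilities in the table above always triples the odds, but changes the probability by different amounts.

def apply_odds_ratio(p_baseline, odds_ratio):
    # Convert to odds, multiply by the odds ratio, and convert back to a probability.
    odds = p_baseline / (1 - p_baseline)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

for p in (0.10, 0.40, 0.70):
    print(round(p, 2), "->", round(apply_odds_ratio(p, 3.0), 2))

# Approximate output: 0.1 -> 0.25, 0.4 -> 0.67, 0.7 -> 0.88
# The odds always triple, but the absolute change in probability differs
# (gains of roughly 0.15, 0.27, and 0.18, respectively).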


(B) Easier Interpretation in Multiplicative Terms

Odds ratios allow us to say:

  • “People with insurance have 3 times the odds of accessing mental health services compared to those without insurance.”
  • “A one-unit increase in income increases the odds of accessing services by 20%.”
    (If OR = 1.2, meaning the odds multiply by 1.2 for each income unit.)

These statements are more interpretable in many research contexts.


(C) Consistency Across Logistic Regression Models

Since logistic regression models log-odds, interpreting coefficients directly in terms of probability is difficult. Instead, we:

  1. Exponentiate the coefficient to get an odds ratio: OR = e^{\beta}
  2. Interpret the OR as a multiplicative effect on odds.
    • OR > 1: Predictor increases odds.
    • OR < 1: Predictor decreases odds.
    • OR = 1: Predictor has no effect.

Example: If the coefficient for insurance status is \beta = \ln(3) \approx 1.099 , then OR = e^{1.099} \approx 3.0 , meaning having insurance triples the odds of accessing services.

Probability and odds are different ways to express likelihood. Probability ranges from 0 to 1, while odds range from 0 to \infty for 0<p<1 . The transformation from probability to odds is monotonic (higher p gives higher odds) (see this article: stats.oarc.ucla.edu). Small probabilities and odds are very similar in value (e.g. p=0.05 gives odds of about 0.053), but they diverge as probabilities grow large.
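
A quick Python check of this point (illustrative, not from the original article):

for p in (0.05, 0.10, 0.50, 0.90, 0.95):
    print(f"p = {p:.2f}  odds = {p / (1 - p):.3f}")

# p = 0.05  odds = 0.053   (odds and probability nearly identical for small p)
# p = 0.10  odds = 0.111
# p = 0.50  odds = 1.000
# p = 0.90  odds = 9.000
# p = 0.95  odds = 19.000  (odds far exceed the probability as p approaches 1)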

Odds Ratio

An Odds Ratio (OR) compares the odds of an event between two different conditions or groups. It is a relative measure defined as the ratio of two odds. For example, if Group A has odds O_A and Group B has odds O_B of accessing mental health services, the odds ratio is \frac{O_A}{O_B} .

Odds ratios are often reported in health research because they are easier to interpret than raw log-odds coefficients. An OR = 1 indicates no association. OR > 1 indicates higher odds of the outcome (more likely), and OR < 1 indicates lower odds (less likely) in the group of interest (or per unit increase of a predictor). Always pay attention to which group or condition is the reference.

  • In Logistic Regression: The odds ratio is often used to interpret logistic regression coefficients. In a logistic model, a coefficient \beta for a predictor represents the change in log-odds for a one-unit increase in that predictor. If \beta = 0.693 , then the odds ratio for a one-unit increase in that predictor is \exp(0.693) \approx 2.0. This means the odds are doubled. Generally, \text{OR} = e^{\beta} for a one-unit change in a predictor. An OR > 1 indicates the odds increase with the predictor; OR < 1 indicates the odds decrease with the predictor; OR = 1 means no change (the predictor has no effect on odds).

Example

Group        Probability (p)    Odds (p / (1-p))    Log-Odds (ln(Odds))    Odds Ratio (vs. Uninsured)
Insured      0.50               1.0                 0                      3.0
Uninsured    0.25               0.333               -1.099                 Ref (reference group)

Suppose the odds of accessing services for individuals with health insurance is 1.0 (meaning 50% probability, as above), and the odds for those without insurance is 0.333 (25% probability). The odds ratio (insured vs. uninsured) would be 1.0 / 0.333 \approx 3.0. This OR = 3 means the odds of accessing services are three times higher for insured individuals compared to uninsured individuals. In terms of probability, the insured group’s probability (50%) is higher than the uninsured’s (25%), but the OR quantifies the multiplicative difference in odds.

If a logistic regression finds that the coefficient for having insurance is \beta = \ln(3.0) \approx 1.099 , then \exp(1.099) \approx 3.0 is the odds ratio. It implies having insurance multiplies the odds of accessing services by 3, compared to not having insurance. Similarly, a negative coefficient would yield an OR less than 1. For instance, a coefficient of -0.2007 corresponds to \exp(-0.2007) \approx 0.8185 , meaning the odds are about 0.82 times (i.e. 18% lower than) the baseline.

Log-Likelihood in Logistic Regression

The log-likelihood is a measure of model fit, derived from the likelihood of the observed data given the model parameters. In logistic regression, each observation’s contribution to the likelihood is the probability of its observed outcome. For a single observation i with outcome y_i (which is 1 for “accessed services” or 0 for “did not access”), the contribution to the likelihood is:

  • p_i if y_i = 1 (because p_i is the model’s predicted probability of Y=1 ), or
  • 1 - p_i if y_i = 0 (the model’s predicted probability of Y=0 ).

The likelihood of the entire dataset is the product of all individual probabilities (for each person, the probability of their actual outcome). The log-likelihood is simply the natural logarithm of this likelihood. Taking logs turns the product into a sum, which is easier to work with. The log-likelihood for a dataset of n observations can be written as:

\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right] where p_i = P(Y_i=1 \mid X_i; \boldsymbol{\beta}) is the predicted probability for observation i with predictors X_i . This formula adds up \ln(p_i) for every case where Y=1 and \ln(1-p_i) for every case where Y=0 . Intuitively, the model gets rewarded (in log-likelihood) when it assigns a high probability to the actual outcome (since \ln(\text{higher probability}) is less negative), and it gets penalized when it assigns a low probability to the actual outcome (since \ln(\text{small probability}) is a large negative number).
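
To make the formula concrete, here is a minimal Python sketch (the outcomes and predicted probabilities are made up for illustration):

import math

y = [1, 0, 1, 1, 0]                  # hypothetical observed outcomes (1 = accessed, 0 = did not)
p = [0.80, 0.30, 0.60, 0.90, 0.20]   # hypothetical predicted probabilities of Y = 1

log_lik = sum(
    y_i * math.log(p_i) + (1 - y_i) * math.log(1 - p_i)
    for y_i, p_i in zip(y, p)
)
print(round(log_lik, 3))   # about -1.419

# Each term is ln(p_i) when y_i = 1 and ln(1 - p_i) when y_i = 0, so the sum is
# negative; a value closer to 0 would indicate a better fit to these five cases.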

  • Maximizing Log-Likelihood: Logistic regression fits the coefficients \beta by maximum likelihood estimation (MLE) – it finds the \beta values that maximize the log-likelihood. A higher log-likelihood means the model predicts the data better (it assigns higher probabilities to what actually happened). Typically, the maximized log-likelihood is a negative number (because predicting everything with 100% certainty is usually impossible, and \ln(p) is negative when p<1 ). The goal is to make this value as close to zero (as little negative) as possible. A perfect model that predicts each outcome with probability 1 would have a log-likelihood of 0 (since \ln(1)=0 for each term). Real models will have some error, so the log-likelihood will be below 0. For example, a model might have \ell = -80.1 as the final log-likelihood. By itself, this number doesn’t tell you much; it’s useful in comparison to other models.
  • Using Log-Likelihood: We often use log-likelihood to compare models. For instance, the likelihood ratio chi-square test uses the difference between a full model’s log-likelihood and a reduced (null) model’s log-likelihood to assess overall model fit. Also, pseudo-R^2 measures (like McFadden’s R^2 ) are based on the log-likelihood of the model relative to a null model. But remember, unlike an OLS R^2 , these are not straightforward “variance explained” percentages.
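
Here is a small Python sketch using the statsmodels library with simulated, hypothetical data (not output from any analysis in this post); it shows where the log-likelihood, likelihood-ratio chi-square, pseudo-R^2, and odds ratios appear after fitting:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated, hypothetical data: an 'insured' indicator and a binary 'accessed' outcome
# whose log-odds increase by ln(3) when insured = 1.
rng = np.random.default_rng(0)
insured = rng.integers(0, 2, size=200)
true_log_odds = -1.1 + np.log(3) * insured
accessed = rng.binomial(1, 1 / (1 + np.exp(-true_log_odds)))

X = sm.add_constant(pd.DataFrame({"insured": insured}))
result = sm.Logit(accessed, X).fit()   # maximizes the log-likelihood iteratively

print(result.llf)             # final (maximized) log-likelihood
print(result.llnull)          # log-likelihood of the intercept-only (null) model
print(result.llr)             # likelihood-ratio chi-square: -2 * (llnull - llf)
print(result.prsquared)       # McFadden's pseudo R-squared
print(np.exp(result.params))  # exponentiated coefficients, i.e. odds ratios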

Log-Likelihood Iteration in Stata

When you run a logistic regression in software like Stata, it uses an iterative algorithm to maximize the log-likelihood (since there’s no closed-form solution for the best \beta coefficients). Stata’s output often shows a log-likelihood iteration log like this:

Iteration 0:   log likelihood = -115.64441
Iteration 1:   log likelihood = -84.558481
Iteration 2:   log likelihood = -80.491449
Iteration 3:   log likelihood = -80.123052
Iteration 4:   log likelihood = -80.118181
Iteration 5:   log likelihood = -80.11818

Here’s how to interpret this:

  • Iteration 0 is the log-likelihood of the “null model” (intercept only, no predictors). In this example, it’s -115.644. This serves as a baseline fit with just the overall outcome probability.
  • At Iteration 1, the predictors are introduced, and the algorithm adjusts coefficients to improve the fit. The log-likelihood jumped to -84.558, a big improvement (less negative) from -115.644. Each subsequent iteration tweaks the coefficients to increase the log-likelihood further.
  • The log-likelihood keeps increasing (from -84.56 to -80.49 to -80.12…) because the fitting algorithm (often Newton-Raphson or similar) is searching for the maximum log-likelihood.
  • The process stops when the improvement is very small (the change in log-likelihood between iterations falls below a threshold). In the output above, by Iteration 4 and 5 the log-likelihood has essentially stabilized at -80.11818. Stata then declares the model converged and reports the final results.

The final log-likelihood = -80.11818 in this example is reported in the model output. It corresponds to the best fit achieved. Stata also reports statistics like the likelihood-ratio chi-square (which here is 71.05, computed as -2 times the difference between the null and final log-likelihoods: -2[-115.644 - (-80.118)] = 71.05 ), and a pseudo-R^2 of 0.3072, which is another way to gauge model fit relative to the null model.
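
Both reported statistics can be reproduced directly from the two log-likelihoods in the iteration log (a small Python check):

ll_null = -115.64441    # Iteration 0: intercept-only (null) model
ll_model = -80.11818    # final, converged log-likelihood

lr_chi2 = -2 * (ll_null - ll_model)     # likelihood-ratio chi-square
mcfadden_r2 = 1 - (ll_model / ll_null)  # McFadden's pseudo R-squared

print(round(lr_chi2, 2))      # 71.05
print(round(mcfadden_r2, 4))  # 0.3072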

▶️ Key points about iteration:

  • The log-likelihood should increase with each iteration (or at least not decrease), since the algorithm is maximizing it. If you ever see it decrease, something is unusual (it can happen with some non-default algorithms or if constraints are applied, but generally it’s increasing).
  • Iteration 0 (Null Model): This value is often quite low (very negative) if the outcome is not extremely balanced. It reflects a model that only predicts the overall mean probability for everyone.
  • Convergence: Once the change in log-likelihood is negligible (by default criteria), the algorithm stops. If the model has trouble converging, you might see many iterations or warnings. But in a well-behaved model, it converges in a reasonable number of iterations, as in the example above (5 iterations).
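
For intuition about what the software does at each iteration, here is a bare-bones Newton-Raphson sketch in Python (illustrative only: it omits the safeguards and convergence options real packages use, its iteration 0 starts from all-zero coefficients rather than the fitted null model, and its numbers will not match the Stata log above):

import numpy as np

def fit_logit(X, y, tol=1e-8, max_iter=25):
    """X: n-by-k design matrix including a constant column; y: array of 0/1 outcomes."""
    beta = np.zeros(X.shape[1])                    # start from all-zero coefficients
    for iteration in range(max_iter):
        p = 1 / (1 + np.exp(-X @ beta))            # current predicted probabilities
        log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        print(f"Iteration {iteration}: log likelihood = {log_lik:.6f}")
        gradient = X.T @ (y - p)                        # first derivative of the log-likelihood
        hessian = -(X * (p * (1 - p))[:, None]).T @ X   # second derivative (negative definite)
        step = np.linalg.solve(hessian, gradient)       # Newton step: H^{-1} g
        beta = beta - step                              # update that increases the log-likelihood
        if np.max(np.abs(step)) < tol:                  # stop when the change is negligible
            break
    return beta

# Example with simulated, hypothetical data: one predictor plus a constant.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
X = np.column_stack([np.ones(500), x])
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))
fit_logit(X, y)   # prints an increasing log-likelihood until convergence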

Interpreting Log-Likelihood Values

A common question is how to interpret the absolute value of the log-likelihood and whether it should be “close to” some value (like 0 or 1). The log-likelihood is not bounded between 0 and 1 – it’s not a probability itself but the logarithm of a likelihood. In fact, it’s usually negative for any decent model (because no model predicts every outcome with 100% certainty). Here are some important considerations:

  • Log-likelihood vs Probability: Remember that the likelihood of the data is a product of many probabilities, often small ones. For example, if one observation has a predicted probability 0.8 of the observed outcome, and another has 0.6, their combined likelihood is 0.8 times 0.6 = 0.48. The log-likelihood for those two would be \ln(0.48) \approx -0.734 . With many observations, the log-likelihood is the sum of many such \ln(p) terms, so it will typically be negative (since each term is \ln(p_i) or \ln(1-p_i), which are negative when p_i < 1). A larger (less negative) log-likelihood indicates a better fit, but there is no fixed “maximum” like 1. The theoretical maximum is 0 (if the model predicted every point perfectly with probability 1, \ln(1)=0 for each observation, sum = 0). However, in practice 0 is never reached unless the data are perfectly separable (which often indicates an overfit or an infinite coefficient situation).
  • Absolute Value Meaning: The absolute value of the log-likelihood by itself isn’t directly meaningful to interpret quality. For instance, a log-likelihood of -80 might seem “higher” (better) than -200, but this also depends on sample size and complexity. Doubling the number of observations tends to roughly double the log-likelihood in magnitude (because it’s a sum over observations). What matters more is comparing log-likelihoods between models. That’s why we use measures like the likelihood ratio test or pseudo- R^2 to compare a given model’s log-likelihood to a baseline or to another model’s log-likelihood. In our Stata example, -80.118 by itself doesn’t tell much, but comparing it to the null model’s -115.644 showed a big improvement.
  • No Fixed Benchmark: The likelihood is the probability of the observed data under the model, and a perfect model would have a likelihood of 1 (100% for all data) and a log-likelihood of 0. But in realistic scenarios, the likelihood (product of probabilities) becomes extremely small as data size grows, and the log-likelihood is negative. So we aim to maximize the log-likelihood, but there’s no benchmark like “should be near 0 or 1.” It should be as high as possible given the data and model. If adding predictors moves the log-likelihood from -115 to -80, that’s a significant improvement. If another model has a log-likelihood of -75, that one fits better (assuming the same data).
  • Comparisons and Scale: To interpret fit, we often compare the log-likelihood of our model to the log-likelihood of a null model or a saturated model. The null model (only intercept) gives a baseline; the saturated model (perfectly fits each observation) would have log-likelihood 0. Some pseudo- R^2 measures scale the log-likelihood this way: e.g. McFadden’s R^2 = 1 - \frac{\ell_{\text{model}}}{\ell_{\text{null}}} , which will be 0 for a model no better than null and approach 1 for a model that is nearly perfect. But the raw log-likelihood itself is usually just used internally for calculations or comparisons, not as a standalone goodness-of-fit metric understandable in isolation.

Summary: In logistic regression, we interpret coefficients and model fit via these linked concepts:

  • Probability is the direct chance of the outcome (easy to understand, 0–1 range).
  • Odds = p/(1-p) , comparing chance of event to no-event.
  • Log-Odds (logit) = \ln(p/(1-p)), used for the regression’s linear modeling. It’s the scale on which logistic regression coefficients add up.
  • Odds Ratio = ratio of two odds (often e^{\text{coefficient}}), which tells how a predictor multiplicatively changes the odds. It’s a common effect size in health research.
  • Log-Likelihood = sum of \ln probabilities of the observed outcomes under the model. It’s maximized to fit the model, and used to compare models. A higher log-likelihood (less negative) means a better fit. We examine its improvement during iteration and use it for statistical tests, rather than trying to interpret it as a probability itself.

Here is an example: “Having insurance is associated with higher odds of service use (OR = 3.0), meaning the odds of accessing services are three times higher for insured individuals. The model’s log-likelihood improved from -115 to -80 with the inclusion of insurance status and other variables, indicating a substantially better fit than the null model.”
