# [Stata] Poisson Regression: poisson, nbreg, zip, countfit

In this blog post, we’ll explore Poisson regression models using the HINTS 6 dataset. Our dependent variable (DV) is `drinkdaysperweek`, which represents the number of days per week that the participant had at least one drink during the past 30 days (i.e., average drink days).

We’ll explore the suitability of various models for count data, specifically Poisson regression, negative binomial regression (NBR), zero-inflated Poisson (ZIP), and zero-inflated negative binomial (ZINB) models, and compare them using the `countfit` command.

## Concepts

**Overdispersion**

- Overdispersion means that the **variance of the response is greater than what’s assumed by the model** (ref).
- For your understanding, you can find the following figure from this website.

**Poisson Regression**

- Poisson regression is used to model count data. It predicts the count of events happening within a fixed interval of time or space, assuming the events occur with a known constant mean rate and independently of the time since the last event.
- **Use Case**: Suitable for counts where the mean and variance of the distribution are approximately equal.
- **Limitation**: It may not perform well if the data exhibits **overdispersion**, where **the variance is much larger than the mean**.

**Negative Binomial Regression (NBR)**

- An extension of Poisson regression that can handle overdispersed count data. It introduces an extra parameter to model the variance separately from the mean, allowing for greater flexibility.
- **Use Case**: Ideal for count data that is overdispersed. This is often evident when the variance significantly exceeds the mean.
- **Difference from Poisson**: It can model data with a greater variance, making it more flexible for real-world count data that doesn’t fit the strict assumptions of Poisson regression.
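In symbols, the two variance assumptions can be sketched as follows, where $\mu_i$ is the conditional mean for observation $i$ and $\alpha$ is the dispersion parameter (notation introduced here for illustration). The second line assumes the mean-dispersion (NB2) form, which is the default in Stata’s `nbreg`; other parameterizations imply a different variance function.

$$
\begin{aligned}
\text{Poisson:} \quad & \operatorname{Var}(y_i \mid x_i) = \mu_i \\
\text{NB2:} \quad & \operatorname{Var}(y_i \mid x_i) = \mu_i + \alpha \mu_i^{2}
\end{aligned}
$$

When $\alpha = 0$ the NB2 variance reduces to the Poisson variance, which is why a test of $\alpha = 0$ can be read as a test for overdispersion.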

**Zero-Inflated Poisson (ZIP)**

- ZIP models are designed to handle count data with an excess number of zeros, more than what the Poisson distribution would predict. It combines a Poisson count model with a logistic regression model to predict the presence of excess zeros.
- **Use Case**: Useful in scenarios where the data includes both genuine zero counts and excess zeros due to some subjects not being at risk of the event.
- **Difference from Poisson/NBR**: Specifically addresses the issue of excess zeros in the data, which neither the standard Poisson nor the negative binomial regression models directly account for.

**Zero-Inflated Negative Binomial (ZINB)**

- This model is a combination of the negative binomial model and a process for accounting for excess zeros. Like the ZIP model, it uses logistic regression to model the excess zeros but applies the negative binomial distribution to the count data.
- **Use Case**: Ideal for overdispersed count data with an excess number of zeros.
- **Difference from ZIP**: It is more flexible than ZIP because it can handle both overdispersion and excess zeros.

### Zero-Inflated Models

⭐ **“True zeros” and “excess zeros”**

Please see these slides for more information.

For your understanding, let’s think about the context of modeling investor trading activity. Suppose that there are two distinct groups:

- Group 1 (“excess” zeros): Investors who will never trade, guaranteeing zero trade counts always.
- Group 2 (“true” zeros): Investors who do trade, with their trade counts following a Poisson or negative binomial distribution that can produce zeros or positive counts.

Zero-inflated models are designed to handle this **excess zero** problem. They assume the data comes from two separate parts:

- A process generating “excess” or “always” zeros. Using the investor example, these would represent investors who never trade at all, so their trade counts will always be zero.
- A process generating counts from a traditional count distribution like Poisson or negative binomial. These would represent active investors for whom trade counts of zero or more are possible outcomes.

The two main zero-inflated models are:

Zero-Inflated Poisson (ZIP): This has a binary component modeling the excess (“always”) zeros, and a Poisson component modeling the counts, which can also produce “true” zeros from the counting process.

Zero-Inflated Negative Binomial (ZINB): This extends ZIP by using a negative binomial instead of a Poisson distribution for the counting process to account for over-dispersion.

The key advantage of these two-part zero-inflated models is explicitly modeling the excess zeros separately from the counts, which traditional count models cannot do appropriately.
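The two-process idea can be written compactly for the ZIP case. In this sketch, $\pi_i$ is the probability that observation $i$ belongs to the always-zero group (modeled with a logit) and $\mu_i$ is the Poisson mean for the other group; both symbols are notation introduced here, not taken from Stata output.

$$
\begin{aligned}
P(y_i = 0) &= \pi_i + (1 - \pi_i)\, e^{-\mu_i} \\
P(y_i = k) &= (1 - \pi_i)\, \frac{e^{-\mu_i} \mu_i^{k}}{k!}, \qquad k = 1, 2, \dots
\end{aligned}
$$

The first line shows why zeros are “inflated”: a zero can come either from the always-zero group or from the count process itself. ZINB follows the same structure with the negative binomial probabilities in place of the Poisson ones.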

**Preparing the Data**

We will first store the list of predictors for our model with the `global` command. This way, you don’t have to retype the list of variables in every model; you can just write `$predictors` after defining it. Please find this post for more about this command. Remember to replace `depvar` and `$predictors` with the actual variable names and predictors you’re using in your analysis.

`global predictors age i.birthgender i.raceethn5 i.educa i.workfulltime`

## Step 1: Histogram and Frequency Distribution

The initial analysis involves visually inspecting the distribution of `drinkdaysperweek` and summarizing its statistics to check for overdispersion, a situation where the variance significantly exceeds the mean. Overdispersion indicates that a basic Poisson regression may not adequately model the data.

We start by visualizing the frequency distribution of our count variable:

```
hist drinkdaysperweek, discrete freq
sum drinkdaysperweek, detail
```

This histogram reveals the frequency distribution of average drink days after excluding the participant. The distribution may show a **skew**, with **a majority of smaller average drink days and a few larger ones**, indicating potential **over-dispersion** or **zero-inflation**, common traits in count data.
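To put numbers next to the histogram, a quick check along the following lines may help. It is a minimal sketch that compares the raw mean and variance and contrasts the observed share of zeros with the share a Poisson distribution with the same mean would imply. The scalar names `dv_mean` and `dv_n` are arbitrary choices made for this example.

```stata
* Compare the sample mean and variance of the outcome
quietly summarize drinkdaysperweek
scalar dv_mean = r(mean)
scalar dv_n    = r(N)
display "mean = " dv_mean "   variance = " r(Var)

* Observed share of zeros versus the share implied by a Poisson with the same mean
quietly count if drinkdaysperweek == 0
display "observed share of zeros        = " r(N)/dv_n
display "Poisson-implied share of zeros = " exp(-dv_mean)
```

A variance well above the mean, or many more zeros than the Poisson-implied share, is only a first hint; the model-based checks in the later steps condition on the predictors.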

## Step 2: Poisson Regression

To explore how well the Poisson distribution predicts our outcome, we graphically compare observed counts to those predicted by a basic Poisson regression:

```
poisson drinkdaysperweek
mgen, pr(0/7) meanpred stub(psn)
```

`mgen` is a post-estimation command that uses `margins` to create six new variables whose names start with the prefix `psn` (whatever you put in the `stub()` option). The predictions are computed for the counts in the range given in `pr()`, here 0 through 7.

We label our variables for clarity and generate a graph to compare observed counts with Poisson predictions:

```
label var psnobeq "observed"
label var psnpreq "poisson prediction"
label var psnval "# of average drink days"
graph twoway connected psnobeq psnpreq psnval
```

It appears that this Poisson distribution tends to underestimate **occurrences of 0s, while overestimating occurrences of 1s, 2s, and 3s,** and then underestimating occurrences of 5s, 6s, and 7s (although to a lesser extent). This discrepancy arises because our current model fails to consider variations in mean values, assuming uniform drinking rates across all individuals. To address this, we should explore a model incorporating other covariates to enhance its accuracy.

```
poisson drinkdaysperweek $predictors
mgen, pr(0/7) meanpred stub(psn) replace
// replace option will replace saved values
```

Although there has been a slight enhancement, **the variability observed in values 0 and 1** indicates that a **Poisson model might not be the most suitable choice for this outcome**. Therefore, let’s explore a negative binomial regression (NBR) instead in the next step. NBR aims to address this overdispersion issue by incorporating an additional parameter (alpha) into the model.

```
nbreg drinkdaysperweek $predictors
mgen, pr(0/7) meanpred stub(psn) replace
```

Great improvement! Because the outcome is overdispersed, the `nbreg` command fits the data much better than `poisson`. For counts 0 through 3, the predictions no longer show the systematic over- and underprediction we saw with the Poisson model.

Towards the end of our NBR table, there’s a result from the LR test. This LR test is aimed at **determining whether overdispersion exists**.

- If the test isn’t statistically significant (p > .05): we fail to reject the null hypothesis that the dispersion parameter (alpha) equals 0, indicating **no evidence of overdispersion**. In that case, the regular Poisson model may be adequate.
- If the test is statistically significant (p < .05): we reject the null hypothesis of alpha=0, indicating **overdispersion**. **This reinforces our decision to use an NBR model over a regular Poisson model.**
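As a complementary check, Stata’s `estat gof` can be run after `poisson` to report the deviance and Pearson goodness-of-fit tests. The sketch below simply refits the Poisson model from this step and requests those tests; it is an optional extra, not part of the workflow above.

```stata
* Refit the Poisson model, then request its goodness-of-fit tests
poisson drinkdaysperweek $predictors
estat gof
```

A statistically significant result suggests the Poisson model fits the data poorly, which is consistent with (though not proof of) overdispersion.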

## Step 3: Model Comparison

To compare alternative models, we use the `countfit` command or a manual approach. We consider the Poisson Regression Model (PRM), the Negative Binomial Regression Model (NBRM), the Zero-Inflated Poisson (ZIP), and the Zero-Inflated Negative Binomial (ZINB).

#### Approach 1: `countfit` command

Now, let’s question whether NBR is indeed the optimal model for our dataset. Unlike the **Poisson** and **NBR** (and hurdle) models, which give every observation a positive probability of a nonzero count, **zero-inflated models** (zip and zinb) allow some observations to have zero probability of a nonzero count. For instance, some participants may adhere to religious doctrines that forbid the consumption of alcohol. They will always abstain, so their probability of having any drink days is zero. **Zero-inflated Poisson** and **zero-inflated negative binomial models** accommodate this scenario by separating individuals into two latent groups: one that always has zero drink days, and another whose drink days follow a count distribution and can still be zero.

To determine the most suitable model, we utilize the `countfit` command, comparing Poisson, NBR, ZIP, and ZINB models.

`countfit drinkdaysperweek $predictors, inflate($predictors) forcevuong`

When using graphical representations for comparisons, our goal is **to have the plotted deviations between observed and predicted values as close to 0 as possible, indicating minimal deviation from our observed data**. In this instance, it appears that NBR, ZIP, and ZINB all perform well, with ZINB emerging as the most suitable fit.

Several familiar fit metrics are generated here, including AIC, BIC, and LRx2. **Lower AIC & BIC values and higher LRx2 scores might indicate a better fit.** This comparison, through metrics such as AIC, BIC, and likelihood ratio tests, reveals that while NBR, ZIP, and ZINB models all improve upon the basic Poisson model, ZINB provides the best overall fit, effectively handling both overdispersion and excess zeros.

#### Approach 2: manual approach

If you would rather run each regression separately, you can use the following commands.

```
poisson drinkdaysperweek $predictors
estimates store poisson
nbreg drinkdaysperweek $predictors
estimates store nbreg
zip drinkdaysperweek $predictors, inflate($predictors) forcevuong
estimates store zip
zinb drinkdaysperweek $predictors, inflate($predictors) forcevuong
estimates store zinb
```

In the `zip` and `zinb` commands, the `inflate` option tells Stata which variables we believe predict whether someone belongs to the group that always has zero drink days, as opposed to the group that may have some.
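Once the four models are stored, their information criteria can be lined up with `estimates stats`. This is a minimal sketch that assumes the stored names `poisson`, `nbreg`, `zip`, and `zinb` from the block above.

```stata
* Side-by-side log likelihood, AIC, and BIC for the four stored models
estimates stats poisson nbreg zip zinb
```

Lower AIC and BIC values point to the better-fitting model, which gives a manual counterpart to the fit statistics that `countfit` reports.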

## Step 4: Model Estimation and Interpretation

Based on our findings, we choose the best-fitting model (zero-inflated negative binomial model, in our example) and estimate it:

`zinb drinkdaysperweek $predictors, inflate($predictors) forcevuong irr`

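We can use `listcoef` to obtain interpretable coefficients: factor changes in the expected count for the count equation and factor changes in the odds (odds ratios) for the binary equation.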

```
listcoef, help // factor change in expected count and in odds
listcoef, percent help // the same effects as percentage changes
```

**The output shows two tables**: 1) **the percentage change in the expected count for those not always zero** (referred to as group ~A) and 2) **the factor change in odds of always being in the zero-count group (group A) compared to those not consistently in group A** (~A).
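The percentage changes are a transformation of the raw coefficients. As a sketch, with $\beta_k$ the coefficient on predictor $x_k$ and $\delta$ the size of the change in $x_k$ (notation introduced here), the percentage change in the expected count (count equation) or in the odds (binary equation) is:

$$
\%\Delta = 100 \times \left( e^{\beta_k \delta} - 1 \right)
$$

Setting $\delta = 1$ gives the effect of a one-unit change, and setting $\delta$ to the standard deviation of $x_k$ gives the effect of a one SD change. For example, a factor change of 1.009 corresponds to a 0.9% increase.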

In the zero-inflated Poisson (ZIP) or zero-inflated negative binomial (ZINB) model, the sample size remains the same for both parts of the model: the Poisson/NB count model (Part 1) and the logit model (Part 2). **The ZIP/ZINB model does not actually split the sample into two separate parts; instead, it considers the probability of excess zeros in the estimation process**. Please find this post for more information on the graphic and formula.

Part 1: The Poisson count model

In the Poisson count model, the ZIP model takes into account the possibility of both excess and true zeros by assigning probabilities to each observation. This process involves creating a matrix that allocates the likelihood of an observation being either an excess or true zero. The coefficients in the Poisson model are then calculated considering these assigned probabilities.

Part 2: The logit model

The logit model in the ZIP framework calculates the probability of an observation always being zero (0 or 1). This probability is determined through a logit process, which is used to refine the Poisson model by incorporating the excess zero vector values obtained from the logit model. The term “zero inflation” refers to this process of considering the excess zeros through the logit model, which enhances the accuracy of the Poisson model.

Across these two parts, the sample size itself remains unchanged, as the ZIP model does not physically separate the data into two distinct samples. Rather, the logit model is used to improve the Poisson model’s estimates. The hurdle model actually splits the sample itself into two parts, so it does not include “0” in the count model.

**For group ~A**, the data shows how different variables explain **the expected count of days with alcohol consumption**. For example, each additional year of age increases the expected count of alcohol consumption days by 0.9%, with a 16.7% increase for a one standard deviation (SD) increase in age. Being female decreases the expected count by 18.3% compared to male counterparts, with a 9.4% decrease for a one SD increase in being female. Similarly, belonging to different racial or ethnic groups affects the expected count: Non-Hispanic Blacks or African Americans see a 14.7% decrease, Hispanics a 25.7% decrease, Non-Hispanic Asians a 41.9% decrease, and Non-Hispanic Others a 9.9% decrease in expected alcohol consumption days, compared to Non-Hispanic White counterparts.

**For the binary equation** focusing on **the odds of being in the always zero-drinking group** (group A), being female increases the odds by 49.9%, with a 21.9% increase for a one SD increase in being female, compared to male counterparts. Age increases the odds of never drinking by 3.2%, with a 71.5% increase for a one SD increase in age. Ethnicity also plays a significant role, with Non-Hispanic Blacks or African Americans having a 48.8% increase in the odds, Non-Hispanic Asians a 142.4% increase, and Non-Hispanic Others a 91.6% increase in the odds of never drinking compared to Non-Hispanic White counterparts. Educational attainment significantly influences the odds of never drinking, with High School Graduates seeing a 43.8% decrease, Some College education a 65.3% decrease, and College Graduates or more a 78.0% decrease in the odds of never drinking, compared to participants with less than high school education.

Alternatively, we can use the `mchange` or `margins` commands to interpret the marginal effects. For more information, you can find the post on marginal effects here.
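As a starting point, the sketch below uses `margins` right after the `zinb` model from Step 4. It assumes that model is still the active estimation result and picks `age` purely as an example predictor.

```stata
* Average marginal effect of age on the expected number of drink days
margins, dydx(age)

* Average marginal effect of age on the probability of being in the always-zero group
margins, dydx(age) predict(pr)
```

The first command works on the predicted count, which is the default prediction after `zinb`, while `predict(pr)` switches to the probability of a degenerate (always) zero, so the two results speak to the count and binary parts of the model respectively.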

### Reference

Poisson Regression | Stata Data Analysis Examples (ucla.edu)

How can I use countfit in choosing a count model? | Stata FAQ (ucla.edu)

Models for Count Outcomes (nd.edu)

Regression with Count Variable | DATA with STATA (ubc.ca)

Count outcomes – Poisson regression (umn.edu)

Zero-inflated models (otago.ac.nz)

Chapter 18 Count Models | Econometrics for Business Analytics (bookdown.org)