[Stata] Graph: Scatterfit for Scatter Plot with Fit Lines

💻 Author’s GitHub: https://github.com/leojahrens/scatterfit

📒 Reference: Ahrens, L. (2023). SCATTERFIT: Stata module to produce scatter plots with fit lines. https://EconPapers.repec.org/RePEc:boc:bocode:s459198

In this blog post, I will show you how to use scatterfit, a user-written Stata command that produces a wide range of scatter plots with overlaid fit lines.

Basic command

The basic command is quite simple and straightforward!

ssc install scatterfit, replace // install the package (only needed the first time)
scatterfit y x [,options] // visualize the relationship between two variables y and x
slopefit y x z [,options] // visualize the relationship between y and x conditional on another continuous variable z
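If you put the installation line in a do-file, you may not want to reinstall the package on every run. Here is a minimal sketch of one way to install only when the command is missing. It uses the built-in which command, which returns a nonzero return code when a command cannot be found.

```stata
* Check whether scatterfit is already installed; capture suppresses the error
capture which scatterfit
if _rc {
    * Nonzero return code: the command was not found, so install it from SSC
    ssc install scatterfit, replace
}
* Show where the installed file lives
which scatterfit
```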

The package has many options to customize your graphs, such as binning the x variable, choosing different fit lines, adding confidence intervals, changing the look of the prediction line, and more. I will demonstrate some of these options with examples using the nhanes2 dataset.

Simple Example: NHANES II data

One example is better than a long explanation! Here, I am going to use nhanes2, an example dataset provided by Stata. NHANES II was conducted from 1976 to 1980 and focused on nutrition and health. The survey covered participants from 6 months to 74 years of age. The dataset contains a number of demographic and socioeconomic variables, as well as physical and laboratory measurements. You can access the data files and documentation from the CDC website.

webuse nhanes2
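Before plotting, it can help to take a quick look at the variables used in the rest of this post. The sketch below relies only on built-in Stata commands, and the variable names are the ones that appear in the examples that follow.

```stata
* Load the example data (clear drops any data already in memory)
webuse nhanes2, clear
* Storage types and labels of the variables used in this post
describe bmi age hlthstat rural race
* Ranges of the two continuous variables
summarize bmi age
* Categories of self-reported health status, including missing values
tabulate hlthstat, missing
```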

Let’s start with a simple scatter plot of bmi against age. Typing scatterfit bmi age produces a scatter plot with a fit line overlaid, which is much easier to read than a plain scatter plot without a fit line (shown on the right for comparison, produced by scatter bmi age, Stata’s built-in command)!

scatterfit bmi age
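For comparison, here is a rough sketch of how you could get a similar picture with built-in commands only. It is not what scatterfit runs internally. It just puts the plain scatter plot next to a twoway graph with a linear fit line. The graph names plain and with_fit are arbitrary names chosen for this example.

```stata
webuse nhanes2, clear
* Plain scatter plot, as produced by the built-in command
scatter bmi age, msize(tiny) name(plain, replace)
* Built-in overlay of a linear fit line on the same scatter plot
twoway (scatter bmi age, msize(tiny)) (lfit bmi age), name(with_fit, replace)
* Show the two graphs side by side
graph combine plain with_fit
```

With scatterfit you get the fit line in one short command, and the options shown later in this post build on it.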

When the outcome variable (y) is measured on a Likert scale, a binned scatter plot is usually easier to read than a raw one. Here is an example for the relationship between age and health status. The first line below reverses the 1-5 coding of hlthstat and sets code 8 to missing. The binned option then plots bin means instead of individual observations, and the ci option adds a 95% confidence interval around the linear fit line, which scatterfit computes for you automatically. As you can see, the graph shows a negative relationship between health status and age, as expected.

labrec hlthstat (8=.)(5=1)(4=2)(2=4)(1=5)
scatterfit hlthstat age, binned ci
scatterfit hlthstat age, binned ci by(rural)

You can also split the plot by a grouping variable by adding the by(groupvar) option, as in the second scatterfit command above! In this graph, living in a rural area is associated with lower health status than living in an urban area.
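To see what binning does conceptually, here is a minimal hand-rolled sketch using built-in commands. It is an illustration under my own assumptions, not scatterfit's actual implementation: the 20 bins, the new variable names health_rev and age_bin, and the use of the built-in recode command are all choices made for this example.

```stata
webuse nhanes2, clear
* Reverse-code health status into a new variable; code 8 becomes missing
recode hlthstat (8=.) (5=1) (4=2) (2=4) (1=5), generate(health_rev)
* Cut age into 20 quantile bins (the number of bins is an arbitrary choice)
xtile age_bin = age, nquantiles(20)
* Work on bin means temporarily, then bring the original data back
preserve
collapse (mean) health_rev age, by(age_bin)
* One point per bin; note this fit line is estimated on the bin means
twoway (scatter health_rev age) (lfit health_rev age)
restore
```

Each point is now the average health status within an age bin, which is why a binned plot stays readable even when the outcome takes only five values.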

Finally, scatterfit can also plot the relationship after adjusting for control variables, as in a regression. For example, you can control for bmi and race as follows. The controls() option takes continuous control variables (here, bmi), while the fcontrols() option takes categorical control variables (here, race).

scatterfit hlthstat age, binned ci controls(bmi) fcontrols(race) by(rural)
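One common way to think about "controlling for" variables in a scatter plot is residualization: regress both y and x on the controls and plot the residuals against each other. The sketch below illustrates that idea with built-in commands. I am not claiming this is exactly how scatterfit handles controls, and the variable names health_rev, insample, y_res, and x_res are made up for this example.

```stata
webuse nhanes2, clear
* Reverse-code health status into a new variable; code 8 becomes missing
recode hlthstat (8=.) (5=1) (4=2) (2=4) (1=5), generate(health_rev)
* Use the same observations in both regressions
generate byte insample = !missing(health_rev, age, bmi, race)
* Part of health status not explained by bmi and race
regress health_rev bmi i.race if insample
predict double y_res if insample, residuals
* Part of age not explained by bmi and race
regress age bmi i.race if insample
predict double x_res if insample, residuals
* Relationship between the two residualized variables
twoway (scatter y_res x_res, msize(tiny)) (lfit y_res x_res)
```

The slope of this fit line equals the coefficient on age from regress health_rev age bmi i.race on the same sample, so the plot shows the adjusted relationship.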

I hope this blog post has given you a good overview of how to use scatterfit and slopefit in Stata. These commands are very flexible and powerful tools for creating scatter plots with fit lines. You can find more information and examples in the help files of scatterfit and slopefit, or on the GitHub page of the package.

help scatterfit
help slopefit

Happy plotting!

  • August 17, 2023