[Stata] Graph: Scatterfit for Scatter Plot with Fit Lines
💻 Author’s GitHub: https://github.com/leojahrens/scatterfit
📒 Reference: Ahrens, L. (2023). SCATTERFIT: Stata module to produce scatter plots with fit lines. https://EconPapers.repec.org/RePEc:boc:bocode:s459198
In this blog post, I will show you how to use scatterfit, a user-written Stata command that produces a wide range of scatter plots with overlaid fit lines.
Basic command
The basic command is quite simple and straightforward!
ssc install scatterfit, replace // install the package (only needed the first time)
scatterfit y x [,options] // visualize the relationship between two variables y and x
slopefit y x z [,options] // visualize the relationship between y and x conditional on another continuous variable z
The package has many options to customize your graphs, such as binning the x variable, choosing different fit lines, adding confidence intervals, changing the look of the prediction line, and more. I will demonstrate some of these options with examples using the nhanes2 dataset.
Simple Example: NHANES II data
One example is better than a long explanation! Here, I am going to use nhanes2, an example dataset provided by Stata. NHANES II was conducted from 1976 to 1980 and focused on nutrition and health. Participants ranged in age from 6 months to 74 years. The dataset contains a number of demographic and socioeconomic variables, as well as physical and laboratory measurements. You can access the data files and documentation from the CDC website.
webuse nhanes2
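Before plotting, it can help to take a quick look at the variables used in this post. A minimal sketch using only built-in Stata commands:

```stata
* Load the example data and inspect the variables used below
webuse nhanes2, clear
describe bmi age hlthstat rural race
summarize bmi age
* Show every category of health status, including missing values
tabulate hlthstat, missing
```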
Let’s start with a simple scatter plot of bmi against age. To do this, we can type scatterfit bmi age. This produces a graph similar to the one below, which is much easier to read than a scatter plot without a fit line (the graph on the right comes from scatter bmi age, Stata's built-in command)!
scatterfit bmi age
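For comparison, a roughly similar picture can be put together with built-in commands only. This is a sketch, not what scatterfit runs internally. The marker size and axis titles are my own choices:

```stata
webuse nhanes2, clear
* Raw scatter with an overlaid linear fit, using only built-in twoway plots
twoway (scatter bmi age, msize(tiny)) (lfit bmi age), ///
    ytitle("BMI") xtitle("Age") legend(off)
```

The appeal of scatterfit is that it wraps this kind of overlay, plus the options shown below, into a single command.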
When the outcome variable (y) is measured on a Likert scale, a binned scatter plot is usually the better choice. Here is an example of a binned scatter plot for the relationship between age and health status. The labrec line first reverses the coding of hlthstat and sets the value 8 to missing. As you can see, the graph shows a negative relationship between health status and age, as expected. The graph also shows a linear fit line and its 95% confidence interval (requested here with the ci option), which scatterfit calculates automatically using the regress and predict commands.
labrec hlthstat (8=.)(5=1)(4=2)(2=4)(1=5)
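Note that labrec is not a built-in Stata command, so it has to be installed separately. If you do not have it, the same numeric mapping can be sketched with the built-in recode command. Here health_rev is a hypothetical variable name that I introduce so the original hlthstat stays untouched. Unlike the line above, this sketch does not carry the value labels over:

```stata
webuse nhanes2, clear
* Same mapping as above: 8 becomes missing, 1..5 is reversed (3 stays 3)
recode hlthstat (8=.) (5=1) (4=2) (2=4) (1=5), generate(health_rev)
* Cross-tabulate old against new values to verify the mapping
tabulate hlthstat health_rev, missing
```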
scatterfit hlthstat age, binned ci
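To see what binning does conceptually, here is a rough by-hand sketch with built-in commands. The number of bins (20) and the variable name age_bin are my assumptions, and scatterfit's own binning and fitting may differ. In particular, the line below is fitted through the bin means rather than through the underlying observations:

```stata
webuse nhanes2, clear
recode hlthstat (8=.) (5=1) (4=2) (2=4) (1=5)
preserve
* Split age into 20 roughly equal-sized bins (20 is an assumed value)
xtile age_bin = age, nquantiles(20)
* Replace the data with one row per bin holding the bin means
collapse (mean) hlthstat age, by(age_bin)
* Plot the bin means with a linear fit through them
twoway (scatter hlthstat age) (lfit hlthstat age), legend(off)
restore
```

Each point is the average health status within one age bin, which is why a binned plot stays readable even when the outcome only takes five distinct values.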
scatterfit hlthstat age, binned ci by(rural)
You can also split the plot by a grouping variable by adding the by(groupvar) option! In this graph, living in a rural area is associated with lower health status than living in an urban area.
Finally, scatterfit also lets you draw the scatter plot while adjusting for control variables, as in a regression. For example, you can control for bmi and race as follows. The controls option is for continuous control variables (here, bmi), while the fcontrols option is for categorical control variables (here, race).
scatterfit hlthstat age, binned ci controls(bmi) fcontrols(race) by(rural)
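One way to think about the adjusted plot is as the graphical counterpart of a regression with the same controls. The sketch below uses built-in commands and is only an analogy, not necessarily how scatterfit computes its graph. The grid of ages passed to margins is my own choice, and the by(rural) split is left out for brevity:

```stata
webuse nhanes2, clear
recode hlthstat (8=.) (5=1) (4=2) (2=4) (1=5)
* Linear regression of health status on age, holding bmi and race constant
regress hlthstat age bmi i.race
* Predicted health status at selected ages, averaged over the controls
margins, at(age=(20(10)70))
marginsplot
```

The coefficient on age in the regression output describes the slope of health status in age after adjusting for bmi and race.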
I hope this blog post has given you a good overview of how to use scatterfit and slopefit in Stata. These commands are very flexible and powerful tools for creating scatter plots with fit lines. You can find more information and examples in the help files of scatterfit and slopefit, or on the GitHub page of the package.
help scatterfit
help slopefit
Happy plotting!