[Stata] Pearson correlation analysis and plotting correlations (pwcorr and heatplot)

Correlation analysis is a statistical technique that measures the strength and direction of the relationship between two variables. In this blog post, I will show you how to conduct pairwise correlation analysis in Stata. Pairwise correlation treats each pair of variables separately: for each pair, it uses every observation that has non-missing values on both variables.

Pearson correlation coefficients

The Pearson correlation coefficient measures the linear relationship between two continuous variables. It indicates both the strength and the direction of the association. Its value lies between -1 and 1, where -1 means a perfect negative correlation, 1 means a perfect positive correlation, and 0 means no linear relationship.

r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}
  1. r (Pearson correlation coefficient): This is the value we’re trying to calculate, and it ranges between -1 and 1.
  2. n (Number of data points): This represents how many pairs of values you have for X and Y.
  3. Xᵢ and Yᵢ: These are individual data points or observations from X and Y, respectively. The subscript “i” represents a specific data point out of the total “n” data points.
  4. X̅ (mean of X) and Ȳ (mean of Y): These are the averages (means) of all the X and Y values, respectively. They represent the central tendencies of the data.

Calculation Steps
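
The formula can be worked through step by step. As a minimal sketch, the do-file below computes r by hand and compares it with Stata's built-in correlate command. It assumes the auto dataset that ships with Stata, with mpg standing in for X and weight for Y; the helper names (xbar, dx, sxy, and so on) are made up for illustration.

```stata
sysuse auto, clear

* Step 1: means of X (mpg) and Y (weight)
quietly summarize mpg
scalar xbar = r(mean)
quietly summarize weight
scalar ybar = r(mean)

* Step 2: deviations from the means, their product, and their squares
generate double dx   = mpg - scalar(xbar)
generate double dy   = weight - scalar(ybar)
generate double dxdy = dx * dy
generate double dx2  = dx^2
generate double dy2  = dy^2

* Step 3: sum each column (the numerator and the two terms under the square roots)
quietly summarize dxdy
scalar sxy = r(sum)
quietly summarize dx2
scalar sxx = r(sum)
quietly summarize dy2
scalar syy = r(sum)

* Step 4: plug the sums into the formula
scalar r_hand = scalar(sxy) / (sqrt(scalar(sxx)) * sqrt(scalar(syy)))
display "r by hand = " %7.4f scalar(r_hand)

* Check against the built-in command
correlate mpg weight
```

The hand-computed value should match the coefficient reported by correlate, since both variables have no missing values in this dataset.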

For a deeper understanding of the formula, watch the video "Pearson correlation [Simply explained]".

To interpret the Pearson correlation coefficient, you can use the following guidelines:

  • A value close to 0 means that there is little or no linear relationship between the variables (a nonlinear relationship may still exist).
  • A value close to 1 or -1 means that there is a strong linear relationship between the variables.
  • A positive value means that as one variable increases, the other variable also tends to increase.
  • A negative value means that as one variable increases, the other variable tends to decrease.
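
To see why the guidelines above speak only of linear relationships, here is a small simulated sketch (the data are made up for illustration, not taken from any real study). y is completely determined by x, yet because the relationship is a symmetric curve rather than a line, the Pearson coefficient comes out at approximately zero.

```stata
clear
set obs 101

* x runs symmetrically from -50 to 50
generate x = _n - 51

* y depends perfectly on x, but through a U-shaped (quadratic) curve
generate y = x^2

* The linear correlation is approximately zero despite the perfect dependence
correlate x y

* A scatter plot makes the nonlinear pattern obvious
scatter y x
```

This is one reason to look at a scatter plot before trusting a coefficient near 0.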

For example, if the Pearson correlation coefficient between height and weight is 0.7, there is a fairly strong positive linear relationship between height and weight: as height increases, weight also tends to increase. The interactive graph by Dr. John V. Kane is also helpful for building intuition!

pwcorr command

To perform pairwise correlation, use the pwcorr command followed by the list of variables that you want to examine. For example, to find the pairwise correlation coefficients between var1, var2, and var3 along with the significance level of each pair, use the following command. Stata will display a matrix of correlation coefficients with the p-value printed beneath each coefficient. Add the obs option if you also want the number of observations used for each pair.

Stata
pwcorr var1 var2 var3, sig 

With the star(.05) option, Stata displays an asterisk next to each coefficient that is significant at the 5% level (95% confidence). You can change the threshold by passing a different value, for example star(.01).

Stata
pwcorr var1 var2 var3, sig star(.05)
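
As a runnable sketch of the same command, the example below swaps the placeholder variables for price, mpg, and rep78 from the auto dataset that ships with Stata (my choice for illustration). Because rep78 contains missing values, adding the obs option shows how the number of observations differs from pair to pair, and the listwise option shows the alternative of using only complete cases.

```stata
sysuse auto, clear

* Pairwise deletion (the default): each pair uses every observation
* that is non-missing on both variables, so N can differ across pairs
pwcorr price mpg rep78, obs sig star(.05)

* Listwise deletion: only observations that are complete on all
* listed variables are used, so N is the same for every pair
pwcorr price mpg rep78, listwise obs sig star(.05)
```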

Rule of Thumb

There is no definitive rule of thumb for interpreting the size of the correlation coefficient in social science, as different fields and contexts may have different standards and expectations. Cohen (1988) proposed benchmarks of roughly 0.10, 0.30, and 0.50 for small, medium, and large correlations. A finer-grained guideline that is also widely used classifies the absolute value of the coefficient as follows:

Correlation coefficient (absolute value)    Strength of correlation
< 0.20                                      Very weak or negligible
0.20-0.40                                   Weak
0.40-0.60                                   Moderate
0.60-0.80                                   Strong
0.80-1.00                                   Very strong

These values are not absolute; they depend on the research question and the domain of study. For example, in psychology, a correlation of 0.3 may be considered meaningful, while in the natural sciences, where measurements tend to be more precise, even a correlation of 0.9 may be considered unremarkable.

Another way to interpret the correlation coefficient is to look at the coefficient of determination (r^2), which is the square of the correlation coefficient. It represents the proportion of variance in one variable that is explained by the variance in the other variable. For example, if r = 0.7, then r^2 = 0.49, which means that 49% of the variation in one variable can be accounted for by the variation in the other variable.
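
As a small sketch of this calculation in Stata, the example below squares the coefficient that correlate leaves behind in r(rho). The auto dataset and the variables mpg and weight are my own choice for illustration.

```stata
sysuse auto, clear

* correlate stores the coefficient of the pair in r(rho)
quietly correlate mpg weight

* Show r and its square, the coefficient of determination
display "r   = " %6.4f r(rho)
display "r^2 = " %6.4f r(rho)^2
```

The second number is the share of the variance in one variable that is linearly accounted for by the other.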

Exporting output: asdoc command

The following is the asdoc command, developed by a (very smart) professor. Personally, I find it handier than the putdocx command, so I use asdoc much more frequently. Detailed explanations can be found in the developer’s blog.

Stata
net install asdoc, from(http://fintechprofessor.com) replace // install package

With the asdoc command, you can save most tables, such as summary statistics, correlations, regressions, frequency tables, t-tests, and more, in a pretty format!

Stata
asdoc pwcorr var1 var2 ..., listwise label dec(3) sig replace star(0.05)
// Alternatively, the star(all) replace nonum options allow for automatic *** labeling of significant correlation coefficients.

The output will be saved in a Word document as follows.

Interpretation of Pearson correlation coefficient:

Plotting: heatplot command

To visualize the correlations, you can use heatplot on the list of variables. If you are using it for the first time, you need to install heatplot and the packages it depends on.

Stata
ssc install heatplot, replace
ssc install palettes, replace
ssc install colrspace, replace

Then, run the correlation analysis, store the resulting correlation matrix as corrmatrix, and pass that matrix to heatplot. The options below are my favorites. You can change them to your preference by referring to help heatplot.

Stata
pwcorr var1 var2 ...            // run the pairwise correlations
matrix corrmatrix = r(C)        // copy the correlation matrix that pwcorr stores in r(C)
heatplot corrmatrix, values(format(%4.3f)) legend(off) aspectratio(1) color(hcl diverging, intensity(.6)) lower nodiagonal

This command will return the heatplot as follows.

More information about heatplot: https://thedatahall.com/how-to-make-heatplot-in-stata-correlation-heat-plot/
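
Before plotting, it can help to confirm what was actually stored in the matrix. The sketch below uses the auto dataset and four of its variables (my choice for illustration) and prints the stored matrix with Stata's built-in matrix list command; this matrix can then be passed to heatplot as in the command above.

```stata
sysuse auto, clear

* Run the pairwise correlations and keep the resulting matrix
pwcorr price mpg weight length
matrix corrmatrix = r(C)

* Inspect the stored matrix: the row and column names are the variable
* names, and these names are what label the axes of the plot
matrix list corrmatrix, format(%6.3f)
```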

Plotting: scatterfit command

A scatter plot is the most direct way to visualize a bivariate correlation. You can draw one with a fitted line using Stata's built-in twoway command.

Stata
twoway (scatter yvar xvar) (lfit yvar xvar)
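
As a hedged extension of this command, the sketch below prints the Pearson coefficient in a note under the plot. The auto dataset, the variables weight and mpg, the axis titles, and the local macro name r are all my own choices for illustration. Run it from a do-file, since the /// line continuation does not work in the Command window.

```stata
sysuse auto, clear

* Compute the coefficient and store a formatted copy in a local macro
quietly correlate weight mpg
local r : display %5.3f r(rho)

* Scatter plot with a linear fit, annotated with the coefficient
twoway (scatter weight mpg) (lfit weight mpg), ///
    ytitle("Weight (lbs.)") xtitle("Mileage (mpg)") ///
    note("Pearson r = `r'") legend(off)
```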

You can also find out how to use the scatterfit command in Stata in the following post.

Reference

Correlation Coefficient | Types, Formulas & Examples (scribbr.com)

Correlation | Stata Annotated Output (ucla.edu)

  • September 12, 2023