# [Stata] Pearson correlation analysis and plotting correlations (pwcorr and heatplot)

Correlation analysis is a statistical technique that measures the strength and direction of the relationship between two variables. In this blog post, I will show you how to conduct pairwise correlation analysis in Stata. **Pairwise correlation** treats each pair of variables separately and only includes observations that have valid values for each pair in the data set.

### Pearson correlation coefficients

The Pearson correlation coefficient is a measure of the linear relationship between two continuous variables. It indicates the strength and the direction of the association between the variables. **It has a value between -1 and 1, where -1 means a perfectly negative correlation and 1 means a perfectly positive correlation**.

$$
r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}
$$

- **r (Pearson correlation coefficient):** This is the value we’re trying to calculate, and it ranges between -1 and 1.
- **n (Number of data points):** This represents how many pairs of values you have for X and Y.
- **Xᵢ and Yᵢ:** These are individual data points or observations from X and Y, respectively. The subscript “i” represents a specific data point out of the total “n” data points.
- **X̅ (Mean of X) and Ȳ (Mean of Y):** These are the averages (means) of all the X and Y values, respectively. They represent the central tendencies of the data.

**Calculation Steps**
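The formula can be traced step by step in Stata itself. The following is only a minimal sketch: it assumes Stata's built-in `auto` example dataset and the variables `mpg` and `weight`, none of which appear in the original post, and the helper names (`dx`, `dy`, `r_manual`, and so on) are made up for illustration.

```stata
sysuse auto, clear                      // example data shipped with Stata (assumption)

* Step 1: means of X and Y
quietly summarize mpg
scalar xbar = r(mean)
quietly summarize weight
scalar ybar = r(mean)

* Step 2: deviations from the means, their product, and their squares
generate double dx   = mpg - xbar
generate double dy   = weight - ybar
generate double dxdy = dx * dy
generate double dx2  = dx^2
generate double dy2  = dy^2

* Step 3: the sums that appear in the numerator and the denominator
quietly summarize dxdy
scalar sxy = r(sum)
quietly summarize dx2
scalar sxx = r(sum)
quietly summarize dy2
scalar syy = r(sum)

* Step 4: put the pieces together
scalar r_manual = sxy / (sqrt(sxx) * sqrt(syy))
display "r computed by hand: " r_manual

* Cross-check against Stata's own calculation
quietly correlate mpg weight
display "r from correlate:   " r(rho)
```

Both `display` lines should print the same number, which is a quick way to confirm that the formula and the command agree.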

For a deeper understanding of the formula, please watch this video!

To interpret the Pearson correlation coefficient, you can use the following guidelines:

- A value close to 0 means that there is little or no linear relationship between the variables.
- A value close to 1 or -1 means that there is a strong linear relationship between the variables.
- A positive value means that as one variable increases, the other variable also tends to increase.
- A negative value means that as one variable increases, the other variable tends to decrease.

For example, if the Pearson correlation coefficient between height and weight is 0.7, there is a fairly strong positive linear relationship between height and weight. As height increases, weight also tends to increase. The following interactive graph by Dr. John V. Kane would also be helpful for understanding!

### `pwcorr` command

To perform pairwise correlation, use the `pwcorr` command followed by the list of variables that you want to examine. For example, if you want to find the pairwise correlation coefficients between var1, var2, and var3 along with the significance level of each pair, you can use the following command. Stata will display a matrix of correlation coefficients with the p-value printed beneath each coefficient; add the `obs` option to also show the number of observations used for each pair.

`pwcorr var1 var2 var3, sig`

With the `star(.05)` option, Stata will display an asterisk next to each coefficient that is significant at the 5% level. You can change the significance level by changing the number inside `star()`.

`pwcorr var1 var2 var3, sig star(.05)`
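To see what "pairwise" means in practice, here is a small sketch that can be run as-is. It assumes Stata's built-in `auto` dataset (not used in the original post), in which `rep78` has a few missing values, and compares the default pairwise behavior with the `listwise` option.

```stata
sysuse auto, clear                      // example data shipped with Stata (assumption)

* Pairwise deletion (default): each pair uses every observation
* that is non-missing on both variables of that pair
pwcorr price mpg rep78, sig obs star(.05)

* Listwise deletion: only observations that are non-missing
* on all listed variables are used for every pair
pwcorr price mpg rep78, sig obs star(.05) listwise
```

In the first table the number of observations for `price` and `mpg` should be larger than for the pairs involving `rep78`; in the second table every pair uses the same, smaller sample.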

**Rule of Thumb**

There is no definitive rule of thumb for interpreting the size of the correlation coefficient in social science, as different fields and contexts may have different standards and expectations. Cohen (1988), for example, suggested benchmarks of roughly 0.10, 0.30, and 0.50 for small, medium, and large correlations. Another commonly used guideline divides the absolute value of the coefficient into finer bands:

Correlation coefficient (absolute value) | Strength of correlation |
---|---|
< 0.20 | Very weak or negligible |
0.20-0.40 | Weak |
0.40-0.60 | Moderate |
0.60-0.80 | Strong |
0.80-1.00 | Very strong |

These values are not absolute, but rather relative to the research question and the domain of study. For example, in psychology, a correlation of 0.3 may be considered meaningful, while in the natural sciences, even a correlation of 0.9 may be considered unremarkable.

Another way to interpret the correlation coefficient is to look at the coefficient of determination (r^2), which is the square of the correlation coefficient. It represents the proportion of variance in one variable that is explained by the variance in the other variable. For example, if r = 0.7, then r^2 = 0.49, which means that 49% of the variation in one variable can be accounted for by the variation in the other variable.
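A quick way to get r^2 in Stata is to square the stored result from `correlate`. The sketch below again assumes the built-in `auto` dataset and the variables `mpg` and `weight`, which are not part of the original post; the scalar names are arbitrary.

```stata
sysuse auto, clear                      // example data shipped with Stata (assumption)

* correlate stores the coefficient for two variables in r(rho)
quietly correlate mpg weight
scalar rho    = r(rho)
scalar rho_sq = rho^2
display "r   = " %6.3f rho
display "r^2 = " %6.3f rho_sq

* With a single predictor, the R-squared of a linear regression
* equals the squared Pearson correlation
quietly regress mpg weight
display "R-squared from regress = " %6.3f e(r2)
```

The last two numbers should match, which illustrates why r^2 is read as the share of variance explained.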

### Exporting output: `asdoc` command

The `asdoc` command was developed by a (very smart) professor. Personally, I find it handier than the `putdocx` command, so I use `asdoc` much more frequently. Detailed explanations can be found in the developer’s blog.

`net install asdoc, from(http://fintechprofessor.com) replace // install package`

With the `asdoc` command, you can save most tables, such as summary statistics, correlations, regressions, frequency tables, t-tests, and more, in a pretty format!

`asdoc pwcorr var1 var2 ..., listwise label dec(3) sig replace star(0.05) // alternatively, the star(all) replace nonum options label significant correlation coefficients with asterisks (***) automatically`

The output will be saved in a Word document as follows.

Interpretation of Pearson correlation coefficient:

### Plotting: `heatplot` command

To visualize the correlations, you can use `heatplot` on the list of variables. If you are using it for the first time, install the required packages first.

```
ssc install heatplot, replace
ssc install palettes, replace
ssc install colrspace, replace
```

Then, run the correlation analysis, store the results in a matrix named `corrmatrix`, and pass that matrix to `heatplot`. The options below are my favorites. You can change them to your preference by referring to `help heatplot`.

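```
pwcorr var1 var2 ..                // run the pairwise correlation analysis
matrix corrmatrix = r(C)           // store the correlation matrix returned by pwcorr
// values shown with 3 decimals, lower triangle only, diagonal omitted
heatplot corrmatrix, values(format(%4.3f)) legend(off) aspectratio(1) color(hcl diverging, intensity(.6)) lower nodiagonal
```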
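For a version that runs without your own data, here is a sketch using Stata's built-in `auto` dataset. The dataset and the four variables are my assumptions for illustration; the `heatplot` options are the same ones shown above, and the three packages must already be installed.

```stata
sysuse auto, clear                      // example data shipped with Stata (assumption)

* Pairwise correlations among four illustrative variables
pwcorr price mpg weight length

* pwcorr leaves the correlation matrix behind in r(C)
matrix corrmatrix = r(C)
matrix list corrmatrix                  // inspect the matrix before plotting

* Heat plot of the lower triangle, with coefficients printed in each cell
heatplot corrmatrix, values(format(%4.3f)) legend(off) aspectratio(1) color(hcl diverging, intensity(.6)) lower nodiagonal
```

Listing the matrix first is a cheap check that the rows and columns are the variables you expect before the plot is drawn.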

This command will return the heatplot as follows.

More information about `heatplot`: https://thedatahall.com/how-to-make-heatplot-in-stata-correlation-heat-plot/

### Plotting: `scatterfit` command

The best way to plot the bivariate correlation is a scatter plot. You can draw a scatter plot with a fitted line using the following command.

`twoway (scatter yvar xvar) (lfit yvar xvar)`
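As a slightly fuller sketch, the correlation coefficient can be printed in the plot title. This again assumes the built-in `auto` dataset with `mpg` and `weight` (not from the original post); the local macro name `r` and the axis titles are arbitrary. The `///` line continuations work in a do-file, not when typed line by line in the Command window.

```stata
sysuse auto, clear                      // example data shipped with Stata (assumption)

* Store the correlation coefficient, rounded to two decimals, in a local macro
quietly correlate mpg weight
local r : display %4.2f r(rho)

* Scatter plot with a fitted line and the coefficient in the title
twoway (scatter mpg weight) (lfit mpg weight), ///
    title("mpg vs. weight (r = `r')")          ///
    ytitle("Mileage (mpg)") xtitle("Weight (lbs.)") legend(off)
```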

You can also find out how to use the `scatterfit` command in Stata in the following post.

### Reference

Correlation Coefficient | Types, Formulas & Examples (scribbr.com)