Sorting correlation coefficients by their magnitudes in a SAS macro

Theoretical Background

Many statisticians and data scientists use the correlation coefficient to study the relationship between 2 variables.  For 2 random variables, X and Y, the correlation coefficient between them is defined as their covariance scaled by the product of their standard deviations.  Algebraically, this can be expressed as

\rho_{X, Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}.

In real life, you can never know what the true correlation coefficient is, but you can estimate it from data.  The most common estimator for \rho is the Pearson correlation coefficient, which is defined as the sample covariance between X and Y divided by the product of their sample standard deviations.  Since there is a common factor of

\frac{1}{n - 1}

in the numerator and the denominator, they cancel out each other, so the formula simplifies to

r_P = \frac{\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_i - \bar{x})^2 \sum_{i = 1}^{n}(y_i - \bar{y})^2}} .


In predictive modelling, you may want to find the covariates that are most correlated with the response variable before building a regression model.  You can do this by

  1. computing the correlation coefficients
  2. obtaining their absolute values
  3. sorting them by their absolute values.

Read more of this post