# Sorting correlation coefficients by their magnitudes in a SAS macro

#### Theoretical Background

Many statisticians and data scientists use the correlation coefficient to study the relationship between 2 variables.  For 2 random variables, $X$ and $Y$, the correlation coefficient between them is defined as their covariance scaled by the product of their standard deviations.  Algebraically, this can be expressed as

$\rho_{X, Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$.

In real life, you can never know what the true correlation coefficient is, but you can estimate it from data.  The most common estimator for $\rho$ is the Pearson correlation coefficient, which is defined as the sample covariance between $X$ and $Y$ divided by the product of their sample standard deviations.  Since there is a common factor of

$\frac{1}{n - 1}$

in the numerator and the denominator, they cancel out each other, so the formula simplifies to

$r_P = \frac{\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_i - \bar{x})^2 \sum_{i = 1}^{n}(y_i - \bar{y})^2}}$.

In predictive modelling, you may want to find the covariates that are most correlated with the response variable before building a regression model.  You can do this by

1. computing the correlation coefficients
2. obtaining their absolute values
3. sorting them by their absolute values.

#### The SAS Macro

Since starting my new job at Environics Analytics last year, I have noticed that my co-workers like to perform univariate correlational analysis as the first phase in regression modelling, and that is a sensible thing to do.  I have written the following SAS macro, “corr_sort()”, to facilitate this process.

There are 4 arguments in this macro:

• The name of the data set with the variables for correlational analysis
• One target variable
• PROC CORR can handle multiple target variables, but I have to sort the absolute values of the correlation coefficients for one chosen variable, so I have written the macro to handle only one target variable
• The covariates
• Multiple covariates can be used, as demonstrated below in the example.  Just separate them by a space.
• Your chosen name for the output data set

It’s important to emphasize that there is no such distinction between “target” and “covariate” in correlational analysis – there is no directionality between the 2 sets of variables to be correlated.  However, in PROC CORR, there is a need to specify one set of variables in the VAR statement and another set of variables in the WITH statement, so I have chosen the names “target” and “covariates” in my macro.

Here is the code for the macro.  If you wish to use it in your own SAS script, you can invoke it by using the “%include” statement.

%macro corr_sort(ds, target, covariates, output_name);

* get the Pearson correlation coefficients between the target and the covariates;
* export them into a data set called "c1";
* suppress the printing of any output;
ods select none;
proc corr
data = &ds;
var &target;
with &covariates;
ods output pearsoncorr = c1;
run;
ods select all;

* give proper labels to the results;
* drop the number of cases for each target;
data c2;
set c1;

abs_corr = abs(&target);
label abs_corr = 'Absolute Value of Correlation';

label &target = 'Pearson Correlation';
rename &target = corr;

label p&target = 'P-Value';
rename p&target = PValue;

run;

* sort the results by the absolute value of the correlation coefficients;
proc sort
data = c2
out = &output_name;
by
descending
abs_corr;
run;

%mend;



#### Using the SAS Macro in an Example

Here is an example of using this macro with the CLASS data set in the SASHELP library that all SAS users can access.

%corr_sort(sashelp.class, age, height weight, corrs);

proc print
data = corrs;
title1 'Correlation with age';
run;

Here is the output as seen in the Results Viewer in SAS.

 Correlation with age

 Obs Variable corr PValue abs_corr 1 Height 0.81143 <.0001 0.81143 2 Weight 0.74089 0.0003 0.74089