March 21, 2017
Many statisticians and data scientists use the correlation coefficient to study the relationship between 2 variables. For 2 random variables, $X$ and $Y$, the correlation coefficient between them is defined as their covariance scaled by the product of their standard deviations. Algebraically, this can be expressed as

$$\rho = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
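In real life, you can never know what the true correlation coefficient is, but you can estimate it from data. The most common estimator for $\rho$ is the Pearson correlation coefficient, which is defined as the sample covariance between $X$ and $Y$ divided by the product of their sample standard deviations. Since there is a common factor of $\frac{1}{n-1}$ in the numerator and the denominator, they cancel each other out, so the formula simplifies to

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Here, $n$ is the sample size, and $\bar{x}$ and $\bar{y}$ are the sample means of the observed values of $X$ and $Y$.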
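To make the cancellation concrete, here is a minimal sketch in plain Python (standard library only). The function names `pearson_r` and `pearson_r_unsimplified` and the toy data are made up for illustration; they are not from any particular package. The first function uses the simplified form built from sums of deviations, and the second keeps the divisors of n - 1 in the sample covariance and sample standard deviations.

```python
import math


def pearson_r(x, y):
    """Pearson correlation from sums of deviations (simplified form)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def pearson_r_unsimplified(x, y):
    """Same estimator, keeping the 1 / (n - 1) factors explicit."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample covariance and sample standard deviations, each with n - 1.
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    sd_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cov_xy / (sd_x * sd_y)


# Toy data, invented for this example.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r_simple = pearson_r(x, y)
r_full = pearson_r_unsimplified(x, y)
print(r_simple, r_full)
```

Up to floating-point rounding, the two functions return the same value, because the factor of 1 / (n - 1) appears once in the numerator and once (as the product of two square roots) in the denominator.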
In predictive modelling, you may want to find the covariates that are most correlated with the response variable before building a regression model. You can do this by
- computing the correlation coefficients
- obtaining their absolute values
- sorting them by their absolute values.
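The three steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the covariate names `x1`, `x2`, `x3`, the helper names `pearson_r` and `rank_by_abs_correlation`, and the data are all hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def rank_by_abs_correlation(covariates, response):
    """Return (name, r) pairs sorted by |r|, largest first."""
    # Step 1: compute the correlation of each covariate with the response.
    correlations = {
        name: pearson_r(values, response) for name, values in covariates.items()
    }
    # Steps 2 and 3: sort by absolute value, keeping the signed r for reporting.
    return sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical covariates and response, invented for this example.
covariates = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [5.0, 4.0, 2.0, 3.0, 1.0],
    "x3": [2.0, 1.0, 4.0, 3.0, 5.0],
}
response = [2.0, 4.0, 5.0, 4.0, 5.0]

ranking = rank_by_abs_correlation(covariates, response)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

In this toy data set, `x2` ranks first even though its correlation with the response is negative, which is why the sort uses absolute values rather than the signed coefficients.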