## Sorting correlation coefficients by their magnitudes in a SAS macro

#### Theoretical Background

Many statisticians and data scientists use the correlation coefficient to study the relationship between 2 variables.  For 2 random variables, $X$ and $Y$, the correlation coefficient between them is defined as their covariance scaled by the product of their standard deviations.  Algebraically, this can be expressed as

$\rho_{X, Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$.

In real life, you can never know what the true correlation coefficient is, but you can estimate it from data.  The most common estimator for $\rho$ is the Pearson correlation coefficient, which is defined as the sample covariance between $X$ and $Y$ divided by the product of their sample standard deviations.  Since there is a common factor of

$\frac{1}{n - 1}$

in the numerator and the denominator, they cancel out each other, so the formula simplifies to

$r_P = \frac{\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_i - \bar{x})^2 \sum_{i = 1}^{n}(y_i - \bar{y})^2}}$.

In predictive modelling, you may want to find the covariates that are most correlated with the response variable before building a regression model.  You can do this by

1. computing the correlation coefficients
2. obtaining their absolute values
3. sorting them by their absolute values.

## Potato Chips and ANOVA, Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry

In this second article of a 2-part series on the official JMP blog, I use analysis of variance (ANOVA) to assess a sample-preparation scheme for quantifying sodium in potato chips.  I illustrate the use of the “Fit Y by X” platform in JMP to implement ANOVA, and I propose an alternative sample-preparation scheme to obtain a sample with a smaller variance.  This article is entitled “Potato Chips and ANOVA, Part 2: Using Analysis of Variance to Improve Sample Preparation in Analytical Chemistry“.

If you haven’t read my first blog post in this series on preparing the data in JMP and using the “Stack Columns” function to transpose data from wide format to long format, check it out!  I presented this topic at the last Vancouver SAS User Group (VanSUG) meeting on Wednesday, November 4, 2015.

My thanks to Arati Mejdal, Louis Valente, and Mark Bailey at JMP for their guidance in writing this 2-part series!  It is a pleasure to be a guest blogger for JMP!

## Potato Chips and ANOVA in Analytical Chemistry – Part 1: Formatting Data in JMP

I am very excited to write again for the official JMP blog as a guest blogger!  Today, the first article of a 2-part series has been published, and it is called “Potato Chips and ANOVA in Analytical Chemistry – Part 1: Formatting Data in JMP“.  This series of blog posts will talk about analysis of variance (ANOVA), sampling, and analytical chemistry, and it uses the quantification of sodium in potato chips as an example to illustrate these concepts.

The first part of this series discusses how to import the data into the JMP and prepare them for ANOVA.  Specifically, it illustrates how the “Stack Columns” function is used to transpose the data from wide format to long format.

I will present this at the Vancouver SAS User Group (VanSUG) meeting later today.

## Odds and Probability: Commonly Misused Terms in Statistics – An Illustrative Example in Baseball

Yesterday, all 15 home teams in Major League Baseball won on the same day – the first such occurrence in history.  CTV News published an article written by Mike Fitzpatrick from The Associated Press that reported on this event.  The article states, “Viewing every game as a 50-50 proposition independent of all others, STATS figured the odds of a home sweep on a night with a full major league schedule was 1 in 32,768.”  (Emphases added)

Screenshot captured at 5:35 pm Vancouver time on Wednesday, August 12, 2015.

Out of curiosity, I wanted to reproduce this result.  This event is an intersection of 15 independent Bernoulli random variables, all with the probability of the home team winning being 0.5.

$P[(\text{Winner}_1 = \text{Home Team}_1) \cap (\text{Winner}_2 = \text{Home Team}_2) \cap \ldots \cap (\text{Winner}_{15}= \text{Home Team}_{15})]$

Since all 15 games are assumed to be mutually independent, the probability of all 15 home teams winning is just

$P(\text{All 15 Home Teams Win}) = \prod_{n = 1}^{15} P(\text{Winner}_i = \text{Home Team}_i)$

$P(\text{All 15 Home Teams Win}) = 0.5^{15} = 0.00003051757$

Now, let’s connect this probability to odds.

It is important to note that

• odds is only applicable to Bernoulli random variables (i.e. binary events)
• odds is the ratio of the probability of success to the probability of failure

For our example,

$\text{Odds}(\text{All 15 Home Teams Win}) = P(\text{All 15 Home Teams Win}) \ \div \ P(\text{At least 1 Home Team Loses})$

$\text{Odds}(\text{All 15 Home Teams Win}) = 0.00003051757 \div (1 - 0.00003051757)$

$\text{Odds}(\text{All 15 Home Teams Win}) = 0.0000305185$

The above article states that the odds is 1 in 32,768.  The fraction 1/32768 is equal to 0.00003051757, which is NOT the odds as I just calculated.  Instead, 0.00003051757 is the probability of all 15 home teams winning.  Thus, the article incorrectly states 0.00003051757 as the odds rather than the probability.

This is an example of a common confusion between probability and odds that the media and the general public often make.  Probability and odds are two different concepts and are calculated differently, and my calculations above illustrate their differences.  Thus, exercise caution when reading statements about probability and odds, and make sure that the communicator of such statements knows exactly how they are calculated and which one is more applicable.

## Mathematics and Applied Statistics Lesson of the Day – Contrasts

A contrast is a linear combination of a set of variables such that the sum of the coefficients is equal to zero.  Notationally, consider a set of variables

$\mu_1, \mu_2, ..., \mu_n$.

Then the linear combination

$c_1 \mu_1 + c_2 \mu_2 + ... + c_n \mu_n$

is a contrast if

$c_1 + c_2 + ... + c_n = 0$.

There is a reason for why I chose to use $\mu$ as the symbol for the variables in the above notation – in statistics, contrasts provide a very useful framework for comparing multiple population means in hypothesis testing.  In a later Statistics Lesson of the Day, I will illustrate some examples of contrasts, especially in the context of experimental design.

## The advantages of using count() to get N-way frequency tables as data frames in R

#### Introduction

I recently introduced how to use the count() function in the “plyr” package in R to produce 1-way frequency tables in R.  Several commenters provided alternative ways of doing so, and they are all appreciated.  Today, I want to extend that tutorial by demonstrating how count() can be used to produce N-way frequency tables in the list format – this will magnify the superiority of this function over other functions like table() and xtabs().

#### 2-Way Frequencies: The Cross-Tabulated Format vs. The List-Format

To get a 2-way frequency table (i.e. a frequency table of the counts of a data set as divided by 2 categorical variables), you can display it in a cross-tabulated format or in a list format.

In R, the xtabs() function is good for cross-tabulation.  Let’s use the “mtcars” data set again; recall that it is a built-in data set in Base R.

> y = xtabs(~ cyl + gear, mtcars)
> y
gear
cyl      3     4     5
4        1     8     2
6        2     4     1
8        12    0     2

## How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

#### Introduction

One feature that I like about R is the ability to access and manipulate the outputs of many functions.  For example, you can extract the kernel density estimates from density() and scale them to ensure that the resulting density integrates to 1 over its support set.

I recently needed to get a frequency table of a categorical variable in R, and I wanted the output as a data table that I can access and manipulate.  This is a fairly simple and common task in statistics and data analysis, so I thought that there must be a function in Base R that can easily generate this.  Sadly, I could not find such a function.  In this post, I will explain why the seemingly obvious table() function does not work, and I will demonstrate how the count() function in the ‘plyr’ package can achieve this goal.

#### The Example Data Set – mtcars

Let’s use the mtcars data set that is built into R as an example.  The categorical variable that I want to explore is “gear” – this denotes the number of forward gears in the car – so let’s view the first 6 observations of just the car model and the gear.  We can use the subset() function to restrict the data set to show just the row names and “gear”.

> head(subset(mtcars, select = 'gear'))
gear
Mazda RX4            4
Mazda RX4 Wag        4
Datsun 710           4
Hornet 4 Drive       3
Valiant              3

## Exploratory Data Analysis – All Blog Posts on The Chemical Statistician

This series of posts introduced various methods of exploratory data analysis, providing theoretical backgrounds and practical examples.  Fully commented and readily usable R scripts are available for all topics for you to copy and paste for your own analysis!  Most of these posts involve data visualization and plotting, and I include a lot of detail and comments on how to invoke specific plotting commands in R in these examples.

I will return to this blog post to add new links as I write more tutorials.

Useful R Functions for Exploring a Data Frame

The 5-Number Summary – Two Different Methods in R

Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

Quantile-Quantile Plots for New York’s Ozone Pollution Data

Kernel Density Estimation and Rug Plots in R on Ozone Data in New York and Ozonopolis

2 Ways of Plotting Empirical Cumulative Distribution Functions in R

Conceptual Foundations of Empirical Cumulative Distribution Functions

Combining Box Plots and Kernel Density Plots into Violin Plots for Ozone Pollution Data

Kernel Density Estimation – Conceptual Foundations

Variations of Box Plots in R for Ozone Concentrations in New York City and Ozonopolis

Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

The advantages of using count() to get N-way frequency tables as data frames in R

## Performing Logistic Regression in R and SAS

#### Introduction

My statistics education focused a lot on normal linear least-squares regression, and I was even told by a professor in an introductory statistics class that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression.  Unfortunately, that advice has turned out to vastly underestimate the variety and depth of problems that I have encountered in statistical consulting, and the emphasis on linear regression has not paid dividends in my statistics career so far.  Wisdom from veteran statisticians and my own experience combine to suggest that logistic regression is actually much more commonly used in industry than linear regression.  I have already started a series of short lessons on binary classification in my Statistics Lesson of the Day and Machine Learning Lesson of the Day.    In this post, I will show how to perform logistic regression in both R and SAS.  I will discuss how to interpret the results in a later post.

#### The Data Set

The data set that I will use is slightly modified from Michael Brannick’s web page that explains logistic regression.  I copied and pasted the data from his web page into Excel, modified the data to create a new data set, then saved it as an Excel spreadsheet called heart attack.xlsx.

This data set has 3 variables (I have renamed them for convenience in my R programming).

1. ha2  – Whether or not a patient had a second heart attack.  If ha2 = 1, then the patient had a second heart attack; otherwise, if ha2 = 0, then the patient did not have a second heart attack.  This is the response variable.
2. treatment – Whether or not the patient completed an anger control treatment program.
3. anxiety – A continuous variable that scores the patient’s anxiety level.  A higher score denotes higher anxiety.

Read the rest of this post to get the full scripts and view the full outputs of this logistic regression model in both R and SAS!

## Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Chebyshev’s Inequality

In 2 recent Statistics Lessons of the Day, I

Chebyshev’s inequality is just a special version of Markov’s inequality; thus, their motivations and intuitions are similar.

$P[|X - \mu| \geq k \sigma] \leq 1 \div k^2$

Markov’s inequality roughly says that a random variable $X$ is most frequently observed near its expected value, $\mu$.  Remarkably, it quantifies just how often $X$ is far away from $\mu$.  Chebyshev’s inequality goes one step further and quantifies that distance between $X$ and $\mu$ in terms of the number of standard deviations away from $\mu$.  It roughly says that the probability of $X$ being $k$ standard deviations away from $\mu$ is at most $k^{-2}$.  Notice that this upper bound decreases as $k$ increases – confirming our intuition that it is highly improbable for $X$ to be far away from $\mu$.

As with Markov’s inequality, Chebyshev’s inequality applies to any random variable $X$, as long as $E(X)$ and $V(X)$ are finite.  (Markov’s inequality requires only $E(X)$ to be finite.)  This is quite a marvelous result!

## Mathematical and Applied Statistics Lesson of the Day – The Motivation and Intuition Behind Markov’s Inequality

Markov’s inequality may seem like a rather arbitrary pair of mathematical expressions that are coincidentally related to each other by an inequality sign:

$P(X \geq c) \leq E(X) \div c,$ where $c > 0$.

However, there is a practical motivation behind Markov’s inequality, and it can be posed in the form of a simple question: How often is the random variable $X$ “far” away from its “centre” or “central value”?

Intuitively, the “central value” of $X$ is the value that of $X$ that is most commonly (or most frequently) observed.  Thus, as $X$ deviates further and further from its “central value”, we would expect those distant-from-the-centre vales to be less frequently observed.

Recall that the expected value, $E(X)$, is a measure of the “centre” of $X$.  Thus, we would expect that the probability of $X$ being very far away from $E(X)$ is very low.  Indeed, Markov’s inequality rigorously confirms this intuition; here is its rough translation:

As $c$ becomes really far away from $E(X)$, the event $X \geq c$ becomes less probable.

You can confirm this by substituting several key values of $c$.

• If $c = E(X)$, then $P[X \geq E(X)] \leq 1$; this is the highest upper bound that $P(X \geq c)$ can get.  This makes intuitive sense; $X$ is going to be frequently observed near its own expected value.

• If $c \rightarrow \infty$, then $P(X \geq \infty) \leq 0$.  By Kolmogorov’s axioms of probability, any probability must be inclusively between $0$ and $1$, so $P(X \geq \infty) = 0$.  This makes intuitive sense; there is no possible way that $X$ can be bigger than positive infinity.

## The Chi-Squared Test of Independence – An Example in Both R and SAS

#### Introduction

The chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data.  Given 2 categorical random variables, $X$ and $Y$, the chi-squared test of independence determines whether or not there exists a statistical dependence between them.  Formally, it is a hypothesis test with the following null and alternative hypotheses:

$H_0: X \perp Y \ \ \ \ \ \text{vs.} \ \ \ \ \ H_a: X \not \perp Y$

If you’re not familiar with probabilistic independence and how it manifests in categorical random variables, watch my video on calculating expected counts in contingency tables using joint and marginal probabilities.  For your convenience, here is another video that gives a gentler and more practical understanding of calculating expected counts using marginal proportions and marginal totals.

Today, I will continue from those 2 videos and illustrate how the chi-squared test of independence can be implemented in both R and SAS with the same example.

## Applied Statistics Lesson of the Day – The Coefficient of Variation

In my statistics classes, I learned to use the variance or the standard deviation to measure the variability or dispersion of a data set.  However, consider the following 2 hypothetical cases:

1. the standard deviation for the incomes of households in Canada is $2,000 2. the standard deviation for the incomes of the 5 major banks in Canada is$2,000

Even though this measure of dispersion has the same value for both sets of income data, $2,000 is a significant amount for a household, whereas$2,000 is not a lot of money for one of the “Big Five” banks.  Thus, the standard deviation alone does not give a fully accurate sense of the relative variability between the 2 data sets.  One way to overcome this limitation is to take the mean of the data sets into account.

A useful statistic for measuring the variability of a data set while scaling by the mean is the sample coefficient of variation:

$\text{Sample Coefficient of Variation (} \bar{c_v} \text{)} \ = \ s \ \div \ \bar{x},$

where $s$ is the sample standard deviation and $\bar{x}$ is the sample mean.

Analogously, the coefficient of variation for a random variable is

$\text{Coefficient of Variation} \ (c_v) \ = \ \sigma \div \ \mu,$

where $\sigma$ is the random variable’s standard deviation and $\mu$ is the random variable’s expected value.

The coefficient of variation is a very useful statistic that I, unfortunately, never learned in my introductory statistics classes.  I hope that all new statistics students get to learn this alternative measure of dispersion.

## Machine Learning and Applied Statistics Lesson of the Day – Positive Predictive Value and Negative Predictive Value

For a binary classifier,

• its positive predictive value (PPV) is the proportion of positively classified cases that were truly positive.

$\text{PPV} = \text{(Number of True Positives)} \ \div \ \text{(Number of True Positives} \ + \ \text{Number of False Positives)}$

• its negative predictive value (NPV) is the proportion of negatively classified cases that were truly negative.

$\text{NPV} = \text{(Number of True Negatives)} \ \div \ \text{(Number of True Negatives} \ + \ \text{Number of False Negatives)}$

In a later Statistics and Machine Learning Lesson of the Day, I will discuss the differences between PPV/NPV and sensitivity/specificity in assessing the predictive accuracy of a binary classifier.

(Recall that sensitivity and specificity can also be used to evaluate the performance of a binary classifier.  Based on those 2 statistics, we can construct receiver operating characteristic (ROC) curves to assess the predictive accuracy of the classifier, and a minimum standard for a good ROC curve is being better than the line of no discrimination.)

## Video Tutorial – Calculating Expected Counts in a Contingency Table Using Joint Probabilities

In an earlier video, I showed how to calculate expected counts in a contingency table using marginal proportions and totals.  (Recall that expected counts are needed to conduct hypothesis tests of independence between categorical random variables.)  Today, I want to share a second video of calculating expected counts – this time, using joint probabilities.  This method uses the definition of independence between 2 random variables to form estimators of the joint probabilities for each cell in the contingency table.  Once the joint probabilities are estimated, the expected counts are simply the joint probabilities multipled by the grand total of the entire sample.  This method gives a more direct and deeper connection between the null hypothesis of a test of independence and the calculation of expected counts.

I encourage you to watch both of my videos on expected counts in my YouTube channel to gain a deeper understanding of how and why they can be calculated.  Please note that the expected counts are slightly different in the 2 videos due to round-off error; if you want to be convinced about this, I encourage you to do the calculations in the 2 different orders as I presented in the 2 videos – you will eventually see where the differences arise.

You can also watch the video below the fold!

## Video Tutorial – Calculating Expected Counts in Contingency Tables Using Marginal Proportions and Marginal Totals

A common task in statistics and biostatistics is performing hypothesis tests of independence between 2 categorical random variables.  The data for such tests are best organized in contingency tables, which allow expected counts to be calculated easily.  In this video tutorial in my Youtube channel, I demonstrate how to calculate expected counts using marginal proportions and marginal totals.  In a later video, I will introduce a second method for calculating expected counts using joint probabilities and marginal probabilities.

In a later tutorial, I will illustrate how to implement the chi-squared test of independence on the same data set in R and SAS – stay tuned!

You can also watch the video below the fold!

## Mathematics and Applied Statistics Lesson of the Day – The Geometric Mean

Suppose that you invested in a stock 3 years ago, and the annual rates of return for each of the 3 years were

• 5% in the 1st year
• 10% in the 2nd year
• 15% in the 3rd year

What is the average rate of return in those 3 years?

It’s tempting to use the arithmetic mean, since we are so used to using it when trying to estimate the “centre” of our data.  However, the arithmetic mean is not appropriate in this case, because the annual rate of return implies a multiplicative growth of your investment by a factor of $1 + r$, where $r$ is the rate of return in each year.  In contrast, the arithmetic mean is appropriate for quantities that are additive in nature; for example, your average annual salary from the past 3 years is the sum of last 3 annual salaries divided by 3.

If the arithmetic mean is not appropriate, then what can we use instead?  Our saviour is the geometric mean, $G$.  The average factor of growth from the 3 years is

$G = [(1 + r_1)(1 + r_2) ... (1 + r_n)]^{1/n}$,

where $r_i$ is the rate of return in year $i$, $i = 1, 2, 3, ..., n$.  The average annual rate of return is $G - 1$.  Note that the geometric mean is NOT applied to the annual rates of return, but the annual factors of growth.

Returning to our example, our average factor of growth is

$G = [(1 + 0.05) \times (1 + 0.10) \times (1 + 0.15)]^{1/3} = 1.099242$.

Thus, our annual rate of return is $G - 1 = 1.099242 - 1 = 0.099242 = 9.9242\%$.

Here is a good way to think about the difference between the arithmetic mean and the geometric mean.  Suppose that there are 2 sets of numbers.

1. The first set, $S_1$, consists of your data $x_1, x_2, ..., x_n$, and this set has a sample size of $n$.
2. The second, $S_2$,  set also has a sample size of $n$, but all $n$ values are the same – let’s call this common value $y$.
• What number must $y$ be such that the sums in $S_1$ and $S_2$ are equal?  This value of $y$ is the arithmetic mean of the first set.
• What number must $y$ be such that the products in $S_1$ and $S_2$ are equal?  This value of $y$ is the geometric mean of the first set.

Note that the geometric means is only applicable to positive numbers.

## Mathematics and Applied Statistics Lesson of the Day – The Weighted Harmonic Mean

In a previous Statistics Lesson of the Day on the harmonic mean, I used an example of a car travelling at 2 different speeds – 60 km/hr and 40 km/hr.  In that example, the car travelled 120 km at both speeds, so the 2 speeds had equal weight in calculating the harmonic mean of the speeds.

What if the cars travelled different distances at those speeds?  In that case, we can modify the calculation to allow the weight of each datum to be different.  This results in the weighted harmonic mean, which has the formula

$H = \sum_{i = 1}^{n} w_i \ \ \div \ \ \sum_{i = 1}^{n}(w_i \ \div \ x_i)$.

For example, consider a car travelling for 240 kilometres at 2 different speeds and for 2 different distances:

1. 60 km/hr for 100 km
2. 40 km/hr for another 140 km

Then the weighted harmonic mean of the speeds (i.e. the average speed of the whole trip) is

$(100 \text{ km} \ + \ 140 \text{ km}) \ \div \ [(100 \text{ km} \ \div \ 60 \text{ km/hr}) \ + \ (140 \text{ km} \ \div \ 40 \text{ km/hr})]$

$= 46.45 \text{ km/hr}$

Notice that this is exactly the same calculation that we would use if we wanted to calculate the average speed of the whole trip by the formula from kinematics:

$\text{Average Speed} = \Delta \text{Distance} \div \Delta \text{Time}$

## Machine Learning and Applied Statistics Lesson of the Day – The Line of No Discrimination in ROC Curves

After training a binary classifier, calculating its various values of sensitivity and specificity, and constructing its receiver operating characteristic (ROC) curve, we can use the ROC curve to assess the predictive accuracy of the classifier.

A minimum standard for a good ROC curve is being better than the line of no discrimination.  On a plot of

$\text{Sensitivity}$

on the vertical axis and

$1 - \text{Specificity}$

on the horizontal axis, the line of no discrimination is the line that passes through the points

$(\text{Sensitivity} = 0, 1 - \text{Specificity} = 0)$

and

$(\text{Sensitivity} = 1, 1 - \text{Specificity} = 1)$.

In other words, the line of discrimination is the diagonal line that runs from the bottom left to the top right.  This line shows the performance of a binary classifier that predicts the class of the target variable purely by the outcome of a Bernoulli random variable with 0.5 as its probability of attaining the “Success” category.  Such a classifier does not use any of the predictors to make the prediction; instead, its predictions are based entirely on random guessing, with the probabilities of predicting the “Success” class and the “Failure” class being equal.

If we did not have any predictors, then we can rely on only random guessing, and a random variable with the distribution $\text{Bernoulli}(0.5)$ is the best that we can use for such guessing.  If we do have predictors, then we aim to develop a model (i.e. the binary classifier) that uses the information from the predictors to make predictions that are better than random guessing.  Thus, a minimum standard of a binary classifier is having an ROC curve that is higher than the line of no discrimination.  (By “higher“, I mean that, for a given value of $1 - \text{Specificity}$, the $\text{Sensitivity}$ of the binary classifier is higher than the $\text{Sensitivity}$ of the line of no discrimination.)

## Mathematical and Applied Statistics Lesson of the Day – Don’t Use the Terms “Independent Variable” and “Dependent Variable” in Regression

In math and science, we learn the equation of a line as

$y = mx + b$,

with $y$ being called the dependent variable and $x$ being called the independent variable.  This terminology holds true for more complicated functions with multiple variables, such as in polynomial regression.

I highly discourage the use of “independent” and “dependent” in the context of statistics and regression, because these terms have other meanings in statistics.  In probability, 2 random variables $X_1$ and $X_2$ are independent if their joint distribution is simply a product of their marginal distributions, and they are dependent if otherwise.  Thus, the usage of “independent variable” for a regression model with 2 predictors becomes problematic if the model assumes that the predictors are random variables; a random effects model is an example with such an assumption.  An obvious question for such models is whether or not the independent variables are independent, which is a rather confusing question with 2 uses of the word “independent”.  A better way to phrase that question is whether or not the predictors are independent.

Thus, in a statistical regression model, I strongly encourage the use of the terms “response variable” or “target variable” (or just “response” and “target”) for $Y$ and the terms “explanatory variables”, “predictor variables”, “predictors”, “covariates”, or “factors” for $x_1, x_2, .., x_p$.

(I have encountered some statisticians who prefer to reserve “covariate” for continuous predictors and “factor” for categorical predictors.)