Presentation Slides – Overcoming Multicollinearity and Overfitting with Partial Least Squares Regression in JMP and SAS
February 20, 2013
My slides on partial least squares regression at the Toronto Area SAS Society (TASS) meeting on September 14, 2012, can be found here.
My Presentation on Partial Least Squares Regression
I delivered my first presentation to the Toronto Area SAS Society (TASS) on September 14, 2012. I introduced a supervised learning/predictive modelling technique called partial least squares (PLS) regression. I showed how ordinary linear least squares regression is often problematic when used with big data because of multicollinearity and overfitting, explained how PLS regression overcomes these limitations, and illustrated how to implement it in SAS and JMP. I also highlighted the variable importance for projection (VIP) score, which can be used to conduct variable selection with PLS regression; in particular, I documented its effectiveness for variable selection by comparing some key journal articles on this issue in the academic literature.
The green line is an overfitted classifier. Not only does it model the underlying trend, but it also models the noise (the random variation) at the boundary. It separates the blue and the red dots perfectly for this data set, but it will classify very poorly on a new data set from the same population.
Multicollinearity arises when two or more predictors are strongly correlated with each other; it makes the estimated regression coefficients highly variable, which strips them of their easily interpretable meaning and ruins any attempt at variable selection via stepwise selection. PLS regression avoids this problem by extracting factors that are orthogonal, which ensures that there is no correlation between them. Overfitting occurs if too many predictors are in the model; an overfitted model captures not only the underlying trend in the data, but also the noise. Thus, it usually has very high predictive accuracy on the training set, but much lower predictive accuracy on any other set (e.g. a validation set or test set). Overfitting will definitely happen in ordinary linear least squares regression if there are more predictors than observations, but that is not a necessary condition; it can occur whenever there are more predictors than needed to model the underlying trend. Thus, I spent some time talking about how validation and cross-validation can help to overcome overfitting. Multicollinearity and overfitting are common problems in big data sets, especially those with many predictors. At Predictum, my fellow statistician, Diana Ballard, and I successfully implemented PLS regression to help a world-leading manufacturer of electronics overcome multicollinearity and overfitting and identify the steps in its manufacturing process that caused defects.
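To make the point about orthogonal factors concrete, here is a minimal sketch in plain Python (not the SAS or JMP code from my slides). It assumes a single response and uses the NIPALS deflation scheme for PLS; the simulated data, the helper names (`dot`, `center`, `pls1_scores`, `correlation`) and the choice of two factors are all made up for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(col):
    m = sum(col) / len(col)
    return [x - m for x in col]

def pls1_scores(columns, y, n_factors):
    """Extract PLS score vectors for one response (NIPALS-style deflation).

    columns: list of predictor columns, each a list of n values
    y: list of n response values
    """
    X = [center(c) for c in columns]  # centered predictor columns
    y = center(y)
    scores = []
    for _ in range(n_factors):
        # Weight of each predictor: its covariance with the response.
        w = [dot(c, y) for c in X]
        norm = dot(w, w) ** 0.5
        w = [wj / norm for wj in w]
        # Score: weighted combination of the current (deflated) predictors.
        t = [sum(w[j] * X[j][i] for j in range(len(X))) for i in range(len(y))]
        tt = dot(t, t)
        # Deflate: remove what t explains from each predictor and from y.
        X = [[xi - (dot(c, t) / tt) * ti for xi, ti in zip(c, t)] for c in X]
        q = dot(y, t) / tt
        y = [yi - q * ti for yi, ti in zip(y, t)]
        scores.append(t)
    return scores

def correlation(u, v):
    u, v = center(u), center(v)
    return dot(u, v) / (dot(u, u) * dot(v, v)) ** 0.5

# Simulated data (hypothetical): x2 is almost a copy of x1.
random.seed(0)
n = 50
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [a + random.gauss(0, 0.05) for a in x1]
x3 = [random.gauss(0, 1) for _ in range(n)]
y = [a + b + c + random.gauss(0, 0.5) for a, b, c in zip(x1, x2, x3)]

t1, t2 = pls1_scores([x1, x2, x3], y, n_factors=2)
print("corr(x1, x2):", round(correlation(x1, x2), 3))
print("corr(t1, t2):", round(correlation(t1, t2), 6))
```

The two predictors are almost perfectly correlated, yet the two extracted score vectors are uncorrelated up to rounding error, because each factor is built from predictors that have already had the previous factor removed.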
Partial least squares regression can overcome multicollinearity and overfitting as a predictive model, and its variable importance for projection scores can be used for variable selection. These qualities make PLS regression a very attractive predictive modelling technique. TASS has kindly posted my presentation slides for everyone to see, and I encourage you to consult these slides for more information.
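For readers who want to see how a VIP score is put together, here is a small sketch in plain Python of one common definition for a single response: each predictor's squared, normalized weight on a factor is multiplied by the response sum of squares that factor explains, and the result is summed over factors and rescaled by the number of predictors. The function name `vip_scores` and the numbers below are hypothetical; they are not taken from my slides or from any real model.

```python
def vip_scores(weights, explained_ss):
    """Compute VIP scores from PLS weights.

    weights[a][j]: weight of predictor j on factor a
    explained_ss[a]: response sum of squares explained by factor a
    """
    p = len(weights[0])
    total = sum(explained_ss)
    vip = []
    for j in range(p):
        acc = 0.0
        for w, ss in zip(weights, explained_ss):
            norm_sq = sum(v * v for v in w)  # normalize each weight vector
            acc += ss * (w[j] ** 2) / norm_sq
        vip.append((p * acc / total) ** 0.5)
    return vip

# Hypothetical 2-factor model with 4 predictors (made-up numbers).
weights = [[0.7, 0.6, 0.3, 0.1], [0.1, -0.2, 0.9, 0.3]]
explained_ss = [8.0, 2.0]

vip = vip_scores(weights, explained_ss)
for j, v in enumerate(vip):
    print(f"x{j + 1}: VIP = {v:.3f}")
```

Under this definition the squared VIP scores average to 1 across predictors. That is why cut-offs near 1 (some authors use about 0.8) are often used as a rule of thumb for flagging unimportant predictors.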
SAS User Groups
Ever since coming to Toronto, I have found that the best way to network with statisticians, data analysts, and analytics professionals in general is to attend SAS User Group meetings. SAS does a great job of holding these free meetings to allow users of its software to learn from each other through seminars, break-out discussion groups, and challenge problems. These meetings are very professionally conducted and organized by SAS and a group of volunteers, with agendas and speakers planned months in advance. SAS Canada designates one of its employees, Matt Malczewski, specifically to manage its User Groups throughout Canada. Matt is a friendly, diligent, and reliable person who brings much enthusiasm to every User Group meeting, and I highly encourage anybody who wants to get involved in the SAS community in any major city in Canada to get in touch with him. You can follow his adventures throughout his trips to these User Group meetings on his blog, Musings From An Outlier, and on his Twitter feed @Malchew.
I am very fortunate to have been invited repeatedly to share my knowledge about statistics, machine learning, JMP and SAS at the Toronto Area SAS Society (TASS), the largest and most frequent of all of the SAS User Group meetings in Toronto. This is where I have met a wonderful community of analytics professionals in the Greater Toronto Area (GTA) and beyond. In fact, I met my current boss, Wayne Levin, at my first TASS meeting, where he gave a presentation that introduced me to JMP, a software package that our company, Predictum, uses extensively for statistical modelling and data analysis.