Presentation Slides – Finding Patterns in Data with K-Means Clustering in JMP and SAS

My slides on K-means clustering at the Toronto Area SAS Society (TASS) meeting on December 14, 2012, can be found here.

[Image]

This image is a slightly enhanced version of one created by Weston.pace on Wikimedia Commons.

My Presentation on K-Means Clustering

I was very pleased to be invited for the second time by the Toronto Area SAS Society (TASS) to deliver a presentation on machine learning.  (I previously presented on partial least squares regression.)  At its recent meeting on December 14, 2012, I introduced an unsupervised learning technique called K-means clustering.

I first defined clustering as a set of techniques for identifying groups of objects by maximizing a similarity criterion or, equivalently, minimizing a dissimilarity criterion.  I then defined K-means clustering specifically as a clustering technique that uses Euclidean proximity to a group mean as its similarity criterion.  I illustrated how this technique works with a simple 2-dimensional example; you can follow along with this example in the slides by watching the sequence of images as the clusters converge.  As with many other machine learning techniques, some arbitrary decisions need to be made to initialize the algorithm for K-means clustering:

  1. How many clusters should there be?
  2. What is the initial mean of each cluster?

In the slides, I provided some guidelines on how to make these decisions.
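To make the procedure concrete, here is a minimal sketch of K-means clustering in plain Python. It is not taken from my slides or from SAS or JMP; the function names, the toy data, and the choice of randomly picked data points as initial means are all assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Illustrative K-means; returns (means, labels)."""
    rng = random.Random(seed)
    # Decision 1 is the argument k (number of clusters).
    # Decision 2 (assumption for this sketch): use k randomly chosen
    # data points as the initial cluster means.
    means = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: recompute each mean from its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mean
                means[j] = mean_point(members)
    return means, labels


# Hypothetical toy data: two well-separated groups in 2 dimensions.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
means, labels = k_means(points, k=2)
print(means)
print(labels)
```

Changing `seed` changes the initial means, which is a quick way to see how sensitive the result is to the second decision.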

K-means clustering has its limitations, and I raised cautions about when this technique is most appropriate.  Finally, I illustrated how this technique can be implemented in SAS and JMP.  JMP has two particularly good features for K-means clustering:

  • it uses a quantitative measure called the cubic clustering criterion (CCC) to compare different numbers of clusters (answering one of the two long-standing questions about K-means clustering, namely how many clusters to use, with an objective, albeit imperfect, criterion)
  • users can fit several numbers of clusters at once and compare their performance using the CCC
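The idea of comparing several numbers of clusters with one quantitative measure can be sketched in plain Python. Note that this sketch does not compute the CCC and does not reproduce what JMP does; as a simpler stand-in, it uses the total within-cluster sum of squared distances. The function names, the toy data, and the use of the first k points as initial means are assumptions made for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def fit_means(points, k, max_iter=100):
    """Illustrative K-means; returns the k cluster means."""
    # Simplifying assumption: the first k points serve as the initial means.
    means = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        # Assignment step: group points by their nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, means[j]))
            clusters[nearest].append(p)
        # Update step: an empty cluster keeps its previous mean.
        new_means = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else means[j]
            for j, members in enumerate(clusters)
        ]
        if new_means == means:
            break  # means stopped moving
        means = new_means
    return means


def within_ss(points, means):
    """Total squared distance from each point to its nearest mean."""
    return sum(min(sq_dist(p, m) for m in means) for p in points)


# Hypothetical toy data: three groups, interleaved so that the first
# three points come from different groups.
points = [
    (1.0, 1.0), (8.0, 8.0), (1.0, 9.0),
    (1.5, 2.0), (9.0, 8.5), (1.5, 8.5),
    (1.0, 1.5), (8.5, 9.0), (2.0, 9.0),
]

# Fit K-means for k = 1..4 and record the criterion for each k.
wss = {k: within_ss(points, fit_means(points, k)) for k in range(1, 5)}
for k in sorted(wss):
    print(k, round(wss[k], 2))
```

On this toy data, the criterion drops sharply up to three clusters and only slightly afterward, which is the kind of pattern one looks for when choosing the number of clusters. Unlike the CCC, this stand-in always favours more clusters, so it has to be read by eye rather than maximized.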

As always, I encourage everybody to attend their local SAS Users Group meetings to learn from and network with other analytics professionals!  To everybody in Toronto: See you at the next TASS meeting!

Your thoughtful comments are much appreciated!