Machine Learning Lesson of the Day – Memory-Based Learning

Memory-based learning (also called instance-based learning) is a type of non-parametric algorithm that compares new test data with training data in order to solve the given machine learning problem.  Such algorithms search for the training data that are most similar to the test data and make predictions based on these similarities.  (From what I have learned, memory-based learning is used for supervised learning only.  Can you think of any memory-based algorithms for unsupervised learning?)

A distinguishing feature of memory-based learning is its storage of the entire training set.  This is computationally costly, especially if the training set is large – the storage itself is costly, and the complexity of the model grows with a larger data set.  However, it is advantageous because it uses less assumptions than parametric models, so it is adaptable to problems for which the assumptions may fail and no clear pattern is known ex ante.  (In contrast, parametric models like linear regression make generalizations about the training data; after building a model to predict the targets, the training data are discarded, so there is no need to store them.)  Thus, I recommend using memory-based learning algorithms when the data set is relatively small and there is no prior knowledge or information about the underlying patterns in the data.

Two classic examples of memory-based learning are K-nearest neighbours classification and K-nearest neighbours regression.

Machine Learning Lesson of the Day: The K-Nearest Neighbours Classifier

The K-nearest neighbours (KNN) classifier is a non-parametric classification technique that classifies an input X by

  1. identifying the K data (the K “neighbours”) in the training set that are closest to X
  2. counting the number of “neighbours” that belong to each class of the target variable
  3. classifying X by the most common class to which its neighbours belong

K is usually an odd number to avoid resolving ties.

The proximity of the neighbours to X is usually defined by Euclidean distance.

Validation or cross-validation can be used to determine the best number of “K”.