
# 9. Supervised Learning: Practice and Theory of Classification with the k-NN Rule

Published in:
Introduction to HPC with MPI for Data Science

## Abstract

The k-NN classification rule labels a new query observation q from a test set by taking the majority class among the k nearest neighbors of q in the training set. The performance of a classifier is evaluated by its F-score, the harmonic mean of the precision rate and the recall rate. This yields a single quality value that accounts for the four cases that can occur when classifying observations into one of two classes (true/false positives and true/false negatives). In statistical machine learning, no classifier can beat the optimal Bayes' error (the minimal probability of error), and the 1-NN rule asymptotically guarantees an error at most twice the Bayes' error. Since nearest neighbor queries are decomposable queries, meaning that \(\mathrm {NN}_k(q,X_1\cup X_2)=\mathrm {NN}_k(q,\mathrm {NN}_k(q,X_1)\cup \mathrm {NN}_k(q,X_2))\), the k-NN classification rule is easily parallelized on a distributed memory architecture such as a computer cluster. One drawback of the k-NN rule is that it needs to store the entire training set in order to classify new observations.
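
The rule itself fits in a few lines. Below is a minimal serial sketch, not the chapter's own code: it assumes 1-D points, Euclidean distance, and a hypothetical `knn_classify` helper that returns the majority label among the k nearest training points.

```cpp
// Hypothetical sketch of the k-NN classification rule (1-D points).
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <map>
#include <vector>

struct Labeled { double x; int label; };  // a training point with its class

int knn_classify(const std::vector<Labeled>& train, double q, int k) {
  // Order training-point indices by distance to the query q;
  // only the first k positions need to be fully sorted.
  std::vector<int> idx(train.size());
  for (size_t i = 0; i < idx.size(); ++i) idx[i] = static_cast<int>(i);
  std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                    [&](int a, int b) {
                      return std::fabs(train[a].x - q) < std::fabs(train[b].x - q);
                    });
  // Majority vote among the k nearest neighbors.
  std::map<int, int> votes;
  for (int i = 0; i < k; ++i) ++votes[train[idx[i]].label];
  return std::max_element(votes.begin(), votes.end(),
                          [](const auto& a, const auto& b) {
                            return a.second < b.second;
                          })->first;
}

int main() {
  std::vector<Labeled> train = {{0.1, 0}, {0.2, 0}, {0.9, 1}, {1.1, 1}, {0.4, 0}};
  std::printf("class = %d\n", knn_classify(train, 0.95, 3));  // expected: 1
  return 0;
}
```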
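To make the evaluation measure explicit: writing TP, FP, and FN for the counts of true positives, false positives, and false negatives, the precision, recall, and F-score are

\[
\mathrm{precision} = \frac{TP}{TP+FP}, \qquad
\mathrm{recall} = \frac{TP}{TP+FN}, \qquad
F = \frac{2\,\mathrm{precision}\times\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}.
\]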
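The decomposability property is what makes the distributed version straightforward: each processor answers the query on its own chunk of the training set, and the partial answers are merged by re-selecting the k smallest distances. The following is a toy MPI sketch of that pattern under assumptions not in the abstract (1-D points, synthetic per-rank data, a fixed query), not the book's implementation.

```cpp
// Hypothetical sketch: distributed k-NN via decomposable queries.
// Each rank computes the k nearest local points to the query q, then the
// root gathers all candidate distances and keeps the k smallest overall.
#include <mpi.h>
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int k = 3;
  const double q = 0.5;  // 1-D query point, for simplicity

  // Toy local training chunk: deterministic per-rank values (assumption).
  std::vector<double> local(100);
  for (int i = 0; i < 100; ++i) local[i] = std::sin(rank * 100 + i);

  // Local k-NN: the k smallest distances from q within this rank's chunk.
  std::vector<double> dist(local.size());
  for (size_t i = 0; i < local.size(); ++i) dist[i] = std::fabs(local[i] - q);
  std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
  dist.resize(k);

  // Gather every rank's k candidate distances at the root.
  std::vector<double> all(rank == 0 ? k * size : 0);
  MPI_Gather(dist.data(), k, MPI_DOUBLE,
             all.data(), k, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    // Merge step: the global k-NN are the k smallest among all candidates.
    std::partial_sort(all.begin(), all.begin() + k, all.end());
    for (int i = 0; i < k; ++i)
      std::printf("global NN %d at distance %f\n", i + 1, all[i]);
  }
  MPI_Finalize();
  return 0;
}
```

Built with an MPI compiler wrapper (e.g. `mpicxx`) and run over several ranks, this reproduces the merge identity stated above: reducing each partition to its k local candidates loses no global nearest neighbor.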