General
The K-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems.
The KNN algorithm assumes that similar things exist in close proximity; in other words, similar examples lie near each other in feature space.
The entire training dataset is stored. When a prediction is required, the k most similar records in the training dataset to the new record are located, and from these neighbors a summarized prediction is made.
KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with some mathematics we might have learned in our childhood — calculating the distance between points on a graph.
There are other ways of calculating distance, and one may be preferable depending on the problem we are solving, but Euclidean distance is a popular and familiar choice. More detail here.
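As a quick illustration, here is a minimal sketch of the Euclidean distance between two points, computed both by hand and with NumPy (the two points are made up for the example):

import numpy as np

# Two hypothetical 2-D points
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared coordinate differences
manual = np.sqrt(np.sum((a - b) ** 2))

# np.linalg.norm computes the same thing (the 2-norm of the difference vector)
print(manual)                  # 5.0
print(np.linalg.norm(a - b))   # 5.0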
There is no model to speak of other than holding the entire training dataset. Because no work is done until a prediction is required, KNN is often referred to as a lazy learning method.
K-Nearest Neighbors Algorithm
Step by step
- Initialize K to your chosen number of neighbors
- For each example in the data:
  - Calculate the distance between the query example and the current example from the data.
  - Add the distance and the index of the example to an ordered collection.
- Sort the ordered collection of distances and indices from smallest to largest by the distances.
- Pick the first K entries from the sorted collection (excluding the query point itself, in case it appears in the training data).
- Get the labels of the selected K entries.
- Result:
  - If regression, return the mean of the K labels.
  - If classification, return the mode of the K labels.
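As a quick check of that last step, here is a tiny sketch with made-up neighbor labels and values showing what the two summaries return:

import statistics

neighbor_labels = ["A", "A", "B"]    # hypothetical class labels of the K = 3 neighbors
neighbor_values = [2.0, 3.0, 4.0]    # hypothetical target values of the same neighbors

print(statistics.mode(neighbor_labels))   # classification: mode -> "A"
print(statistics.mean(neighbor_values))   # regression: mean -> 3.0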
Source code with Python
Data preprocessing
I highly recommend that you preprocess (scale) the data first. This keeps all features on a comparable scale, so the distance calculation is simpler and is not dominated by features with very large values.
Make sure you read this post before reading below.
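One simple option (a sketch, not the only preprocessing choice) is min-max scaling each feature to the [0, 1] range; this assumes the data is a 2-D NumPy array of numeric features:

import numpy as np

def min_max_scale(X):
    # Scale each feature (column) to the [0, 1] range
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Avoid division by zero for constant columns
    span = np.where(col_max - col_min == 0, 1, col_max - col_min)
    return (X - col_min) / span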
Step 1: Get the neighbors
import numpy as np

# Locate the most similar neighbors
def get_neighbors(train, point, k):
    distances = list()
    indices = np.arange(len(train))
    for row in train:
        # np.linalg.norm computes the Euclidean distance (2-norm) by default
        distances.append(np.linalg.norm(point - row))
    # Pair each distance with its index and sort by distance, smallest first
    distances = sorted(zip(distances, indices))
    # Start from 1 because the query point itself appears in train at distance 0
    neighbors = [distances[i][1] for i in range(1, k + 1)]
    return neighbors
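A quick usage sketch with a small made-up 2-D training set (the values are purely illustrative):

X_train = np.array([[1.0, 1.0],
                    [1.5, 2.0],
                    [5.0, 5.0],
                    [6.0, 5.5]])

# The 2 nearest neighbors of the first training point (excluding itself)
# are the points at indices 1 and 2
print(get_neighbors(X_train, X_train[0], k=2))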
Step 2: Predictions
# Depending on regression or classification; here is classification
import numpy as np
from scipy import stats

def prediction_classification(X_train, y_train, k):
    labels = []
    for point in X_train:
        neighbors = get_neighbors(X_train, point, k)
        # Get the y value of each neighbor via the indices returned by get_neighbors
        label_i = [y_train[index] for index in neighbors]
        # Mode (most common label) of the K neighbors; np.atleast_1d keeps this
        # working whether stats.mode returns a scalar or a length-1 array
        most = np.atleast_1d(stats.mode(label_i).mode)[0]
        labels.append(most)
    return labels
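A quick usage sketch on made-up data, plus a regression counterpart that returns the mean of the K neighbors' values instead of the mode (both reuse get_neighbors above):

# Toy data: two small clusters with labels 0 and 1 (purely illustrative)
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
                    [5.0, 5.0], [6.0, 5.5], [5.5, 6.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Predicts each training point's label from its 2 nearest neighbors: 0, 0, 0, 1, 1, 1
print(prediction_classification(X_train, y_train, k=2))

# Regression counterpart: mean of the K neighbors' target values
def prediction_regression(X_train, y_train, k):
    values = []
    for point in X_train:
        neighbors = get_neighbors(X_train, point, k)
        value_i = [y_train[index] for index in neighbors]
        values.append(np.mean(value_i))
    return values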
Choosing the right value of K
To select the K that is right for your data, run the KNN algorithm several times with different values of K and choose the K that reduces the number of errors you encounter (a sketch of this search appears after the list below).
Keep in mind:
- As we decrease the value of K to 1, our predictions become less stable.
- Inversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging and thus, more likely to make more accurate predictions (up to a certain point). Eventually, we begin to witness an increasing number of errors. It is at this point we know we have pushed the value of K too far.
- For binary classification, K is usually made an odd number so that the majority vote cannot tie.
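A minimal sketch of that search, reusing the functions above; here the error count is taken over the training points themselves (each point's own entry is already excluded inside get_neighbors), though a held-out validation set or cross-validation is the more common choice in practice:

import numpy as np

def choose_k(X_train, y_train, k_values):
    best_k, best_errors = None, None
    for k in k_values:
        preds = prediction_classification(X_train, y_train, k)
        # Number of points whose predicted label disagrees with the true label
        errors = int(np.sum(np.array(preds) != np.array(y_train)))
        if best_errors is None or errors < best_errors:
            best_k, best_errors = k, errors
    return best_k

# Hypothetical call, trying the odd values of K from 1 to 11:
# best_k = choose_k(X_train, y_train, k_values=range(1, 12, 2))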
Advantages
- The algorithm is simple and easy to implement.
- There’s no need to build a model, tune several parameters, or make additional assumptions.
- The algorithm is versatile. It can be used for classification, regression, and search.
Disadvantages
- The algorithm gets significantly slower as the number of examples and/or variables increases.
- KNN is very sensitive to noise when K is small.