[ML] K-means Clustering

16 Sep 2020


Overview

  • Unlike supervised learning, here we do not know the label of each data point.
  • The task is to group the data into clusters so that points in the same cluster share similar properties.
  • A cluster is a collection of points that lie close together in space:
    • Each cluster has a center, and the points near that center are assigned its label.

Mathematics

General

  • N data points \(X = [x_1, x_2, x_3, \dots, x_N] \in \mathbb{R}^{d \times N}\)
  • K < N is the number of clusters.
    • Find the centers \(M = [m_1, m_2, \dots, m_K] \in \mathbb{R}^{d \times K}\)
    • Set \(y = [y_1, y_2, \dots, y_N]\) where \(y_i\) is the label of \(x_i\)
    • If \(x_i\) is assigned to cluster k, then \(y_i = k\)
  • Notation: \(y_{i}^{(k)} \in \left\{0, 1\right\},\ \sum_{k=1}^{K} y_{i}^{(k)} = 1\), which means each point belongs to exactly one cluster.
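
For example, with \(K = 3\) clusters, a point \(x_i\) assigned to cluster 2 has the one-hot label vector \(\left(y_i^{(1)}, y_i^{(2)}, y_i^{(3)}\right) = (0, 1, 0)\).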

Cost function

  • \(m_k\) is the center of cluster k and serves as the representative of the points in that cluster.
  • Each point \(x_i\) in cluster k has error \((x_i - m_k)\).
    • Squared error: \(\left\|x_i-m_k\right\|^2\) (squared Euclidean norm, i.e. squared Euclidean distance)

In summary, the cost function for the whole data set is: \(C(y, m)=\sum_{i=1}^{N}\sum_{j=1}^{K}y_{i}^{(j)}\left\|x_i-m_j\right\|^2\)

And we must find: \(y, m = \underset{y,m}{\arg\min}\;\sum_{i=1}^{N}\sum_{j=1}^{K}y_{i}^{(j)}\left\|x_i-m_j\right\|^2\)
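
As a quick sketch of this cost in code (the helper name kmeans_cost and the integer-label representation are my own assumptions, not part of the derivation), storing labels as integer cluster indices collapses the double sum into a single sum over points:

import numpy as np

def kmeans_cost(X, centers, labels):
    # Distance from each point to its own cluster's center, squared and summed
    return np.sum(np.linalg.norm(X - centers[labels], axis=1) ** 2)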

This is a function of two variables, so the simplest approach, as you learned in high school, is to fix one variable and solve for the other, alternating between the two.

Fix m, find y

Suppose the centers m have been found; find the labels y so that C is minimized (with m fixed).

Because each point belongs only to its closest cluster, assign each \(x_i\) to the nearest center:

\[y_i=\underset{j}{\arg\min}\left\|x_i-m_j\right\|^2\]
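
For example (made-up numbers): with centers \(m_1 = (0, 0)\) and \(m_2 = (5, 5)\), the point \(x_i = (1, 1)\) has squared distances \(2\) and \(32\) to the two centers, so \(y_i = 1\).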

Fix y, find m

Suppose the cluster of every point has been found; find the centers m that minimize the cost (with y fixed).

\[m_j=\underset{m_j}{\arg\min}\sum_{i=1}^{N}y_i^{(j)}\left\|x_i-m_j\right\|^2\]

Take the derivative with respect to \(m_j\):

\[\frac{\partial}{\partial m_j}\sum_{i=1}^{N}y_i^{(j)}\left\|x_i-m_j\right\|^2=2\sum_{i=1}^{N}y_i^{(j)}(m_j-x_i)\]

Set the derivative equal to 0:

\[2\sum_{i=1}^{N}y_i^{(j)}(m_j-x_i)=0\] \[m_j\sum_{i=1}^{N}y_i^{(j)}=\sum_{i=1}^{N}y_i^{(j)}x_i\] \[m_j=\frac{\sum_{i=1}^{N}y_i^{(j)}x_i}{\sum_{i=1}^{N}y_i^{(j)}}\]
  • Numerator: the sum of the data points in cluster j
  • Denominator: the number of data points in cluster j

So \(m_j\) is the mean of the data points in cluster j, which is where the name K-means clustering comes from.
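
As a quick numeric sanity check (the three points are made up for illustration), the mean does give a smaller sum of squared errors than a nearby shifted point:

import numpy as np

pts = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
m = pts.mean(axis=0)  # [1.0, 1.0]

def sq_err(c):
    # Sum of squared Euclidean distances from each point to c
    return np.sum(np.linalg.norm(pts - c, axis=1) ** 2)

print(sq_err(m))                       # ≈ 8.0
print(sq_err(m + np.array([0.5, 0])))  # ≈ 8.75, larger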

Algorithm

Input: the data X and the number of clusters k. Output: the centers M and a label vector for the data points.

  • Step 1: Randomly select k data points as the initial centers.
  • Step 2: Assign each data point to its closest cluster.
  • Step 3: If the label vector does not change after Step 2, stop the algorithm.
  • Step 4: Update each center by taking the mean of all the data points assigned to that cluster in Step 2.
  • Step 5: Go back to Step 2.

Source code in Python

Pick centers (Step 1)

import numpy as np
from scipy.spatial.distance import cdist

def init_centers(X, k):
    # Randomly pick k rows of the data as the initial centers
    return X[np.random.choice(X.shape[0], k, replace=False)]

Assign labels (Step 2)

def assign_labels(X, centers):
    # Pairwise Euclidean distances between points and centers
    DE = cdist(X, centers, 'euclidean')
    # Each point takes the label of its nearest center
    return np.argmin(DE, axis=1)
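
If SciPy is not available, the same distance matrix can be built with NumPy broadcasting alone (a sketch; the name assign_labels_np is my own):

def assign_labels_np(X, centers):
    # Broadcast to shape (N, k, d), then take norms over the coordinate axis
    DE = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(DE, axis=1)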

Update centers (Step 4)

def update_centers(X, labels, k):
    # Allocate the new centers
    centers = np.zeros((k, X.shape[1]))
    for i in range(k):
        # Take all points assigned to cluster i
        Xi = X[labels == i, :]
        # The new center is the mean of those points
        centers[i, :] = np.mean(Xi, axis=0)
    return centers
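
One caveat: if a cluster ends up with no points, np.mean over an empty slice returns NaN. A common remedy, sketched below as my own variant (update_centers_safe is not from the original post), is to re-seed an empty cluster with a random data point:

def update_centers_safe(X, labels, k):
    centers = np.zeros((k, X.shape[1]))
    for i in range(k):
        Xi = X[labels == i, :]
        if Xi.shape[0] == 0:
            # Empty cluster: re-seed with a random data point
            centers[i, :] = X[np.random.choice(X.shape[0])]
        else:
            centers[i, :] = np.mean(Xi, axis=0)
    return centers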

Main K-means

def K_means(X, k):
    centers = init_centers(X, k)
    labels = []
    while True:
        # save pre_labels
        labels_old = labels
        # group data to a clusters
        labels = assign_labels(X, centers)
        # Check the change of centers
        if (np.array_equal(labels_old, labels)):
            break
        # Updates centers
        centers = update_centers(X, labels, k)
    return (centers, labels)
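
A minimal usage sketch (the three Gaussian blobs below are made-up demo data, not from the original post):

np.random.seed(0)
means = [[2, 2], [8, 3], [3, 6]]
cov = [[1, 0], [0, 1]]
X = np.concatenate([np.random.multivariate_normal(m, cov, 100) for m in means])

centers, labels = K_means(X, 3)
print(centers)  # three centers, each near one of the true means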

