Selecting the Number of Clusters in K-Means Clustering

Ankita Banerji
4 min read · Jan 18, 2021


K-means clustering is an unsupervised algorithm. In unsupervised learning we are not making predictions, since there is no target/output variable. Several factors can affect the quality of the final clusters formed by K-means, so we must keep the following in mind when solving business problems with the algorithm.

1. Number of clusters (K): The number of clusters you want to group your data points into has to be predefined.

2. Initial values/seeds: The choice of the initial cluster centres can affect the final cluster formation (see the sketch after this list).

3. Outliers: Cluster formation is very sensitive to outliers. Outliers pull cluster centres towards themselves, distorting the optimal clustering.

4. Distance measures: Different distance measures (used to calculate the distance between a data point and a cluster centre) can yield different clusters.
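The seed-sensitivity point is easy to demonstrate. Below is a minimal sketch of my own (not part of the original post; the toy data and seed values are arbitrary) that runs scikit-learn's KMeans twice with a single random initialisation each and prints the resulting centroids, which can differ between runs:

import numpy as np
from sklearn.cluster import KMeans

# Toy data: two loose blobs (illustrative only)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(8, 1, (20, 2))])

# Same data, same K, different seeds: with n_init=1 the final centroids
# depend entirely on the initial centres that happen to be drawn.
for seed in (0, 1):
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X)
    print('seed', seed, 'centroids:')
    print(km.cluster_centers_)

In practice, leaving n_init at its scikit-learn default (which keeps the best of several initialisations) mitigates this problem.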

In this blog, we will discuss the most important parameter, the number of clusters (K), and the ways by which we can select a good value for it. There are several methods for finding the best value of K; we will discuss them one by one.

1. Elbow Curve Method

The elbow method runs K-means clustering on the dataset for a range of values of K (say 1 to 10).

  • Perform K-means clustering with each of these values of K. For each value, calculate the sum of squared distances from every data point to its assigned centroid (scikit-learn calls this the inertia).
  • Plot these values against K and find the point where the inertia stops falling sharply (the “elbow”).

Let us see the Python code with the help of an example.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy two-dimensional data set
X1 = [3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8]
X2 = [5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3]
data_frame = pd.DataFrame({'X1': X1, 'X2': X2})  # used by the later snippets

plt.scatter(X1, X2)
plt.show()
Scatter Plot between X1 and X2

Visually we can see that the optimal number of clusters should be around 3. But visualizing the data alone cannot always give the right answer.

Sum_of_squared_distances = []
K = range(1, 10)
for num_clusters in K:
    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(data_frame)
    # inertia_ is the within-cluster sum of squared distances
    Sum_of_squared_distances.append(kmeans.inertia_)

plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Sum of squared distances / Inertia')
plt.title('Elbow Method For Optimal k')
plt.show()
Line plot between K and inertia

The curve looks like an arm bent at the elbow. In the plot above, the elbow is at K = 3 (the point where the sum of squared distances stops falling sharply), indicating that the optimal K for this dataset is 3.
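Reading the elbow off the plot is somewhat subjective. If you prefer to locate it programmatically, one option (an extra step of my own, assuming the third-party kneed package is installed) is sketched below; it reuses K and Sum_of_squared_distances from the snippet above:

# Optional: locate the elbow automatically with the `kneed` package
# (pip install kneed). Inertia is a decreasing, convex curve in K.
from kneed import KneeLocator

kl = KneeLocator(list(K), Sum_of_squared_distances,
                 curve='convex', direction='decreasing')
print('Elbow at K =', kl.elbow)  # should agree with the visual elbow (3 here)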

2. Silhouette Analysis

The silhouette coefficient measures how similar a data point is to its own cluster (cohesion) compared with other clusters (separation).

  • Select a range of values of K (say 2 to 10; the silhouette coefficient is not defined for K = 1).
  • Plot the average silhouette coefficient for each value of K.

The silhouette coefficient for a particular data point i is:

S(i) = (b(i) − a(i)) / max(a(i), b(i))

where:

  • S(i) is the silhouette coefficient of the data point i.
  • a(i) is the average distance between i and all the other data points in the cluster to which i belongs.
  • b(i) is the smallest average distance between i and all the points in any cluster to which i does not belong (i.e., the distance to the nearest neighbouring cluster).
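To make the formula concrete, here is a small sketch of my own (the tiny data set and labels are invented for illustration) that computes a(i), b(i), and S(i) by hand for one point and checks the result against scikit-learn's silhouette_samples:

import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples

# Tiny labelled data set: two well-separated clusters (illustrative only)
X = np.array([[1, 1], [1, 2], [2, 1],   # cluster 0
              [8, 8], [8, 9], [9, 8]])  # cluster 1
labels = np.array([0, 0, 0, 1, 1, 1])

D = pairwise_distances(X)  # pairwise Euclidean distances
i = 0                      # compute the coefficient for the first point

# a(i): mean distance from i to the OTHER points in its own cluster
same = labels == labels[i]
same[i] = False
a_i = D[i, same].mean()

# b(i): smallest mean distance from i to the points of any other cluster
b_i = min(D[i, labels == c].mean() for c in set(labels) - {labels[i]})

s_i = (b_i - a_i) / max(a_i, b_i)
print(s_i, silhouette_samples(X, labels)[i])  # the two values should match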

We then calculate the average silhouette coefficient over all data points for every value of K, and plot this average against K.

Points to remember about the silhouette coefficient:

  • Its value lies in the range [-1, 1].
  • A score of 1 is the best: it means the data point i is very compact within the cluster to which it belongs and far away from the other clusters.
  • The worst value is -1. Values near 0 indicate overlapping clusters.

Let us see the Python code with the help of an example.

range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
silhouette_avg = []
for num_clusters in range_n_clusters:

    # initialise kmeans
    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(data_frame)
    cluster_labels = kmeans.labels_

    # average silhouette score across all data points
    silhouette_avg.append(silhouette_score(data_frame, cluster_labels))

plt.plot(range_n_clusters, silhouette_avg, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k')
plt.show()
Line plot between K and Silhouette score

We see that the silhouette score is maximized at K = 3, so we will take 3 clusters.
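Rather than reading the maximum off the plot, you can also pick it programmatically. A small convenience sketch of my own, reusing range_n_clusters and silhouette_avg from the snippet above:

import numpy as np

# Choose the K with the highest average silhouette score
best_k = range_n_clusters[int(np.argmax(silhouette_avg))]
print('Best K by silhouette score:', best_k)  # 3 for this data set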
