
Evaluating Clustering in Machine Learning – Open Data Science

Practical Application and Comparison of Evaluation Methods

Now, let’s consider a practical example where DBCV outshines the Silhouette score, and discuss some pertinent implications.

Suppose we’re dealing with a dataset composed of customer reviews spanning a variety of products.

The reviews are diverse, resulting in clusters of different shapes and sizes. Our chosen method is KMeans clustering, favored for its simplicity and efficiency.

However, upon using the Silhouette score to evaluate our clustering, we are met with an unexpectedly high score. Sounds like excellent news, doesn’t it? But hold your horses!

Here’s the snag: the Silhouette score tends to prefer convex clusters, whilst our dataset comprises arbitrary-shaped clusters.

For example, consider the following:

[Figure: a dataset of three concentric ring-shaped clusters. Source: https://math.stackexchange.com/a/4139343 (CC BY-SA)]

In the above illustration, we can clearly see three distinct clusters, each represented by its own concentric ring.

If we use something like KMeans, which, like the Silhouette score, favors convex clusters, we get the following result:

[Figure: KMeans slicing the same data into convex clusters that cut across the rings. Source: https://math.stackexchange.com/a/4139343 (CC BY-SA)]

Consequently, despite the elevated score, the clustering outcome from KMeans doesn’t meet our expectations.
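The article’s customer-review dataset isn’t shown, but the effect is easy to reproduce on synthetic data. As a minimal sketch (the concentric-rings setup via scikit-learn’s make_circles is my assumption, standing in for arbitrary-shaped clusters), KMeans earns a positive Silhouette score while barely agreeing with the true grouping:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Two concentric rings stand in for our arbitrary-shaped clusters.
X, y_true = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=42)

# KMeans can only produce convex (Voronoi) partitions, so it slices
# the rings in half instead of separating them.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The Silhouette score still comes out positive...
print("Silhouette:", round(silhouette_score(X, labels), 3))

# ...even though the partition barely matches the true rings
# (Adjusted Rand Index near 0 means essentially random agreement).
print("ARI vs. true rings:", round(adjusted_rand_score(y_true, labels), 3))
```

The Silhouette score rewards the compact, well-separated halves KMeans produces, while the Adjusted Rand Index (which uses the ground truth we happen to have here) exposes how wrong the partition is.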

It’s an archetypal scenario of a misleading metric.

Let’s change tack and employ DBCV for evaluation. Because it is designed for arbitrary-shaped clusters, DBCV gives a far more honest assessment, correctly flagging the KMeans partition as a poor fit. It’s akin to obtaining a second opinion from a reliable source.

Using a technique that doesn’t favor convex clusters, in this example Spectral clustering, we get a much more realistic result:

[Figure: Spectral clustering correctly recovering the ring-shaped clusters. Source: https://math.stackexchange.com/a/4139343 (CC BY-SA)]
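On the same synthetic rings (again, a toy stand-in I’m assuming, not the article’s own data), Spectral clustering with a nearest-neighbours affinity follows the shape of the data instead of imposing convex boundaries:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

X, y_true = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=42)

# A k-nearest-neighbours affinity graph connects points along each ring,
# so the spectral embedding separates the rings cleanly.
spectral = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", n_neighbors=10, random_state=0
)
labels = spectral.fit_predict(X)

# Agreement with the true ring assignment (1.0 would be perfect).
print("ARI vs. true rings:", round(adjusted_rand_score(y_true, labels), 3))
```

The choice of affinity matters here: the default RBF kernel can still blur the two rings together, whereas the nearest-neighbours graph respects the density structure of the data.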

Note: the plots above were retrieved from the public Math Stack Exchange answer cited in the captions. I strongly recommend reading that answer, since it provides a really solid foundation for understanding this concept.

DBCV’s advantages don’t end here, though.

One of its standout characteristics is its efficacy when ground truth labels aren’t at hand.

In many real-world situations, we simply don’t have ground truth labels. For instance, in our customer reviews dataset, we have no prior knowledge of how the reviews ought to be grouped. In such cases, DBCV comes to our aid, offering a dependable, label-free evaluation of our clustering.



