An empirical comparison between stochastic and deterministic centroid initialisation for K-means variations

Avgoustinos Vouros, S. Langdell, Mike Croucher, E. Vasilaki

Published 2019 in Machine-mediated learning

ABSTRACT

K-Means is one of the most widely used algorithms for data clustering and the usual clustering method for benchmarking. Despite its wide application, it is well known to suffer from a series of disadvantages: it is only able to find local minima, and the positions of the initial clustering centres (centroids) can greatly affect the clustering solution. Over the years, many K-Means variations and initialisation techniques have been proposed, with different degrees of complexity. In this study we focus on common K-Means variations along with a range of deterministic and stochastic initialisation techniques. We show that, on average, more sophisticated initialisation techniques alleviate the need for complex clustering methods. Furthermore, deterministic methods perform better than stochastic methods. However, there is a trade-off: less sophisticated stochastic methods, executed multiple times, can result in better clustering. Factoring in execution time, deterministic methods can be competitive and result in a good clustering solution. These conclusions are obtained through extensive benchmarking using a range of synthetic model generators and real-world data sets.
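
The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the paper's benchmark code: the function names (`lloyd`, `random_init`, `maxmin_init`, `best_of_restarts`) are hypothetical, the toy data are invented for illustration, and farthest-first (maxmin) seeding is used here only as one simple example of a deterministic initialisation, not necessarily one of the methods the study evaluates. The sketch compares a single stochastic run, the best of several stochastic restarts, and one deterministic run by their sum of squared errors (SSE).

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centroids, max_iter=100):
    """Plain Lloyd iterations from the given centroids; returns (centroids, sse)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[j].append(p)
        # Update step: move each centroid to its cluster mean.
        # An empty cluster keeps its previous centroid.
        new = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new == centroids:
            break
        centroids = new
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


def random_init(points, k, rng):
    """Stochastic initialisation: k distinct data points chosen at random."""
    return rng.sample(points, k)


def maxmin_init(points, k):
    """Deterministic farthest-first seeding (an assumption for illustration).

    Starts from the data point closest to the overall mean, then repeatedly
    adds the point farthest from the centroids chosen so far.
    """
    mean = tuple(sum(col) / len(points) for col in zip(*points))
    chosen = [min(points, key=lambda p: sq_dist(p, mean))]
    while len(chosen) < k:
        chosen.append(max(points, key=lambda p: min(sq_dist(p, c) for c in chosen)))
    return chosen


def best_of_restarts(points, k, restarts, seed=0):
    """Run Lloyd from several random initialisations and keep the lowest SSE."""
    rng = random.Random(seed)
    runs = (lloyd(points, random_init(points, k, rng)) for _ in range(restarts))
    return min(runs, key=lambda r: r[1])


# Toy data (invented): three 2-D Gaussian blobs of 30 points each.
data_rng = random.Random(42)
centres = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in centres
    for _ in range(30)
]

_, sse_single = best_of_restarts(points, 3, restarts=1)
_, sse_multi = best_of_restarts(points, 3, restarts=10)
_, sse_det = lloyd(points, maxmin_init(points, 3))
print("single random run SSE:", sse_single)
print("best of 10 random runs SSE:", sse_multi)
print("deterministic maxmin run SSE:", sse_det)
```

Because both calls to `best_of_restarts` use the same seed, the ten-restart result can never be worse than the single run; the cost is roughly ten times the execution time, whereas the deterministic seeding needs only one run.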

PUBLICATION RECORD

Publication year: 2019
Venue: Machine-mediated learning
Publication date: 2019-08-26
Fields of study: Mathematics, Computer Science
Identifiers: DOI 10.1007/s10994-021-06021-7; arXiv 1908.09946
Source metadata: Semantic Scholar

REFERENCES

A semi-supervised sparse K-Means algorithm
2020, cited by this paper
dbscan: Fast Density-Based Clustering with R
2019, cited by this paper
How much can k-means be improved by using better initialization and repeats?
2019, influential reference
K-means properties on six clustering benchmark datasets
2018, influential reference
Robust and sparse k-means clustering for high-dimensional data
2017, influential reference
An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data
2017, influential reference
A Comparison of Latent Class, K-Means, and K-Median Methods for Clustering Dichotomous Data
2017, cited by this paper
RSKC: An R Package for a Robust and Sparse K-Means Clustering Algorithm
2016, cited by this paper
Clustering by Fast Search and Find of Density Peaks
2016, cited by this paper
Understanding the K-Medians Problem
2015, cited by this paper
Detailed classification of swimming paths in the Morris Water Maze: multiple strategies within one trial
2015, cited by this paper
Density K-means: A new algorithm for centers initialization for K-means
2015, influential reference
Clustering by fast search and find of density peaks
2014, cited by this paper
R: A language and environment for statistical computing.
2014, cited by this paper
Active subclustering
2014, cited by this paper
A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm
2012, influential reference
Data reduction for weighted and outlier-resistant clustering
2012, influential reference
Internal versus External cluster validation indexes
2011, cited by this paper
A Framework for Feature Selection in Clustering
2010, influential reference
Python: the tutorial
2009, cited by this paper
Robust partitional clustering by outlier and density insensitive seeding
2009, influential reference
Data clustering: 50 years beyond K-means
2008, influential reference
Approaches to working in high-dimensional data spaces: gene expression microarrays
2008, cited by this paper
k-means++: the advantages of careful seeding
2007, influential reference
Determining the Number of Clusters Using the Weighted Gap Statistic
2007, influential reference
Iterative shrinking method for clustering problems
2006, influential reference
Integrating constraints and metric learning in semi-supervised clustering
2004, cited by this paper
A Dynamic local search algorithm for the clustering problem
2002, influential reference
LOF: identifying density-based local outliers
2000, influential reference
Estimating the number of clusters in a dataset via the gap statistic
2000, influential reference
An empirical comparison of four initialization methods for the K-Means algorithm
1999, influential reference
A new initialization technique for generalized Lloyd iteration
1994, influential reference
An Empirical Assessment of Algorithms for Constructing a Minimum Spanning Tree
1992, influential reference
Breakdown Points of Affine Equivariant Estimators of Multivariate Location and Covariance Matrices
1991, influential reference
Finding Groups in Data: An Introduction to Cluster Analysis
1990, influential reference
Silhouettes: a graphical aid to the interpretation and validation of cluster analysis
1987, influential reference
Clustering to Minimize the Maximum Intercluster Distance
1985, influential reference
A k-means clustering algorithm
1979, influential reference
Clustering Algorithms
1975, influential reference
Some methods for classification and analysis of multivariate observations
1967, influential reference
Multidimensional group analysis
1966, cited by this paper
Hartigan’s K-Means Versus Lloyd’s K-Means – Is It Time for a Change? (Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence)
year unknown, influential reference

CITED BY

Revealing nested archetypes of cropland abandonment based on social-ecological system theory
2026, cites this paper
A Stratified Seed Selection Algorithm for K-Means Clustering on Big Data
2025, cites this paper
Balanced seed selection for K-means clustering with determinantal point process
2025, cites this paper
Dosing trajectories of antihypertensive agents among preterm neonates: A retrospective, cross-sectional analysis
2025, cites this paper
Clustering public hospitals based on crisp and fuzzy clustering techniques and probabilistic fuzzy efficiency estimates
2025, cites this paper
ParDP: A Parallel Density Peaks-Based Clustering Algorithm
2025, cites this paper
Noise-aware celestial clustering for hot topic detection from microblog datasets with not well-separated topics
2024, cites this paper
Deep behavioural representation learning reveals risk profiles for malignant ventricular arrhythmias
2024, cites this paper
Time-constrained Gaussian mixture model for clustering multi-modal chemical process data
2024, cites this paper
Development of E-Tourism to Achieve Excellence and Sustainable Development in Tourism: Ha’il Region Case Study
2024, cites this paper
Exposing and explaining fake news on-the-fly
2024, cites this paper
Discriminative Dimension Selection for Enhancing the Interpretability and Performance of Clustering Output
2024, cites this paper
Improving Clustering Accuracy of K-Means and Random Swap by an Evolutionary Technique Based on Careful Seeding
2023, cites this paper
The Impact of Digitization to Ensure Competitiveness of the Ha’il Region to Achieve Sustainable Development Goals
2023, cites this paper
SKIFF: Spherical K-means with iterative feature filtering for text document clustering
2023, cites this paper
Performance of a K-Means Algorithm Driven by Careful Seeding
2023, cites this paper
On k-means iterations and Gaussian clusters
2023, cites this paper
Two Medoid-Based Algorithms for Clustering Sets
2023, cites this paper
An Efficient Algorithm for Clustering Sets
2023, cites this paper
A Study of Deep Fuzzy Clustering Method Based on Maximum Entropy Clustering
2023, cites this paper
Performance of Parallel K-Means Algorithms in Java
2022, influential citation
Analysis of EEG microstates to predict epileptic seizures in an online approach
2022, cites this paper
Parallel random swap: An efficient and reliable clustering algorithm in Java
2022, cites this paper
Initializing FWSA K-Means With Feature Level Constraints
2022, cites this paper
Fitting a collider in a quantum computer: tackling the challenges of quantum machine learning for big datasets
2022, cites this paper
Efficient and Reliable Clustering by Parallel Random Swap Algorithm
2022, cites this paper
Strategies discovery in the active allothetic place avoidance task
2022, cites this paper
A semi-supervised sparse K-Means algorithm
2020, influential citation