Enhancing Similarity Measures for Binary Data in Clustering: The Role of Rare Events and Matching Absences

Published 2025 in Journal of Chemometrics

ABSTRACT

Clustering of binary data is central to various applications, particularly in the fields of medical diagnostics, chemistry, and chemoinformatics. However, standard similarity measures often fail to capture the informative value of rare features and matching absences, treating all attributes as equally relevant. This can lead to suboptimal clustering, especially when informative patterns are hidden in low‐frequency features. This study proposes a probability‐weighted approach to measuring similarity, which gives more weight to rare features and accounts for the value of shared absences based on their occurrence probabilities. We analyze how this adjustment impacts clustering results, using visual comparisons and experiments on real datasets. The results show consistent gains in clustering precision and stability compared to standard measures. Our findings suggest that incorporating the rarity of features into similarity computation can offer a more reliable basis for clustering binary data, especially in domains where rare signals carry meaningful information.
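The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea and not the authors' measure. It assumes that a shared presence of feature j is weighted by its surprisal -log(p_j), a shared absence by -log(1 - p_j), and a mismatch by the mean of the two, where p_j is the occurrence probability estimated from the data. The function names `feature_probabilities` and `rarity_weighted_similarity`, the clamping constant `eps`, and the mismatch penalty are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def feature_probabilities(data: Sequence[Sequence[int]], eps: float = 1e-9) -> List[float]:
    """Estimate each feature's occurrence probability as its column mean."""
    n_rows = len(data)
    n_cols = len(data[0])
    probs = []
    for j in range(n_cols):
        p = sum(row[j] for row in data) / n_rows
        # Clamp away from 0 and 1 so the logarithms below stay finite.
        probs.append(min(max(p, eps), 1.0 - eps))
    return probs


def rarity_weighted_similarity(x: Sequence[int], y: Sequence[int],
                               probs: Sequence[float]) -> float:
    """Weighted matching similarity in [0, 1] (assumed form, see lead-in)."""
    agree = 0.0
    disagree = 0.0
    for xj, yj, p in zip(x, y, probs):
        w_presence = -math.log(p)        # large when the feature is rare
        w_absence = -math.log(1.0 - p)   # large when the feature is common
        if xj == 1 and yj == 1:
            agree += w_presence
        elif xj == 0 and yj == 0:
            agree += w_absence
        else:
            # Assumed mismatch penalty: mean of the two surprisals.
            disagree += 0.5 * (w_presence + w_absence)
    total = agree + disagree
    return agree / total if total > 0 else 1.0


# Hypothetical probabilities: feature 0 is common, feature 1 is rare.
probs = [0.9, 0.1, 0.5]
# Both pairs agree on 2 of 3 features, so unweighted matching scores them equally.
sim_common = rarity_weighted_similarity([1, 0, 1], [1, 0, 0], probs)  # expected agreements
sim_rare = rarity_weighted_similarity([0, 1, 1], [0, 1, 0], probs)    # surprising agreements
print(sim_common, sim_rare)
```

Under these assumptions, the second pair scores higher because it shares a rare presence and the absence of a feature that is usually present, whereas the first pair agrees only where agreement is expected.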

PUBLICATION RECORD

Publication year: 2025
Venue: Journal of Chemometrics
Publication date: 2025-01-01
Fields of study: Not labeled
Identifiers: DOI 10.1002/cem.70061
Source metadata: Semantic Scholar
