Clustering of binary data is central to various applications, particularly in the fields of medical diagnostics, chemistry, and chemoinformatics. However, standard similarity measures often fail to capture the informative value of rare features and matching absences, treating all attributes as equally relevant. This can lead to suboptimal clustering, especially when informative patterns are hidden in low‐frequency features. This study proposes a probability‐weighted approach to measuring similarity, which gives more weight to rare features and accounts for the value of shared absences based on their occurrence probabilities. We analyze how this adjustment impacts clustering results, using visual comparisons and experiments on real datasets. The results show consistent gains in clustering precision and stability compared to standard measures. Our findings suggest that incorporating the rarity of features into similarity computation can offer a more reliable basis for clustering binary data, especially in domains where rare signals carry meaningful information.
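The abstract does not give the weighting formula, so the following is only a minimal sketch of one way a probability-weighted similarity of this kind could be built, not the authors' measure. It assumes that a shared presence of feature j is weighted by -log(p_j), a shared absence by -log(1 - p_j), and a mismatch by the average of the two, where p_j is the smoothed frequency of feature j in the dataset. The function name `rarity_weighted_similarity` and all of these weighting choices are hypothetical.

```python
import numpy as np


def rarity_weighted_similarity(X):
    """Pairwise similarity for binary rows, weighting matches by rarity.

    Assumption (not from the paper): weights are information contents,
    -log(p_j) for a shared 1 and -log(1 - p_j) for a shared 0.
    """
    X = np.asarray(X, dtype=float)
    n, _ = X.shape

    # Smoothed occurrence probability of each feature, kept strictly in (0, 1).
    p = (X.sum(axis=0) + 1.0) / (n + 2.0)

    w_present = -np.log(p)        # large when a feature is rare
    w_absent = -np.log1p(-p)      # large when a feature is almost always present
    w_mismatch = 0.5 * (w_present + w_absent)  # hypothetical mismatch penalty

    A = 1.0 - X  # absence indicator

    # Weighted counts over features for every pair of rows.
    both_present = (X * w_present) @ X.T
    both_absent = (A * w_absent) @ A.T
    mismatch = (X * w_mismatch) @ A.T + (A * w_mismatch) @ X.T

    matched = both_present + both_absent
    return matched / (matched + mismatch)


if __name__ == "__main__":
    # Toy data: feature 0 is common, feature 1 is rare, feature 2 is in between.
    X = np.array([
        [1, 1, 1],
        [1, 1, 0],
        [1, 0, 1],
        [1, 0, 0],
        [1, 0, 1],
        [0, 0, 0],
    ])
    S = rarity_weighted_similarity(X)
    print(np.round(S, 3))
```

In the toy data, rows 0 and 1 share the rare feature while rows 2 and 3 share only its absence; both pairs have two matches and one mismatch, so an unweighted simple matching coefficient would score them equally, whereas this sketch scores the first pair higher. The resulting matrix could be converted to a distance (for example 1 - S) before being passed to a clustering routine.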
Enhancing Similarity Measures for Binary Data in Clustering: The Role of Rare Events and Matching Absences
Tânia F. G. G. Cova, Alberto A. C. C. Pais
Published 2025 in Journal of Chemometrics
PUBLICATION RECORD
- Publication date
2025-01-01
- Source metadata
Semantic Scholar