Enhancing Similarity Measures for Binary Data in Clustering: The Role of Rare Events and Matching Absences

Tânia F.G.G. Cova,Alberto A. C. C. Pais

Published 2025 in Journal of Chemometrics

ABSTRACT

Clustering of binary data is central to various applications, particularly in the fields of medical diagnostics, chemistry, and chemoinformatics. However, standard similarity measures often fail to capture the informative value of rare features and matching absences, treating all attributes as equally relevant. This can lead to suboptimal clustering, especially when informative patterns are hidden in low‐frequency features. This study proposes a probability‐weighted approach to measuring similarity, which gives more weight to rare features and accounts for the value of shared absences based on their occurrence probabilities. We analyze how this adjustment impacts clustering results, using visual comparisons and experiments on real datasets. The results show consistent gains in clustering precision and stability compared to standard measures. Our findings suggest that incorporating the rarity of features into similarity computation can offer a more reliable basis for clustering binary data, especially in domains where rare signals carry meaningful information.

PUBLICATION RECORD

CITATION MAP

EXTRACTION MAP

CLAIMS

  • No claims are published for this paper.

CONCEPTS

  • No concepts are published for this paper.

REFERENCES

Showing 1-40 of 40 references · Page 1 of 1

CITED BY

  • No citing papers are available for this paper.

Showing 0-0 of 0 citing papers · Page 1 of 1