A Generalized Single Linkage Method for Estimating the Cluster Tree of a Density

W. Stuetzle,R. Nugent

Published 2010 in Journal of Computational and Graphical Statistics

ABSTRACT

The goal of clustering is to detect the presence of distinct groups in a dataset and assign group labels to the observations. Nonparametric clustering is based on the premise that the observations may be regarded as a sample from some underlying density in feature space and that groups correspond to modes of this density. The goal then is to find the modes and assign each observation to the domain of attraction of a mode. The modal structure of a density is summarized by its cluster tree; modes of the density correspond to leaves of the cluster tree. Estimating the cluster tree is the primary goal of nonparametric cluster analysis. We adopt a plug-in approach to cluster tree estimation: estimate the cluster tree of the feature density by the cluster tree of a density estimate. For some density estimates the cluster tree can be computed exactly; for others we have to be content with an approximation. We present a graph-based method that can approximate the cluster tree of any density estimate. Density estimates tend to have spurious modes caused by sampling variability, leading to spurious branches in the graph cluster tree. We propose excess mass as a measure for the size of a branch, reflecting the height of the corresponding peak of the density above the surrounding valley floor as well as its spatial extent. Excess mass can be used as a guide for pruning the graph cluster tree. We point out mathematical and algorithmic connections to single linkage clustering and illustrate our approach on several examples. Supplemental materials for the article, including an R package implementing generalized single linkage clustering, all datasets used in the examples, and R code producing the figures and numerical results, are available online.

PUBLICATION RECORD

  • Publication year

    2010

  • Venue

    Journal of Computational and Graphical Statistics

  • Publication date

    2010-01-01

  • Fields of study

    Mathematics, Computer Science

  • Identifiers
  • External record

    Open on Semantic Scholar

  • Source metadata

    Semantic Scholar

CITATION MAP

EXTRACTION MAP

CLAIMS

  • No claims are published for this paper.

CONCEPTS

  • No concepts are published for this paper.

REFERENCES

Showing 1-26 of 26 references · Page 1 of 1

CITED BY

Showing 1-100 of 145 citing papers · Page 1 of 2