Efficient Clustering with Limited Distance Information

Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, S. Teng, Yu Xia

Published 2010 in Conference on Uncertainty in Artificial Intelligence

ABSTRACT

Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume access to one-versus-all queries that, given a point s in S, return the distances between s and all other points. We show that, given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our algorithm to cluster proteins by sequence similarity. This setting fits our model well because a fast sequence database search program can query one sequence against an entire dataset. We conduct an empirical study showing that, even though we query only a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.
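The landmark idea in the abstract can be illustrated with a small sketch. The abstract does not spell out the selection rule or the clustering step, so the code below is not the paper's algorithm: it swaps in a plain farthest-first traversal to pick landmarks and then labels each point by its nearest landmark. The names `make_one_vs_all`, `landmark_cluster`, and `first` are hypothetical, and the Euclidean oracle merely stands in for an unknown metric. What it does show is the query budget: exactly k one-versus-all queries, one per landmark.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def make_one_vs_all(points: List[Point]) -> Callable[[int], List[float]]:
    """Hypothetical oracle: given an index, return distances to every point.

    Euclidean distance is an assumption made for this illustration only.
    """
    def query(i: int) -> List[float]:
        return [math.dist(points[i], p) for p in points]
    return query


def landmark_cluster(
    n: int, k: int, query: Callable[[int], List[float]], first: int = 0
) -> Tuple[List[int], List[int]]:
    """Choose k landmarks farthest-first and label points by nearest landmark.

    Simplified stand-in, not the paper's method. Issues exactly k queries.
    """
    landmarks = [first]
    min_dist = list(query(first))  # distance from each point to its nearest landmark
    labels = [0] * n               # index (into landmarks) of the nearest landmark
    while len(landmarks) < k:
        # Next landmark: the point farthest from all landmarks chosen so far.
        nxt = max(range(n), key=lambda i: min_dist[i])
        row = query(nxt)
        j = len(landmarks)
        landmarks.append(nxt)
        for i in range(n):
            if row[i] < min_dist[i]:
                min_dist[i] = row[i]
                labels[i] = j
    return landmarks, labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
    lms, labs = landmark_cluster(len(pts), 3, make_one_vs_all(pts))
    print(lms, labs)
```

Only the k rows of distances returned by the oracle are ever inspected, so the full pairwise distance matrix is never built.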

PUBLICATION RECORD

Publication year
2010
Venue
Conference on Uncertainty in Artificial Intelligence
Publication date
2010-07-08
Fields of study
Mathematics, Computer Science
Identifiers
arXiv 1009.5168
Source metadata
Semantic Scholar

REFERENCES

The Pfam protein families database
2011 influential reference
Stability Yields a PTAS for k-Median and k-Means Clustering
2010 cited by this paper
Approximate clustering without the approximation
2009 influential reference
Streaming k-means approximation
2009 cited by this paper
The Pfam protein families database
2007 cited by this paper
Sublinear-time approximation algorithms for clustering via random sampling
2007 cited by this paper
k-means++: the advantages of careful seeding
2007 cited by this paper
A framework for statistical clustering with constant time approximation algorithms for K-median and K-means clustering
2007 cited by this paper
The Effectiveness of Lloyd-Type Methods for the k-Means Problem
2006 cited by this paper
A divide-and-merge methodology for clustering
2005 influential reference
Spectral clustering of protein sequences
2003 influential reference
Virtual landmarks for the internet
2003 cited by this paper
Performance guarantees for hierarchical clustering
2002 cited by this paper
Sublinear time approximate clustering
2001 cited by this paper
Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships.
1998 cited by this paper
SCOP: a structural classification of proteins database for the investigation of sequences and structures.
1995 cited by this paper
Basic local alignment search tool.
1990 influential reference
Clustering to Minimize the Maximum Intercluster Distance
1985 cited by this paper

CITED BY

Accelerating data-driven algorithm selection for combinatorial partitioning problems
2024 cites this paper
“Intelligent Heuristics Are the Future of Computing”
2023 cites this paper
Sketch-based Community Detection via Representative Node Sampling
2021 cites this paper
On the Error Resistance of Hinge Loss Minimization
2020 cites this paper
Scalable and Robust Community Detection With Randomized Sketching
2018 cites this paper
Randomized Robust Matrix Completion for the Community Detection Problem
2018 cites this paper
Nash Equilibria in Perturbation-Stable Games
2017 cites this paper
Approximate Clustering with Same-Cluster Queries
2017 cites this paper
Approximate Correlation Clustering Using Same-Cluster Queries
2017 cites this paper
Scalable Algorithms for Data and Network Analysis
2016 cites this paper
Approximate Greedy Clustering and Distance Selection for Graph Metrics
2015 cites this paper
Center Based Clustering: A Foundational Perspective
2014 influential citation
Weighted Graph Clustering with Non-Uniform Uncertainties
2014 cites this paper
Active transitivity clustering of large-scale biomedical datasets
2014 cites this paper
Beyond Worst-Case Analysis in Privacy and Clustering: Exploiting Explicit and Implicit Assumptions
2013 cites this paper
Approximation Algorithms and New Models for Clustering and Learning
2013 cites this paper
Clustering under approximation stability
2013 cites this paper
Clustering Protein Sequences Given the Approximation Stability of the Min-Sum Objective Function
2011 cites this paper
Min-sum Clustering of Protein Sequences with Limited Distance Information
2011 cites this paper
Why Do We Want a Good Ratio Anyway? Approximation Stability and Proxy Objectives
2011 cites this paper
Clustering Partially Observed Graphs via Convex Optimization
2011 cites this paper
Nash Equilibria in Perturbation Resilient Games
2010 cites this paper
Clustering with or without the approximation
2010 cites this paper
Stability Yields a PTAS for k-Median and k-Means Clustering
2010 cites this paper
Approximate clustering without the approximation
2009 cites this paper