Due to rapid development of the Internet, recent years have witnessed an explosion in the rate of data generation. Dealing with data at current scales brings up unprecedented challenges. From the algorithmic view point, executing existing linear algorithms in information retrieval and machine learning on such tremendous amounts of data incur intolerable computational and storage costs. To address this issue, there is a growing interest to map data points in large-scale datasets to binary codes. This can significantly reduce the storage complexity of large-scale datasets. However, one of the most compelling reasons for using binary codes or any discrete representation is that they can be used as direct indices into a hash table. Incorporating hash table offers fast query execution; one can look up the nearby buckets in a hash table populated with binary codes to retrieve similar items. Nonetheless, if binary codes are compared in terms of the cosine similarity rather than the Hamming distance, there is no fast exact sequential procedure to find the $K$ closest items to the query other than the exhaustive search. Given a large dataset of binary codes and a binary query, the problem that we address is to efficiently find $K$ closest codes in the dataset that yield the largest cosine similarities to the query. To handle this issue, we first elaborate on the relation between the Hamming distance and the cosine similarity. This allows finding the sequence of buckets to check in the hash table. Having this sequence, we propose a multi-index hashing approach that can increase the search speed up to orders of magnitude in comparison to the exhaustive search and even approximation methods such as LSH. We empirically evaluate the performance of the proposed algorithm on real world datasets.
Cosine Similarity Search with Multi Index Hashing
Published 2016 in arXiv.org
ABSTRACT
PUBLICATION RECORD
- Publication year
2016
- Venue
arXiv.org
- Publication date
2016-09-14
- Fields of study
Computer Science
- Identifiers
- External record
- Source metadata
Semantic Scholar
CITATION MAP
EXTRACTION MAP
CLAIMS
- No claims are published for this paper.
CONCEPTS
- No concepts are published for this paper.
REFERENCES
Showing 1-52 of 52 references · Page 1 of 1
CITED BY
Showing 1-3 of 3 citing papers · Page 1 of 1