Distributed Representations of Tuples for Entity Resolution

Muhammad Ebraheem,Saravanan Thirumuruganathan,Shafiq R. Joty,M. Ouzzani,N. Tang

Published 2017 in Proceedings of the VLDB Endowment

ABSTRACT

Despite the efforts in 70+ years in all aspects of entity resolution (ER), there is still a high demand for democratizing ER - by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representations of words ( a.k.a . word embeddings), we present a novel ER system, called D eep ER, that achieves good accuracy, high efficiency, as well as ease-of-use ( i.e ., much less human efforts). We use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short term memory (LSTM) hidden units, to convert each tuple to a distributed representation ( i.e ., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available as well the case where they are not; we present ways to learn and tune the distributed representations that are customized for a specific ER task under different scenarios. We propose a locality sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only a few attributes. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, as well as multi-lingual data) and the extensive experimental results show that D eep ER outperforms existing solutions.

PUBLICATION RECORD

CITATION MAP

EXTRACTION MAP

CLAIMS

  • No claims are published for this paper.

CONCEPTS

  • No concepts are published for this paper.

REFERENCES

Showing 1-55 of 55 references · Page 1 of 1

CITED BY

Showing 1-100 of 328 citing papers · Page 1 of 4