Distributed Representations of Tuples for Entity Resolution

Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq R. Joty, M. Ouzzani, N. Tang

Published 2017 in Proceedings of the VLDB Endowment

ABSTRACT

Despite the efforts in 70+ years in all aspects of entity resolution (ER), there is still a high demand for democratizing ER - by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease-of-use (i.e., much less human effort). We use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available as well as the case where they are not; we present ways to learn and tune the distributed representations that are customized for a specific ER task under different scenarios. We propose a locality sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only a few attributes. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, as well as multi-lingual data) and the extensive experimental results show that DeepER outperforms existing solutions.

PUBLICATION RECORD

Publication year
2017
Venue
Proceedings of the VLDB Endowment
Publication date
2017-10-02
Fields of study
Computer Science
Identifiers
DOI 10.14778/3236187.3236198 arXiv 1710.00597
Source metadata
Semantic Scholar

REFERENCES

Deep Learning for Entity Matching: A Design Space Exploration
2018cited by this paper
Technical Perspective: Toward Building Entity Matching Management Systems
2018influential reference
Data Curation with Deep Learning [Vision]: Towards Self Driving Data Curation
2018cited by this paper
Record Linkage
2018cited by this paper
Generating Concise Entity Matching Rules
2017influential reference
Human-in-the-Loop Challenges for Entity Matching: A Midterm Report
2017cited by this paper
Deep Learning with Python
2017cited by this paper
Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services
2017cited by this paper
Synthesizing Entity Matching Rules by Examples
2017influential reference
Benchmarks for measurement of duplicate detection methods in nucleotide databases
2016cited by this paper
Semantic-Aware Blocking for Entity Resolution
2016cited by this paper
Deep Learning
2016influential reference
Enriching Word Vectors with Subword Information
2016influential reference
Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics
2015cited by this paper
Fine-grained Opinion Mining with Recurrent Neural Networks and Word Embeddings
2015cited by this paper
A Clustering-Based Framework to Control Block Sizes for Entity Resolution
2015cited by this paper
When Are Tree Structures Necessary for Deep Learning of Representations?
2015cited by this paper
Hubness and Pollution: Delving into Cross-Space Mapping for Zero-Shot Learning
2015cited by this paper
How transferable are features in deep neural networks?
2014cited by this paper
GloVe: Global Vectors for Word Representation
2014influential reference
Hashing for Similarity Search: A Survey
2014influential reference
Corleone: hands-off crowdsourcing for entity matching
2014cited by this paper
Distributed Representations of Sentences and Documents
2014cited by this paper
A Comparison of Blocking Methods for Record Linkage
2014cited by this paper
MFIBlocks: An effective blocking algorithm for entity resolution
2013cited by this paper
Distributed Representations of Words and Phrases and their Compositionality
2013influential reference
Efficient Estimation of Word Representations in Vector Space
2013cited by this paper
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
2013cited by this paper
Representation Learning: A Review and New Perspectives
2012cited by this paper
A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication
2012cited by this paper
Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality
2012cited by this paper
CrowdER: Crowdsourcing Entity Resolution
2012cited by this paper
Entity Matching: How Similar Is Similar
2011cited by this paper
Torch7: A Matlab-like Environment for Machine Learning
2011cited by this paper
Natural Language Processing (Almost) from Scratch
2011cited by this paper
Composition in Distributional Models of Semantics
2010cited by this paper
Evaluation of entity resolution approaches on real-world match problems
2010cited by this paper
An Introduction to Duplicate Detection
2010cited by this paper
LSH banding for large-scale retrieval with memory and recall constraints
2009cited by this paper
Data Quality in Data Warehouses
2009cited by this paper
Training selection for tuning entity matching
2008cited by this paper
Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search
2007cited by this paper
Duplicate Record Detection: A Survey
2007cited by this paper
Learning Blocking Schemes for Record Linkage
2006cited by this paper
A Comparison of Fast Blocking Methods for Record Linkage
2003cited by this paper
A Neural Probabilistic Language Model
2003influential reference
Adaptive duplicate detection using learnable string similarity measures
2003influential reference
Interactive deduplication using active learning
2002cited by this paper
Learning to match and cluster large high-dimensional data sets for data integration
2002cited by this paper
Similarity Search in High Dimensions via Hashing
1999influential reference
Bidirectional recurrent neural networks
1997influential reference
Long Short-Term Memory
1997cited by this paper
Learning long-term dependencies with gradient descent is difficult
1994cited by this paper
A Theory for Record Linkage
1969influential reference

CITED BY

Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs
2026cites this paper
ALER: An Active Learning Hybrid System for Efficient Entity Resolution
2026cites this paper
The impact of fine-tuning on entity resolution: An experimental evaluation
2026cites this paper
Graph-Based Vector Search: An Experimental Evaluation of the State-of-the-Art
2025cites this paper
Tailoring the Shapley Value for In-Context Example Selection Towards Data Wrangling
2025cites this paper
Rule-Based Graph Cleaning with GPUs on a Single Machine
2025cites this paper
Structured Multi-Step Reasoning for Entity Matching Using Large Language Model
2025cites this paper
Large Language Models for Data Discovery and Integration: Challenges and Opportunities
2025cites this paper
New Trends in Data Forgetting for Sustainable Data Management
2025cites this paper
RAG-Driven Data Quality Governance for Enterprise ERP Systems
2025cites this paper
3dSAGER: Geospatial Entity Resolution over 3D Objects
2025cites this paper
Large Language Models for Semantic Join: A Comprehensive Survey
2025cites this paper
Description-Similarity Rules: Towards Flexible Feature Engineering for Entity Matching
2025cites this paper
CAFE+: Towards Compact, Adaptive, and Fast Embedding for Large-scale Online Recommendation Models
2025cites this paper
Towards uncertainty-calibrated structural data enrichment with large language model for few-shot entity resolution
2025cites this paper
Data Cleansing Methods for Big Data: A Systematic Review
2025cites this paper
Privacy and Accuracy-Aware AI/ML Model Deduplication
2025cites this paper
TREATS: Fairness-aware entity resolution over streaming data
2025cites this paper
ThriftLLM: On Cost-Effective Selection of Large Language Models for Classification Queries
2025cites this paper
UniClean: A Scalable Data Cleaning Solution for Mixed Errors based on Unified Cleaners and Optimized Cleaning Workflow
2025cites this paper
On the Asymmetrical Nature of Entity Matching Using Pre-Trained Transformers
2025cites this paper
Reliable Low-Resource Entity Resolution Enhanced by Rules
2025cites this paper
Building a Real-Time Identity Matching System Using Deep Learning and Graph Modeling
2025cites this paper
Unveiling Hidden Gems: Enhancing Entity Resolution with a Data Perspective
2025cites this paper
TransClean: Finding False Positives in Multi-Source Entity Matching Under Real-World Conditions via Transitive Consistency
2025cites this paper
SkillBridge: A Job-Course Matching Framework Using Pre-Trained Language Models and Graph Representation
2025cites this paper
Evaluating Methods for Efficient Entity Count Estimation
2025cites this paper
Scaling Entity Resolution with K-Means: A Review of Partitioning Techniques
2025cites this paper
Deduplicated Sampling On-Demand
2025cites this paper
KnowTrans: Boosting Transferability of Data Preparation LLMs via Knowledge Augmentation
2025cites this paper
When GDD meets GNN: A knowledge-driven neural connection for effective entity resolution in property graphs
2025cites this paper
In-context Clustering-based Entity Resolution with Large Language Models: A Design Space Exploration
2025cites this paper
How to Talk to Language Models: Serialization Strategies for Structured Entity Matching
2025cites this paper
A Deep Dive Into Cross-Dataset Entity Matching with Large and Small Language Models
2025cites this paper
PUER: Boosting Few-shot Positive-Unlabeled Entity Resolution with Reinforcement Learning
2025cites this paper
Leveraging Large Language Models for Generating Labeled Mineral Site Record Linkage Data
2024cites this paper
Accurate Customer Address Matching via Weak Supervision for Geocode Learning
2024cites this paper
Better entity matching with transformers through ensembles
2024cites this paper
Efficient Entity Resolution via Hierarchical Graph Attention and Semantic Blocking
2024cites this paper
Enhancing Deep Entity Resolution with Integrated Blocker-Matcher Training: Balancing Consensus and Discrepancy
2024cites this paper
Enriching Relations with Additional Attributes for ER
2024cites this paper
LRER: A Low-Resource Entity Resolution Framework with Hybrid Information
2024cites this paper
A Critical Re-evaluation of Record Linkage Benchmarks for Learning-Based Matching Algorithms
2024cites this paper
EMBA: Entity Matching using Multi-Task Learning of BERT with Attention-over-Attention
2024cites this paper
Table integration in data lakes unleashed: pairwise integrability judgment, integrable set discovery, and multi-tuple conflict resolution
2024cites this paper
Evaluating Blocking Biases in Entity Matching
2024influential citation
Explaining Entity Matching with Clusters of Words
2024cites this paper
TableDC: Deep Clustering for Tabular Data
2024cites this paper
Learning from Natural Language Explanations for Generalizable Entity Matching
2024cites this paper
StructAM: Enhancing Address Matching through Semantic Understanding of Structure-aware Information
2024cites this paper
Open benchmark for filtering techniques in entity resolution
2024cites this paper
Rock: Cleaning Data by Embedding ML in Logic Rules
2024cites this paper
Matching Feature Separation Network for Domain Adaptation in Entity Matching
2024cites this paper
Building Taxonomies with Triplet Queries
2024cites this paper
UniDM: A Unified Framework for Data Manipulation with Large Language Models
2024cites this paper
Unicorn: A Unified Multi-Tasking Matching Model
2024cites this paper
When GDD meets GNN: A Knowledge-driven Neural Connection for Effective Entity Resolution in Property Graphs
2024cites this paper
HyperBlocker: Accelerating Rule-based Blocking in Entity Resolution using GPUs
2024influential citation
APrompt4EM: Augmented Prompt Tuning for Generalized Entity Matching
2024cites this paper
BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments
2024cites this paper
Making It Tractable to Detect and Correct Errors in Graphs
2024cites this paper
Reducing Biases in Record Matching Through Scores Calibration
2024cites this paper
Using machine learning to link electronic health records in cancer registries: On the tradeoff between linkage quality and manual effort
2024cites this paper
Towards Semantic Layer for Enhancing Blocking Entity Resolution Accuracy in Big Data
2024cites this paper
PromptER: Prompt Contrastive Learning for Generalized Entity Resolution
2024cites this paper
Connected Components for Scaling Partial-order Blocking to Billion Entities
2024cites this paper
Neural Locality Sensitive Hashing for Entity Blocking
2024cites this paper
Leveraging Pretrained Language Models for Enhanced Entity Matching: A Comprehensive Study of Fine-Tuning and Prompt Learning Paradigms
2024cites this paper
Unsupervised Domain Adaptation for Entity Blocking Leveraging Large Language Models
2024cites this paper
Machine Learning for Refining Knowledge Graphs: A Survey
2024influential citation
Dual-Module Feature Alignment Domain Adversarial Model for Entity Resolution
2024cites this paper
Adaptive Target-Consistency Entity Matching Algorithm Based on Semi-Supervised Learning
2024cites this paper
An efficient learning based approach for automatic record deduplication with benchmark datasets
2024cites this paper
Exif2Vec: A Framework to Ascertain Untrustworthy Crowdsourced Images Using Metadata
2024cites this paper
Towards Universal Dense Blocking for Entity Resolution
2024cites this paper
Active in-context learning for cross-domain entity resolution
2024cites this paper
A simple and efficient approach to unsupervised instance matching and its application to linked data of power plants
2024cites this paper
Resolving duplicates in Large Multiple-Choice Questions Repositories
2024cites this paper
A Survey on Knowledge Graph Related Research in Smart City Domain
2024cites this paper
Product Entity Matching via Tabular Data
2023cites this paper
Innovation for improving climate-related data—Lessons learned from setting up a data hub
2023cites this paper
Experimental Analysis of Large-scale Learnable Vector Storage Compression
2023cites this paper
Entity Matching using Large Language Models
2023cites this paper
Selecting Walk Schemes for Database Embedding
2023cites this paper
Secure Cloud-Aided Approximate Nearest Neighbor Search on High-Dimensional Data
2023cites this paper
The Battleship Approach to the Low Resource Entity Matching Problem
2023cites this paper
Research on the mechanism of teaching culture in Civics course based on deep learning model
2023cites this paper
Automatic Data Repair: Are We Ready to Deploy?
2023cites this paper
A Domain-Oriented Entity Alignment Approach Based on Filtering Multi-Type Graph Neural Networks
2023cites this paper
CampER: An Effective Framework for Privacy-Aware Deep Entity Resolution
2023cites this paper
Domain-Generic Pre-Training for Low-Cost Entity Matching via Domain Alignment and Domain Antagonism
2023cites this paper
MultiEM: Efficient and Effective Unsupervised Multi-Table Entity Matching
2023cites this paper
Towards a systemic entrepreneurship activity model
2023cites this paper
An Effective Framework for Enhancing Query Answering in a Heterogeneous Data Lake
2023cites this paper
Towards new-generation human-centric smart manufacturing in Industry 5.0: A systematic review
2023cites this paper
Extracting Graphs Properties with Semantic Joins
2023cites this paper
Data cleaning and machine learning: a systematic literature review
2023influential citation
Soft Target-Enhanced Matching Framework for Deep Entity Matching
2023cites this paper
Matching Roles from Temporal Data: Why Joe Biden is not only President, but also Commander-in-Chief
2023cites this paper
Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration
2023cites this paper