Unicorn: A Unified Multi-Tasking Matching Model

Ju Fan, Jianhong Tu, Guoliang Li, Peng Wang, Xiaoyong Du, Xiaofeng Jia, Song Gao, Nan Tang

Published 2024 in SIGMOD Record

ABSTRACT

Data matching, which decides whether two data elements (e.g., string, tuple, column, or knowledge graph entity) are the "same" (a.k.a. a match), is a key concept in data integration. The prevailing practice is to build task-specific or even dataset-specific solutions, which are hard to generalize and preclude sharing knowledge learned across datasets and tasks. In this paper, we propose Unicorn, a unified model that supports common data matching tasks. Building such a unified model is challenging because input data elements come in heterogeneous formats and different tasks have different matching semantics. To address these challenges, Unicorn employs one generic Encoder that converts any pair of data elements (a, b) into a learned representation, and uses a Matcher, a binary classifier, to decide whether a matches b. To align the matching semantics of multiple tasks, Unicorn adopts a mixture-of-experts model that enhances the learned representation. We conduct extensive experiments using 20 datasets on 7 well-studied data matching tasks, and find that our unified model achieves better performance on most tasks and on average than state-of-the-art task-specific models trained separately for individual tasks and datasets. Moreover, Unicorn can also serve new matching tasks well with zero-shot learning.
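
The Encoder, mixture-of-experts, and Matcher pipeline described in the abstract can be illustrated with a minimal, untrained sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the hashed bag-of-tokens encoder (a stand-in for a learned encoder), the dimensions, the dense softmax gate, the random weights, and all names such as `ToyUnifiedMatcher`, `encode`, and `enhance`.

```python
import math
import random
import zlib

DIM = 16          # size of the toy pair representation (assumption)
NUM_EXPERTS = 3   # number of experts (assumption)


def serialize(a, b):
    """Flatten a pair of data elements into one lowercase token sequence."""
    return f"{a} [SEP] {b}".lower().split()


def encode(a, b):
    """Stand-in Encoder: hashed bag-of-tokens vector, L2-normalized."""
    vec = [0.0] * DIM
    for tok in serialize(a, b):
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def matvec(m, x):
    """Multiply matrix m (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init_matrix(rows, cols, rng):
    """Random weights; a real model would learn these from labeled pairs."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyUnifiedMatcher:
    """Encoder -> mixture-of-experts -> binary Matcher, with random weights."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.experts = [init_matrix(DIM, DIM, rng) for _ in range(NUM_EXPERTS)]
        self.gate = init_matrix(NUM_EXPERTS, DIM, rng)
        self.w = [rng.uniform(-0.5, 0.5) for _ in range(DIM)]
        self.bias = 0.0

    def gate_weights(self, x):
        """One softmax weight per expert, computed from the pair representation."""
        return softmax(matvec(self.gate, x))

    def enhance(self, x):
        """Gate-weighted sum of expert outputs (the enhanced representation)."""
        weights = self.gate_weights(x)
        outs = [[math.tanh(v) for v in matvec(e, x)] for e in self.experts]
        return [sum(g * out[i] for g, out in zip(weights, outs))
                for i in range(DIM)]

    def match_probability(self, a, b):
        """Matcher: logistic classifier over the enhanced representation."""
        h = self.enhance(encode(a, b))
        logit = sum(wi * hi for wi, hi in zip(self.w, h)) + self.bias
        return 1.0 / (1.0 + math.exp(-logit))


model = ToyUnifiedMatcher()
p = model.match_probability("iPhone 13 128GB", "Apple iPhone 13 (128 GB)")
print(f"untrained match probability: {p:.3f}")
```

Because the weights are random, the printed probability carries no meaning; the sketch only shows how one shared encoder, a gated set of experts, and a single binary classifier fit together so that any pair of elements, whatever its original format, flows through the same model.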

PUBLICATION RECORD

Publication year: 2024
Venue: SIGMOD Record
Publication date: 2024-05-14
Fields of study: Computer Science
DOI: 10.1145/3665252.3665263
Source metadata: Semantic Scholar
