CharBERT: Character-aware Pre-trained Language Model

Wentao Ma, Yiming Cui, Chenglei Si, Ting Liu, Shijin Wang, Guoping Hu

Published 2020 in International Conference on Computational Linguistics

ABSTRACT

Most pre-trained language models (PLMs) construct word representations at subword level with Byte-Pair Encoding (BPE) or its variations, by which OOV (out-of-vocab) words are almost avoidable. However, those methods split a word into subword units and make the representation incomplete and fragile. In this paper, we propose a character-aware pre-trained language model named CharBERT improving on the previous methods (such as BERT, RoBERTa) to tackle these problems. We first construct the contextual word embedding for each token from the sequential character representations, then fuse the representations of characters and the subword representations by a novel heterogeneous interaction module. We also propose a new pre-training task named NLM (Noisy LM) for unsupervised character representation learning. We evaluate our method on question answering, sequence labeling, and text classification tasks, both on the original datasets and adversarial misspelling test sets. The experimental results show that our method can significantly improve the performance and robustness of PLMs simultaneously.
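
The abstract mentions adversarial misspelling test sets and a noisy-input pre-training task but does not say how the noise is produced. As a rough illustration only, the sketch below assumes three common character edits (drop, add, swap) applied inside a word; the function names, the length threshold, and the edit set are hypothetical and are not taken from the paper.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def add_char_noise(word, rng):
    """Apply one random character edit to a word (assumed edit set: drop/add/swap)."""
    if len(word) < 4:
        return word  # assumption: leave very short words untouched
    op = rng.choice(["drop", "add", "swap"])
    i = rng.randrange(1, len(word) - 1)  # interior index, so the first character is kept
    if op == "drop":
        return word[:i] + word[i + 1:]
    if op == "add":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    # swap the characters at positions i and i + 1
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def noisy_sentence(sentence, p, rng):
    """Perturb each whitespace-separated word independently with probability p."""
    return " ".join(
        add_char_noise(w, rng) if rng.random() < p else w
        for w in sentence.split()
    )


if __name__ == "__main__":
    rng = random.Random(0)
    print(noisy_sentence("character aware language models handle misspelled words", 0.5, rng))
```

A single edit like this usually changes how a subword tokenizer segments the word, which is the fragility the abstract describes; a character-level view of the word changes by only one position.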

PUBLICATION RECORD

Publication year
2020
Venue
International Conference on Computational Linguistics
Publication date
2020-11-03
Fields of study
Computer Science
Identifiers
DOI 10.18653/v1/2020.coling-main.4 arXiv 2011.01513
Source metadata
Semantic Scholar

REFERENCES

Benchmarking Robustness of Machine Reading Comprehension Models
2020cited by this paper
Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency
2019cited by this paper
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
2019cited by this paper
Semantics-aware BERT for Language Understanding
2019cited by this paper
Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment
2019cited by this paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2019influential reference
ERNIE: Enhanced Language Representation with Informative Entities
2019cited by this paper
Language Models are Unsupervised Multitask Learners
2019cited by this paper
Combating Adversarial Misspellings with Robust Word Recognition
2019cited by this paper
XLNet: Generalized Autoregressive Pretraining for Language Understanding
2019cited by this paper
RoBERTa: A Robustly Optimized BERT Pretraining Approach
2019influential reference
Contextual String Embeddings for Sequence Labeling
2018influential reference
Neural Network Acceptability Judgments
2018cited by this paper
Character-Level Models versus Morphology in Semantic Role Labeling
2018cited by this paper
Morphosyntactic Tagging with a Meta-BiLSTM Model over Context Sensitive Token Encodings
2018cited by this paper
Know What You Don’t Know: Unanswerable Questions for SQuAD
2018influential reference
Improving Language Understanding by Generative Pre-Training
2018cited by this paper
Deep Contextualized Word Representations
2018cited by this paper
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
2018influential reference
Generating Natural Language Adversarial Examples
2018cited by this paper
On Adversarial Examples for Character-Level Neural Machine Translation
2018influential reference
Learned in Translation: Contextualized Word Vectors
2017cited by this paper
Synthetic and Natural Noise Both Break Neural Machine Translation
2017cited by this paper
Adversarial Examples for Evaluating Reading Comprehension Systems
2017cited by this paper
Gaussian Error Linear Units (GELUs)
2016cited by this paper
Bidirectional Attention Flow for Machine Comprehension
2016cited by this paper
Words or Characters? Fine-grained Gating for Reading Comprehension
2016cited by this paper
Layer Normalization
2016cited by this paper
Fully Character-Level Neural Machine Translation without Explicit Segmentation
2016cited by this paper
SQuAD: 100,000+ Questions for Machine Comprehension of Text
2016influential reference
Neural Machine Translation of Rare Words with Subword Units
2015cited by this paper
Convolutional Neural Networks for Sentence Classification
2014cited by this paper
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
2014cited by this paper
Generating Text with Recurrent Neural Networks
2011cited by this paper
Automatically Constructing a Corpus of Sentential Paraphrases
2005cited by this paper
Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
2003influential reference
Word-level Textual Adversarial Attacking as Combinatorial Optimization
year unknowncited by this paper

CITED BY

Aviation accident report causality extraction based on transformer with BERT-embeddings
2026cites this paper
DeFiAD: A Unified Method for Early-Stage Domain Abuse Detection Through Automated Deep Feature Interaction
2025cites this paper
Note-Bert: Medical report classification method based on feature fusion and memory enhancement
2025cites this paper
MFFURL: Multi-modal feature fusion-based approach for malicious URL detection
2025cites this paper
Local and Global: Multi-View Modeling for Duplicate Question Detection
2025cites this paper
IP-Augmented Multi-Modal Malicious URL Detection Via Token-Contrastive Representation Enhancement and Multi-Granularity Fusion
2025cites this paper
Decopy: detect and correct with pinyin for Chinese spelling correction
2025cites this paper
OCR-Assisted Masked BERT for Homoglyph Restoration towards Multiple Phishing Text Downstream Tasks
2025cites this paper
chDzDT: Word-level morphology-aware language model for Algerian social media text
2025cites this paper
Robust scene text understanding with OCR token and word alignment for Text-VQA and text-caption
2025cites this paper
TASE: Token Awareness and Structured Evaluation for Multilingual Language Models
2025cites this paper
Tabular context-aware optical character recognition and tabular data reconstruction for historical records
2025cites this paper
SENA: Leveraging set-level consistency adversarial learning for robust pre-trained language model adaptation
2025cites this paper
Big Five Personality Trait Prediction Based on User Comments
2025cites this paper
Using External knowledge to Enhanced PLM for Semantic Matching
2025cites this paper
Mitigating Forgetting in Adapting Pre-trained Language Models to Text Processing Tasks via Consistency Alignment
2025cites this paper
CARLDA: An Approach for Stack Overflow API Mention Recognition Driven by Context and LLM‐Based Data Augmentation
2025cites this paper
AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation
2025cites this paper
KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications
2025influential citation
SPSY: a semantic synthesis framework for lexical sememe prediction and its applications
2025cites this paper
EMK-KEN: A High-Performance Approach for Assessing Knowledge Value in Citation Network
2025cites this paper
LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems
2025cites this paper
Enhancing zero-shot relation extraction with a dual contrastive learning framework and a cross-attention module
2024cites this paper
BanglaTLit: A Benchmark Dataset for Back-Transliteration of Romanized Bangla
2024influential citation
TempCharBERT: Keystroke Dynamics for Continuous Access Control Based on Pre-trained Language Models
2024cites this paper
From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes
2024cites this paper
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
2024cites this paper
Comateformer: Combined Attention Transformer for Semantic Sentence Matching
2024cites this paper
QuickCharNet: An Efficient URL Classification Framework for Enhanced Search Engine Optimization
2024cites this paper
Continuous multi-task pre-training for malicious URL detection and webpage classification
2024cites this paper
Knowledge of Pretrained Language Models on Surface Information of Tokens
2024cites this paper
DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification
2024cites this paper
Word and Character Semantic Fusion by Pretrained Language Models for Text Classification
2024cites this paper
Product Matching with Two-Branch Neural Network Embedding
2024cites this paper
Advancing Post-OCR Correction: A Comparative Study of Synthetic Data
2024influential citation
MambaByte: Token-free Selective State Space Model
2024cites this paper
Beyond Binary Gender Labels: Revealing Gender Bias in LLMs through Gender-Neutral Name Predictions
2024cites this paper
KEHRL: Learning Knowledge-Enhanced Language Representations with Hierarchical Reinforcement Learning
2024cites this paper
Large Language Models for Cyber Security: A Systematic Literature Review
2024cites this paper
DePNR: A DeBERTa‐based deep learning model with complete position embedding for place name recognition from geographical literature
2024cites this paper
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language Understanding
2024cites this paper
Adversarial Training with OCR modality Perturbation for Scene-Text Visual Question Answering
2024cites this paper
Analysis of Pre-trained Language Models in Text Classification for Use in Spanish Medical Records Anonymization
2023cites this paper
Explainability-Based Mix-Up Approach for Text Data Augmentation
2023influential citation
READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises
2023influential citation
Elementwise Language Representation
2023cites this paper
IPA-CLIP: Integrating Phonetic Priors into Vision and Language Pretraining
2023cites this paper
Generation-based Code Review Automation: How Far Are We?
2023cites this paper
From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding
2023influential citation
People and Places of Historical Europe: Bootstrapping Annotation Pipeline and a New Corpus of Named Entities in Late Medieval Texts
2023cites this paper
SCAT: Robust Self-supervised Contrastive Learning via Adversarial Training for Text Classification
2023cites this paper
GeoTPE: A neural network model for geographical topic phrases extraction from literature based on BERT enhanced with relative position embedding
2023cites this paper
Enhancing OCR Performance through Post-OCR Models: Adopting Glyph Embedding for Improved Correction
2023cites this paper
Optimized Tokenization for Transcribed Error Correction
2023cites this paper
An Empirical Analysis Towards Replacing Vocabulary-Rigid Embeddings by a Vocabulary-Free Mechanism
2023cites this paper
Text Rendering Strategies for Pixel Language Models
2023cites this paper
Learning Mutually Informed Representations for Characters and Subwords
2023cites this paper
PyraTrans: Attention-Enriched Pyramid Transformer for Malicious URL Detection
2023cites this paper
Consonant is all you need: a compact representation of English text for efficient NLP
2023cites this paper
A Study on the Relevance of Generic Word Embeddings for Sentence Classification in Hepatic Surgery
2023cites this paper
Aligning Word Embeddings from BERT to Vocabulary-Free Representations
2023cites this paper
TransURL: Improving malicious URL detection with multi-layer Transformer encoding and multi-scale pyramid features
2023cites this paper
PMANet: Malicious URL detection via post-trained language model guided multi-level feature attention network
2023cites this paper
KinyaBERT: a Morphology-aware Kinyarwanda Language Model
2022cites this paper
Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models
2022cites this paper
Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost
2022cites this paper
Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages
2022cites this paper
Artificial Intelligence for the Metaverse: A Survey
2022cites this paper
An Assessment of the Impact of OCR Noise on Language Models
2022cites this paper
A Survey of Pretrained Language Models Based Text Generation
2022cites this paper
Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation
2022cites this paper
CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation
2022cites this paper
Adapting vs. Pre-training Language Models for Historical Languages
2022cites this paper
Local Byte Fusion for Neural Machine Translation
2022cites this paper
Distributed Text Representations Using Transformers for Noisy Written Language
2022cites this paper
MockingBERT: A Method for Retroactively Adding Resilience to NLP Models
2022cites this paper
MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling
2022cites this paper
Sense-aware BERT and Multi-task Fine-tuning for Multimodal Sentiment Analysis
2022cites this paper
From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA
2022cites this paper
A review on abusive content automatic detection: approaches, challenges and opportunities
2022cites this paper
Learning Chinese Word Embeddings By Discovering Inherent Semantic Relevance in Sub-characters
2022cites this paper
Proactive Detection of Query-based Adversarial Scenarios in NLP Systems
2022cites this paper
Continuous Prompt Tuning Based Textual Entailment Model for E-commerce Entity Typing
2022cites this paper
Revisiting Pre-trained Language Models and their Evaluation for Arabic Natural Language Understanding
2022cites this paper
Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold
2022cites this paper
Interpreting Character Embeddings With Perceptual Representations: The Case of Shape, Sound, and Color
2022cites this paper
Down and Across: Introducing Crossword-Solving as a New NLP Benchmark
2022cites this paper
Word-Level Representation From Bytes For Language Modeling
2022cites this paper
Revisiting Pre-trained Language Models and their Evaluation for Arabic Natural Language Processing
2022cites this paper
Language Modelling with Pixels
2022cites this paper
Simulation d’erreurs d’OCR dans les systèmes de TAL pour le traitement de données anachroniques (Simulation of OCR errors in NLP systems for processing anachronistic data)
2022cites this paper
CancerBERT: a cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records
2022cites this paper
Analogy Generation by Prompting Large Language Models: A Case Study of InstructGPT
2022cites this paper
Integrating Approaches to Word Representation
2021cites this paper
AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing
2021influential citation
LadRa-Net: Locally Aware Dynamic Reread Attention Net for Sentence Semantic Matching
2021cites this paper
Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information
2021cites this paper
Intérêt des modèles de caractères pour la détection d’événements (The interest of character-level models for event detection)
2021cites this paper
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
2021cites this paper
Evaluating Various Tokenizers for Arabic Text Classification
2021cites this paper