Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, Jimmy J. Lin
Published 2019 in arXiv.org
ABSTRACT
In the natural language processing literature, neural networks are becoming increasingly deep and complex. The recent poster child of this trend is the deep language representation model, a family that includes BERT, ELMo, and GPT. These developments have led to the conviction that previous-generation, shallower neural networks for language understanding are obsolete. In this paper, however, we demonstrate that rudimentary, lightweight neural networks can still be made competitive without architecture changes, external training data, or additional input features. We propose to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as its siamese counterpart for sentence-pair tasks. Across multiple datasets in paraphrasing, natural language inference, and sentiment classification, we achieve results comparable to ELMo, while using roughly 100 times fewer parameters and 15 times less inference time.
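The abstract describes distilling BERT's task-specific knowledge into a single-layer BiLSTM student. Below is a minimal PyTorch-style sketch of that setup, assuming a logit-matching distillation objective (cross-entropy on gold labels blended with an MSE term toward the teacher's logits); the student architecture, hyperparameters, and loss weighting shown here are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMStudent(nn.Module):
    """A lightweight single-layer BiLSTM classifier, the kind of student
    model the abstract contrasts with BERT."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=150, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)           # (batch, seq, embed_dim)
        _, (h_n, _) = self.bilstm(embedded)            # h_n: (2, batch, hidden_dim)
        pooled = torch.cat([h_n[0], h_n[1]], dim=-1)   # concat final fwd/bwd states
        return self.classifier(pooled)                 # class logits

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Blend standard cross-entropy on gold labels with an MSE term that
    pulls the student's logits toward the teacher's (e.g., BERT's) logits."""
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.mse_loss(student_logits, teacher_logits)
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```

In a training loop, the teacher's logits would be precomputed (or computed with gradients disabled) for each batch, and only the student's parameters would be updated against `distillation_loss`.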
PUBLICATION RECORD
- Publication year: 2019
- Publication date: 2019-03-28
- Venue: arXiv.org
- Fields of study: Linguistics, Computer Science
- Source metadata: Semantic Scholar