Pay Less Attention with Lightweight and Dynamic Convolutions

Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, Michael Auli

Published 2019 in International Conference on Learning Representations

ABSTRACT

Self-attention is a useful mechanism for building generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively with the best reported self-attention results. Next, we introduce dynamic convolutions, which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling, and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT'14 English-German test set, dynamic convolutions achieve a new state of the art of 29.7 BLEU.
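The idea of predicting a kernel from the current time step alone can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function name `dynamic_conv`, the single linear projection `w_proj`, the softmax normalization over the kernel width, the sharing of one kernel across the channels of a head, and the causal left padding are all assumptions made for illustration, since the abstract does not spell out these details.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_conv(x, w_proj, num_heads, kernel_size):
    """Causal dynamic convolution over a sequence (illustrative sketch).

    x:      (T, d) input sequence
    w_proj: (d, num_heads * kernel_size) hypothetical projection that
            predicts one kernel per head from the current time step only
    Returns an array of shape (T, d).
    """
    T, d = x.shape
    assert d % num_heads == 0, "channels must divide evenly into heads"

    # One kernel per time step and head, predicted from x[t] alone
    # (assumption: normalized with a softmax over the kernel width).
    kernels = softmax((x @ w_proj).reshape(T, num_heads, kernel_size), axis=-1)

    # Left-pad with zeros so position t only sees x[t-k+1 .. t].
    padded = np.concatenate([np.zeros((kernel_size - 1, d)), x], axis=0)
    # windows[t, j, c] = x[t + j - (k - 1), c], zero where the index is negative.
    windows = np.stack([padded[j:j + T] for j in range(kernel_size)], axis=1)

    # Every channel inside a head uses that head's kernel: (T, d, k).
    per_channel = np.repeat(kernels, d // num_heads, axis=1)

    # Weighted sum over the window: T * d * k multiply-adds in total,
    # which grows linearly with the sequence length T.
    return np.einsum("tck,tkc->tc", per_channel, windows)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(6, 8))
    w = rng.normal(size=(8, 2 * 3))
    print(dynamic_conv(x, w, num_heads=2, kernel_size=3).shape)
```

In this sketch the cost per position depends only on the kernel width and the channel count, so doubling the sequence length doubles the work. Self-attention, by contrast, compares every position with every other position, which is where its quadratic cost comes from.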

PUBLICATION RECORD

Publication year: 2019
Venue: International Conference on Learning Representations
Publication date: 2019-01-29
Fields of study: Computer Science
Identifiers: arXiv 1901.10430
Source metadata: Semantic Scholar

CITED BY

Extraction of the liver falciform ligament and ridgeline in preoperative point clouds based on PointNet++ and transformer models
2026, cites this paper
Product Interaction: An Algebraic Formalism for Deep Learning Architectures
2026, cites this paper
ICSnet: An efficient object detection network for industrial complex scenes
2026, cites this paper
Directional Reasoning Trajectory Change (DRTC): Identifying Critical Trace Segments in Reasoning Models
2026, cites this paper
Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
2025, cites this paper
Optimization model of small target detection for vehicle extraction in virtual simulation scenarios
2025, cites this paper
An aerial point cloud classification using point transformer via multi-feature fusion
2025, cites this paper
AI-Based Approaches for Brazilian Sign Language Recognition: A Systematic Literature Review: Insights on Methods, Metrics, and Resources for LIBRAS Recognition
2025, cites this paper
GraphMind: Context-Aware Multi-Agent Systems With Graph Attention Autoencoder and Large Language Model Integration
2025, cites this paper
From Emojis to Emotions: Abstractive Dialogue Summarization with Emotional Supervision Signals
2025, cites this paper
A Chinese-Japanese Parallel Corpus for Document-level Neural Machine Translation Based on Web-Crawled News Data
2025, cites this paper
KANDU-Net: Enhancing Global Context Capture in Medical Image Segmentation with Kolmogorov-Arnold Networks
2025, cites this paper
Multi-axis compression fusion network for vehicle re-identification
2025, cites this paper
Predicting academic performance for students’ university: case study from Saint Cloud State University
2025, cites this paper
The FFT Strikes Back: An Efficient Alternative to Self-Attention
2025, cites this paper
Neural Machine Translation for Agglutinative Languages via Data Rejuvenation
2025, cites this paper
Stream-ViT: Learning Streamlined Convolutions in Vision Transformer
2025, cites this paper
Don't Pay Attention
2025, cites this paper
Chinese NER for UAV Fault Texts via Local–Global Joint Modeling and Diffusion-Based Semantic Denoising
2025, influential citation
Pre-Training a Graph Recurrent Network for Text Understanding
2025, cites this paper
Rethinking Natural Language Generation with Layer-Wise Multi-View Decoding
2025, influential citation
EA-DETR: Edge-Aware Detection Transformer for Water Surface Floating Object Identification
2025, cites this paper
A Comprehensive Survey on Transformer-Based Machine Translation: Identifying Research Gaps and Solutions for Large Language Models
2025, cites this paper
Ionospheric TEC prediction using the non-stationary inverted transformer fusion model and its performance in Chinese region
2025, cites this paper
Lite Mongolian-Chinese Neural Machine Translation: Dynamic Convolution with Long-Range Attention
2025, cites this paper
Con-GBERT: Convolutional Attention-Based GBERT for Grapheme-to-Phoneme Conversion in Low-Resource Zhuang Language
2025, cites this paper
Hybrid translation for sign languages: combining rule-based and neural machine translation in a low-resource scenario
2025, cites this paper
Long Context Automated Essay Scoring with Language Models
2025, cites this paper
Where to Add PDE Diffusion in Transformers
2025, cites this paper
MGMA-PPIS: Predicting the protein–protein interaction site with multiview graph embedding and multiscale attention fusion
2025, cites this paper
CDC-YOLOFusion: Leveraging Cross-Scale Dynamic Convolution Fusion for Visible-Infrared Object Detection
2025, cites this paper
MPKD-DCFI: multi-path knowledge distillation via dynamic contextual feature interaction
2025, cites this paper
Ensemble-Based Survival Models with the Self-Attended Beran Estimator Predictions
2025, cites this paper
NN-Former: Rethinking Graph Structure in Neural Architecture Representation
2025, cites this paper
DSRS: DELIGHT sequential recommender system
2025, cites this paper
Revisiting Backdoor Attacks on Time Series Classification in the Frequency Domain
2025, cites this paper
VortexTransformer: End‐to‐End Objective Vortex Detection in 2D Unsteady Flow Using Transformers
2025, cites this paper
Dynamic convolution models for cross-frontend keyword spotting
2025, cites this paper
Multi-Token Attention
2025, cites this paper
An Empirical Evaluation of Encoder Architectures for Fast Real-Time Long Conversational Understanding
2025, cites this paper
FedDGA: Federated Multitask Learning Based on Dynamic Guided Attention
2025, cites this paper
Multi-refined Feature Enhanced Sentiment Analysis Using Contextual Instruction
2025, cites this paper
Adversarial Attention Deficit: Fooling Deformable Vision Transformers with Collaborative Adversarial Patches
2025, cites this paper
Dynamic Convolution and Transformer Based Dual-Branch Coding in Semantic Communication System
2025, cites this paper
CDS-YOLO: a real-time infrared small target detection model
2025, cites this paper
Topology-Aware Exploration of Energy-Based Models Equilibrium: Toric QC-LDPC Codes and Hyperbolic MET QC-LDPC Codes
2024, cites this paper
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
2024, cites this paper
Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition
2024, cites this paper
Revisiting the Markov Property for Machine Translation
2024, cites this paper
Improving Threat Mitigation Through a Cybersecurity Risk Management Framework: A Computational Design Science Approach
2024, cites this paper
RawConvNet: An End to End Network for MI-EEG Decoding with Attention Mechanism and No Preprocessing
2024, cites this paper
HPE-Li: WiFi-Enabled Lightweight Dual Selective Kernel Convolution for Human Pose Estimation
2024, cites this paper
Overall Design and Physical Validation of Voice Interaction Based on the ChatGPT Humanoid Robot Brain
2024, cites this paper
Dynamic convolution for image matching
2024, cites this paper
Analysis of Blood Cell Image Recognition Methods Based on Improved CNN and Vision Transformer
2024, cites this paper
Efficient Machine Translation with a BiLSTM-Attention Approach
2024, cites this paper
Traffic signal current prediction algorithm based on CNN and LSTM
2024, cites this paper
Enhancing Education for Deaf People: A Systematic Review of NLP Strategies for Automatic Translation From Portuguese to Brazilian Sign Language
2024, cites this paper
Optical Flow as Spatial-Temporal Attention Learners
2024, cites this paper
Mitigating Knowledge Conflicts in Data-to-Text Generation via the Internalization of Fact Extraction
2024, cites this paper
PMF-SLAM: Pose-Guided and Multiscale Feature Interaction-Based Semantic SLAM for Autonomous Wheel Loader
2024, cites this paper
AaDR-PointCloud: An integrated point cloud processing network using attention and deep residual
2024, cites this paper
Scaling Up Your Kernels: Large Kernel Design in ConvNets Toward Universal Representations
2024, cites this paper
Latent Semantic and Disentangled Attention
2024, cites this paper
Deep Fuzzy Multiteacher Distillation Network for Medical Visual Question Answering
2024, cites this paper
IEA-Net: Internal and External Dual-Attention Medical Segmentation Network with High-Performance Convolutional Blocks
2024, cites this paper
big.LITTLE Vision Transformer for Efficient Visual Recognition
2024, cites this paper
TransfoRhythm: A Transformer Architecture Conductive to Blood Pressure Estimation via Solo PPG Signal Capturing
2024, cites this paper
Translation model based on discrete Fourier transform and Skipping Sub-Layer methods
2024, cites this paper
A Novel and Efficient Framework for Diagnosing ECG Signals Based on the Digital Signal Processing and Optimized Transformer Model
2024, cites this paper
Stereo-Knowledge Distillation from dpMV to Dual Pixels for Light Field Video Reconstruction
2024, cites this paper
DDCTNet: A Deformable and Dynamic Cross-Transformer Network for Road Extraction From High-Resolution Remote Sensing Images
2024, cites this paper
SMSTracker: A Self-Calibration Multi-Head Self-Attention Transformer for Visual Object Tracking
2024, cites this paper
Joint features-guided linear transformer and CNN for efficient image super-resolution
2024, cites this paper
Soul-Mix: Enhancing Multimodal Machine Translation with Manifold Mixup
2024, cites this paper
GLULA: Linear attention-based model for efficient human activity recognition from wearable sensors
2024, cites this paper
A dynamic attention mechanism for object detection in road or strip environments
2024, cites this paper
Enhanced Channel-Temporal Transformer with Neural Architecture Search For Multivariate Time Series Forecasting
2024, cites this paper
Multipath Attention and Adaptive Gating Network for Video Action Recognition
2024, cites this paper
Bridging Visual Representation and Efficiency for Resource-Constrained Devices
2024, cites this paper
Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation
2024, cites this paper
Convolutions are competitive with transformers for protein sequence pretraining
2024, cites this paper
SST: Multi-Scale Hybrid Mamba-Transformer Experts for Time Series Forecasting
2024, cites this paper
On the Performance Analysis of Momentum Method: A Frequency Domain Perspective
2024, cites this paper
Weakly Supervised Learning Method for Semantic Segmentation of Large-Scale 3D Point Cloud Based on Transformers
2024, cites this paper
Crossing Linguistic Barriers: A Hybrid Attention Framework for Chinese-Arabic Machine Translation
2024, cites this paper
TRELM: Towards Robust and Efficient Pre-training for Knowledge-Enhanced Language Models
2024, cites this paper
OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation
2024, cites this paper
A neural machine translation method based on split graph convolutional self-attention encoding
2024, cites this paper
Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling
2024, cites this paper
A dual-branch neural network for crop disease recognition by integrating frequency domain and spatial domain information
2024, cites this paper
Personalized Cadence Awareness for Next Basket Recommendation
2024, cites this paper
MambaOut: Do We Really Need Mamba for Vision?*
2024, cites this paper
Memory Efficient Neural Speech Synthesis Based on FastSpeech2 Using Attention Free Transformer
2024, cites this paper
Stacking Diverse Architectures to Improve Machine Translation
2023, influential citation
Gradient-based Gradual Pruning for Language-Specific Multilingual Neural Machine Translation
2023, cites this paper
Gated Linear Attention Transformers with Hardware-Efficient Training
2023, cites this paper
Sentiment Analysis on Streaming User Reviews via Dual-Channel Dynamic Graph Neural Network
2023, cites this paper
Extracting long‐term spatiotemporal characteristics of traffic flow using attention‐based convolutional transformer
2023, cites this paper
Heterogeneous Encoders Scaling in the Transformer for Neural Machine Translation
2023, cites this paper