Neural Discrete Representation Learning

Aäron van den Oord, O. Vinyals, K. Kavukcuoglu

Published 2017 in Neural Information Processing Systems

ABSTRACT

Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
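
The abstract does not spell out the quantisation step, so the following is only a minimal sketch of generic nearest-neighbour vector quantisation, the idea the abstract says the model borrows. The function name `quantise`, the array names `z_e` and `codebook`, and the toy values are all assumptions made for illustration; training details such as the losses and gradient handling are not covered here.

```python
import numpy as np

def quantise(z_e, codebook):
    """Map each continuous encoder output to its nearest codebook vector.

    z_e:      (N, D) array of encoder outputs (hypothetical name)
    codebook: (K, D) array of embedding vectors (hypothetical name)
    Returns (indices, z_q): the discrete codes and the quantised vectors.
    """
    # Squared Euclidean distance from every input to every codebook entry: shape (N, K)
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)  # discrete latent code for each input
    z_q = codebook[indices]         # quantised vectors a decoder would consume
    return indices, z_q

# Toy example: three 2-D codebook entries and three encoder outputs
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z_e = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]])
indices, z_q = quantise(z_e, codebook)
```

In this toy setup each input lies closest to a different codebook entry, so `indices` selects entries 0, 1 and 2 in order. A prior over such integer codes is what the abstract describes as being learnt rather than fixed.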

PUBLICATION RECORD

Publication year: 2017
Venue: Neural Information Processing Systems
Publication date: 2017-11-02
Fields of study: Mathematics, Computer Science
Identifiers: arXiv 1711.00937
Source metadata: Semantic Scholar

REFERENCES

Generative Adversarial Nets (2018, cited by this paper)
Improved Variational Autoencoders for Text Modeling using Dilated Convolutions (2017, cited by this paper)
Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations (2017, cited by this paper)
Soft-to-Hard Vector Quantization for End-to-End Learned Compression of Images and Neural Networks (2017, cited by this paper)
Lossy Image Compression with Compressive Autoencoders (2017, cited by this paper)
Unsupervised Learning for Physical Interaction through Video Prediction (2016, cited by this paper)
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network (2016, cited by this paper)
InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets (2016, cited by this paper)
Variational Inference for Monte Carlo Objectives (2016, cited by this paper)
Image-to-Image Translation with Conditional Adversarial Networks (2016, cited by this paper)
WaveNet: A Generative Model for Raw Audio (2016, cited by this paper)
Towards Conceptual Compression (2016, influential reference)
Pixel Recurrent Neural Networks (2016, cited by this paper)
Density estimation using Real NVP (2016, cited by this paper)
Semi-Supervised Learning with Context-Conditional Generative Adversarial Networks (2016, cited by this paper)
Improved Variational Inference with Inverse Autoregressive Flow (2016, cited by this paper)
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model (2016, cited by this paper)
Video Pixel Networks (2016, cited by this paper)
The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables (2016, cited by this paper)
Conditional Image Generation with PixelCNN Decoders (2016, cited by this paper)
Variational Lossy Autoencoder (2016, cited by this paper)
One-shot Learning with Memory-Augmented Neural Networks (2016, cited by this paper)
PixelVAE: A Latent Variable Model for Natural Images (2016, cited by this paper)
SUPERSEDED - CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (2016, cited by this paper)
Categorical Reparameterization with Gumbel-Softmax (2016, cited by this paper)
Variational Inference with Normalizing Flows (2015, cited by this paper)
Librispeech: An ASR corpus based on public domain audio books (2015, cited by this paper)
Importance Weighted Autoencoders (2015, cited by this paper)
Generating Sentences from a Continuous Space (2015, cited by this paper)
Adam: A Method for Stochastic Optimization (2014, cited by this paper)
Stochastic Backpropagation and Approximate Inference in Deep Generative Models (2014, cited by this paper)
Neural Variational Inference and Learning in Belief Networks (2014, cited by this paper)
Show and tell: A neural image caption generator (2014, cited by this paper)
Efficient Learning of Domain-invariant Image Representations (2013, cited by this paper)
Deep AutoRegressive Networks (2013, cited by this paper)
Auto-Encoding Variational Bayes (2013, cited by this paper)
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation (2013, cited by this paper)
A Spike and Slab Restricted Boltzmann Machine (2011, cited by this paper)
Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion (2010, cited by this paper)
Deep Boltzmann Machines (2009, cited by this paper)
Reducing the Dimensionality of Data with Neural Networks (2008, cited by this paper)
Supporting Online Material for Reducing the Dimensionality of Data with Neural Networks (2006, cited by this paper)
Reinforcement Learning: An Introduction (1998, cited by this paper)

CITED BY

SCAR-GS: Spatial Context Attention for Residuals in Progressive Gaussian Splatting (2026, cites this paper)
Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models (2026, cites this paper)
CCR-Net: Surpassing baselines in irony detection via supportive-conflictive attention fusion (2026, cites this paper)
FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching (2026, cites this paper)
Linear Complexity Self-Supervised Learning for Music Understanding with Random Quantizer (2026, cites this paper)
Multimodal Brownian bridge diffusion model for controllable synthetic medical image generation (2026, cites this paper)
Deep learning assessment of nativeness and pairing likelihood for antibody and nanobody design with AbNatiV2 (2026, cites this paper)
Hybrid semantic segmentation with broad context and attention encoded network for urban street scenario (2026, cites this paper)
Learning Diffusion Policy from Primitive Skills for Robot Manipulation (2026, cites this paper)
When Tone and Words Disagree: Towards Robust Speech Emotion Recognition under Acoustic-Semantic Conflict (2026, cites this paper)
An Approach to Cross-Domain Recognition with Small Sample Data for Gear Fault Diagnosis (2026, cites this paper)
R2BD: A Reconstruction-Based Method for Generalizable and Efficient Detection of Fake Images (2026, cites this paper)
CoCoFR: Collaborative codebooks learning with soft matching strategy for blind face restoration (2026, cites this paper)
You Only Transmit Once: Unified Generation and Comprehension for Efficient Semantic Communication (2026, cites this paper)
Fine-Detailed Facial Sketch-to-Photo Synthesis With Detail-Enhanced Codebook Priors (2026, cites this paper)
TDtoon: Two-stage diffusion for controllable cartoon video generation via sketch enhancement (2026, cites this paper)
SynthFed: Privacy-preserving long-tail ophthalmic diagnosis via VQ-VAE and GPT-augmented federated learning (2026, cites this paper)
Toward Reliable Multimodal Beam Prediction in mmWave Communications via Probabilistic Embedding and Uncertainty-Aware (2026, cites this paper)
Adaptability of Vision Foundation Models for 3D Medical Image Segmentation (2026, cites this paper)
Improving Flexible Image Tokenizers for Autoregressive Image Generation (2026, cites this paper)
Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test (2026, cites this paper)
TokenSeg: Efficient 3D Medical Image Segmentation via Hierarchical Visual Token Compression (2026, influential citation)
Inference-Time Scaling for Visual AutoRegressive Modeling by Searching Representative Samples (2026, cites this paper)
SparseOccVLA: Bridging Occupancy and Vision-Language Models via Sparse Queries for Unified 4D Scene Understanding and Planning (2026, cites this paper)
Generating Distance-Aware Human-to-Human Interaction Motions From Text Guidance (2026, cites this paper)
Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation (2026, cites this paper)
PredLDM: Spatiotemporal Sequence Prediction with Latent Diffusion Models (2026, cites this paper)
PromptTrace: A Fine-Grained Prompt Stealing Attack via CLIP-Guided Beam Search for Text-to-Image Models (2026, cites this paper)
UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation (2026, cites this paper)
Variational and Generative Models With Quantization for Disentanglement and Compressed Sensing of GNSS Spectrograms (2026, cites this paper)
Deep Clustering with Associative Memories (2026, cites this paper)
Lightweight Digital Semantic Communication Based on DeepReceiver (2026, cites this paper)
Regression in Earth Observation: Are vision–language models up to the challenge? (2026, cites this paper)
GCP-VQVAE: A Geometry-Complete Language for Protein 3D Structure (2026, cites this paper)
SnapSound: Empowering everyone to customize sound experience with Generative AI (2026, cites this paper)
FQCDM: Feature quantization-based cardiac image diffusion synthesis model (2026, cites this paper)
Generative Model for 2.5D-Assisted Future Urban Remote Sensing Image Synthesis (2026, cites this paper)
PC-NSVC: An End-to-End Neural Scalable Vibrotactile Codec With Psychohaptic Calibration (2026, cites this paper)
Neural Audio Synthesis for Sound Effects: A Scope Review (2026, cites this paper)
Causal-ESC: Capture the Dynamics in Cause-and-Effect Detection for Emotional Support Conversation (2026, cites this paper)
Memory Bank Compression for Continual Adaptation of Large Language Models (2026, cites this paper)
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation (2026, cites this paper)
VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation (2026, cites this paper)
GeoDiff-SAR: A Geometric Prior Guided Diffusion Model for SAR Image Generation (2026, cites this paper)
CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos (2026, cites this paper)
A Vision for Multisensory Intelligence: Sensing, Science, and Synergy (2026, cites this paper)
VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction (2026, cites this paper)
GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models (2026, cites this paper)
Video Generation Models in Robotics - Applications, Research Challenges, Future Directions (2026, cites this paper)
Object-Centric World Models Meet Monte Carlo Tree Search (2026, cites this paper)
Breaking through data scarcity: A novel diffusion model approach for snoring sound augmentation and classification (2026, cites this paper)
Facial appearance prediction for orthognathic surgery with diffusion models (2026, cites this paper)
Decoding Order Matters in Autoregressive Speech Synthesis (2026, cites this paper)
Diagnosing and Improving Vector-Quantization-Based Blind Image Restoration (2026, influential citation)
VQ-Seg: Vector-Quantized Token Perturbation for Semi-Supervised Medical Image Segmentation (2026, cites this paper)
HeartMuLa: A Family of Open Sourced Music Foundation Models (2026, cites this paper)
Deep learning-enabled rapid inverse design of high-performance meta-atoms (2026, cites this paper)
Improving precipitation nowcasting via multiphysical parameter fusion in radar echo extrapolation (2026, cites this paper)
TimeMar: Multi-Scale Autoregressive Modeling for Unconditional Time Series Generation (2026, cites this paper)
ATATA: One Algorithm to Align Them All (2026, cites this paper)
Democratizing planetary-scale analysis: An ultra-lightweight Earth embedding database for accurate and flexible global land monitoring (2026, cites this paper)
R3VQ: Redundancy-Reduced Residual Vector Quantization for Low-Bitrate Neural Speech Coding (2026, influential citation)
DiViCo: Disentangled Visual Token Compression for Efficient Large Vision-Language Model (2026, cites this paper)
CtD: Composition through Decomposition in Emergent Communication (2026, influential citation)
Deep Variational Autoencoder-Based Parameter Learning of Bayesian Network With Multiple Latent Variables (2026, cites this paper)
CPIG: Leveraging Consistency Policy With Intention Guidance for Multiagent Exploration (2026, cites this paper)
Knowledge Base Autoencoder Framework: A Novel Approach for Continuous Phase Shift Compression in RIS-Aided Communications (2026, cites this paper)
Prompts Libra: Enhanced Image Outpainting Diffusion Model With Balanced Bimodal Guidance (2026, cites this paper)
Cooperative ISAC for Joint Localization and Velocity Estimation in Cell-Free MIMO Systems (2026, influential citation)
Taming Learnable Codebook Design and Modulation for Digital Semantic Image Communication (2026, cites this paper)
Amplifying discriminative distortions: A generative latent feature reinforcement framework for audio spoofing detection (2026, cites this paper)
Vision-Language Models Unlock Task-Centric Latent Actions (2026, cites this paper)
A Survey on Deep Generative Models for Robot Learning From Multimodal Demonstrations (2026, cites this paper)
DAS-Accelerometer Data Fusion With Semi-Supervised Graph Variational Autoencoder for In-Service Train Wheel Flat Detection (2026, cites this paper)
Mixture of emotions: Global-to-local emotion representation extraction for emotion recognition in conversation (2026, cites this paper)
Clustered Federated Learning to Support Context-Dependent CSI Decoding (2026, cites this paper)
Gram: A Large General EEG Model for Raw Data Classification and Restoration (2026, cites this paper)
Physics-Aware Multichannel Vector Quantization for Hybrid Digital Twin Modeling of UAV Systems (2026, cites this paper)
Conditional Entropy-Constrained Multi-Stage Vector Quantization for Semantic Communication (2026, influential citation)
VQ-DeepVSC: A Dual-Stage Vector Quantization System for Video Semantic Communication (2026, cites this paper)
High-Capacity Image Steganography via Latent Diffusion Models (2026, cites this paper)
Fault diagnosis in rotating machinery with discretized signal representation leveraging large language models (2026, cites this paper)
Generative AI-Enabled Semantic Communication: State-of-the-Art, Applications, and the Way Ahead (2026, cites this paper)
Generative modeling for mid-term probabilistic load forecasting based on latent diffusion (2026, cites this paper)
LooC: Effective Low-Dimensional Codebook for Compositional Vector Quantization (2026, influential citation)
Categorical Reparameterization with Denoising Diffusion models (2026, cites this paper)
Achieving Fine-grained Cross-modal Understanding through Brain-inspired Hierarchical Representation Learning (2026, cites this paper)
ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors (2026, cites this paper)
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation (2026, cites this paper)
IndexTTS 2.5 Technical Report (2026, cites this paper)
ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation (2026, influential citation)
Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control (2026, cites this paper)
Prompting Underestimates LLM Capability for Time Series Classification (2026, cites this paper)
Learning Latent Action World Models In The Wild (2026, cites this paper)
CHDP: Cooperative Hybrid Diffusion Policies for Reinforcement Learning in Parameterized Action Space (2026, cites this paper)
SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving (2026, cites this paper)
LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction (2026, cites this paper)
Vector Quantized-Aided XL-MIMO CSI Feedback with Channel Adaptive Transmission (2026, cites this paper)
SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis (2026, influential citation)
FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions (2026, cites this paper)