Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, Karen Simonyan
Published 2017 in International Conference on Machine Learning
ABSTRACT
Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive.
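The abstract describes morphing between instruments by interpolating in the model's learned embedding space. A minimal sketch of that interpolation step is below, assuming (hypothetically) that each note is encoded as a temporal embedding of shape `(time_steps, channels)`; the function name, shapes, and the random stand-in embeddings are illustrative, and the WaveNet decoder that would turn the mixed embedding back into audio is not reproduced here.

```python
import numpy as np

def interpolate_embeddings(z_a, z_b, alpha):
    """Linearly interpolate two embedding sequences.

    alpha = 0.0 returns z_a, alpha = 1.0 returns z_b; intermediate
    values trace a straight path on the embedding manifold, which the
    paper reports decodes to a perceptual blend of the two timbres.
    """
    z_a = np.asarray(z_a, dtype=np.float64)
    z_b = np.asarray(z_b, dtype=np.float64)
    assert z_a.shape == z_b.shape, "embeddings must share a shape"
    return (1.0 - alpha) * z_a + alpha * z_b

# Toy example with random stand-ins for two instruments' embeddings
# (shapes chosen for illustration only).
rng = np.random.default_rng(0)
z_flute = rng.standard_normal((125, 16))   # e.g. 125 time steps, 16 channels
z_organ = rng.standard_normal((125, 16))
z_mix = interpolate_embeddings(z_flute, z_organ, 0.5)
```

In practice the decoder would be conditioned on `z_mix` to synthesize the morphed waveform; the interpolation itself is just the elementwise convex combination shown here.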
PUBLICATION RECORD
- Publication year: 2017
- Venue: International Conference on Machine Learning
- Publication date: 2017-04-05
- Fields of study: Mathematics, Computer Science
- Source metadata: Semantic Scholar