AudioSlots: A Slot-Centric Generative Model for Audio Separation
P. Reddy, Scott Wisdom, Klaus Greff, J. Hershey, Thomas Kipf
Published 2023 in 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)
ABSTRACT
In a range of recent works, object-centric architectures have been shown to be well suited to unsupervised scene decomposition in the vision domain. Inspired by these methods, we present AudioSlots, a slot-centric generative model for blind source separation in the audio domain. AudioSlots is built from permutation-equivariant encoder and decoder networks. The encoder network, based on the Transformer architecture, learns to map a mixed audio spectrogram to an unordered set of independent source embeddings. The spatial broadcast decoder network learns to generate the source spectrograms from these embeddings. We train the model end to end with a permutation-invariant loss function. Our results on Libri2Mix speech separation constitute a proof of concept that this approach shows promise. We discuss the results and limitations of our approach in detail, outline potential ways to overcome those limitations, and suggest directions for future work.
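The permutation-invariant loss mentioned in the abstract can be sketched as follows: since the model emits an *unordered* set of source spectrograms, the loss is computed under the best one-to-one matching between predictions and targets, found with the Hungarian algorithm. This is a minimal illustrative sketch assuming a mean-squared-error cost between spectrograms; the function name and array shapes are our own, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def permutation_invariant_mse(pred, target):
    """Permutation-invariant MSE between two sets of source spectrograms.

    pred, target: arrays of shape (num_sources, freq_bins, time_frames).
    Returns the mean squared error under the optimal source-to-target
    assignment (Hungarian matching on the pairwise MSE cost matrix).
    """
    n = pred.shape[0]
    # Pairwise cost: MSE between each predicted and each target source.
    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            cost[i, j] = np.mean((pred[i] - target[j]) ** 2)
    # Optimal bipartite matching minimizing the total cost.
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()
```

Because the loss minimizes over all matchings, swapping the order of the predicted sources leaves it unchanged, which is what lets the encoder output an unordered set of slots.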
PUBLICATION RECORD
- Publication year: 2023
- Publication date: 2023-05-09
- Fields of study: Computer Science, Engineering
- Record source: Semantic Scholar