Unsupervised Learning of Semantic Audio Representations

A. Jansen, Manoj Plakal, R. Pandya, D. Ellis, Shawn Hershey, Jiayang Liu, R. C. Moore, R. Saurous

Published 2017 in IEEE International Conference on Acoustics, Speech, and Signal Processing

ABSTRACT

Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the categories of the constituents, and (iii) the categories of events in close temporal proximity are likely to be the same or related. Without labels to ground them, these constraints are incompatible with classification loss functions. However, they may still be leveraged to identify geometric inequalities needed for triplet loss-based training of convolutional neural networks. The result is low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively. Moreover, in limited-supervision settings, our unsupervised embeddings double the state-of-the-art classification performance.
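
The three constraints in the abstract can be read as recipes for building (anchor, positive, negative) triplets from unlabeled spectrograms. The sketch below is an illustration only, not the authors' implementation. It assumes spectrograms are NumPy arrays shaped [time, frequency], uses a common hinge form of triplet loss on squared Euclidean distances, and all function names, the margin, the noise level, and the circular time shift are hypothetical choices made for the example.

```python
import numpy as np


def triplet_hinge_loss(anchor, positive, negative, margin=0.1):
    """Mean hinge triplet loss over a batch of embeddings shaped [batch, dim].

    Penalizes triplets where the anchor is not closer to the positive than
    to the negative by at least `margin` (an assumed hyperparameter).
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


def noisy_positive(spec, rng, sigma=0.1):
    """Constraint (i): added noise should not change the sound category."""
    return spec + sigma * rng.standard_normal(spec.shape)


def shifted_positive(spec, rng, max_shift=10):
    """Constraint (i): a translation in time keeps the category.

    A circular shift along the time axis is an assumption of this sketch.
    """
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(spec, shift, axis=0)


def mixture_positive(spec, other, weight=0.5):
    """Constraint (ii): a mixture inherits the categories of its parts."""
    return spec + weight * other
```

Constraint (iii) needs no transformation: under the same assumptions, a positive would simply be another segment drawn from near the anchor in the same recording. In each case the negative is a different example, and the embeddings passed to `triplet_hinge_loss` would come from the network being trained.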

PUBLICATION RECORD

Publication year: 2017
Venue: IEEE International Conference on Acoustics, Speech, and Signal Processing
Publication date: 2017-11-06
Fields of study: Mathematics, Computer Science, Engineering
Identifiers: DOI 10.1109/ICASSP.2018.8461684; arXiv 1711.02209
Source metadata: Semantic Scholar
