Joint Learning of Speaker and Phonetic Similarities with Siamese Networks

Neil Zeghidour, Gabriel Synnaeve, Nicolas Usunier, Emmanuel Dupoux

Published 2016 in Interspeech

ABSTRACT

Recent work has demonstrated, on small datasets, the feasibility of jointly learning specialized speaker and phone embeddings, in a weakly supervised siamese DNN architecture using word and speaker identity as side information. Here, we scale up these architectures to the 360 hours of the Librispeech corpus by implementing a sampling method to efficiently select pairs of words from the dataset and improving the loss function. We also compare the standard siamese networks fed with same (AA) or different (AB) pairs, to a ’triamese’ network fed with AAB triplets. We use ABX discrimination tasks to evaluate the discriminability and invariance properties of the obtained joined embeddings, and compare these results with mono-embeddings architectures. We find that the joined embeddings architectures succeed in effectively disentangling speaker from phoneme information, with around 10% errors for the matching tasks and embeddings (speaker task on speaker embeddings, and phone task on phone embedding) and near chance for the mismatched task. Furthermore, the results carry over in out-of-domain datasets, even beating the best results obtained with similar weakly supervised techniques.
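The ABX discrimination task named in the abstract can be illustrated with a small sketch. It is not the authors' evaluation code. It assumes each item is already a fixed-size embedding vector compared by cosine distance, with no alignment of variable-length frame sequences. The names `cosine_distance` and `abx_error_rate` are hypothetical. For every triplet in which X shares a label with A but not with B, an error is counted when X lies closer to B than to A, and a tie counts as half an error.

```python
import math
from itertools import permutations


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def abx_error_rate(items):
    """ABX error over all (A, B, X) triplets built from `items`.

    `items` is a list of (label, vector) pairs. A and X share a label,
    B carries a different one. This is a simplified, assumed setup.
    """
    errors = 0.0
    total = 0
    # Ordered pairs of distinct items that share a label play A and X.
    for (label_a, a), (label_x, x) in permutations(items, 2):
        if label_a != label_x:
            continue
        # Every item with a different label plays B.
        for label_b, b in items:
            if label_b == label_a:
                continue
            d_ax = cosine_distance(a, x)
            d_bx = cosine_distance(b, x)
            if d_ax > d_bx:
                errors += 1.0
            elif d_ax == d_bx:
                errors += 0.5
            total += 1
    if total == 0:
        raise ValueError("need two items sharing a label and one with another label")
    return errors / total


# Toy data: the label could be a phone identity or a speaker identity.
toy_items = [
    ("a", [1.0, 0.0]),
    ("a", [0.9, 0.1]),
    ("b", [0.0, 1.0]),
    ("b", [0.1, 0.9]),
]
toy_error = abx_error_rate(toy_items)
```

Under this reading, running the same function with phone labels and then with speaker labels on one embedding gives the matched and mismatched scores that the abstract contrasts: a low error on the matched task and an error near chance on the mismatched one.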

PUBLICATION RECORD

Publication year
2016
Venue
Interspeech
Publication date
2016-09-08
Fields of study
Linguistics, Computer Science
Identifiers
DOI 10.21437/Interspeech.2016-811
Source metadata
Semantic Scholar

REFERENCES

A deep scattering spectrum — Deep Siamese network pipeline for unsupervised acoustic modeling
2016, cited by this paper
The Zero Resource Speech Challenge 2015: Proposed Approaches and Results
2016, cited by this paper
Deep convolutional acoustic word embeddings using word-pair side information
2015, cited by this paper
On Invariance and Selectivity in Representation Learning
2015, cited by this paper
Librispeech: An ASR corpus based on public domain audio books
2015, cited by this paper
Empirical Evaluation of Rectified Activations in Convolutional Network
2015, cited by this paper
Evaluating speech features with the minimal-pair ABX task (II): resistance to noise
2014, cited by this paper
Phonetics embedding learning with side information
2014, cited by this paper
Weakly Supervised Multi-Embeddings Learning of Acoustic Models
2014, cited by this paper
Deep Neural Networks for extracting Baum-Welch statistics for Speaker Recognition
2014, cited by this paper
Application of Convolutional Neural Networks to Language Identification in Noisy Conditions
2014, cited by this paper
A smartphone-based ASR data collection tool for under-resourced languages
2014, cited by this paper
Evaluating speech features with the minimal-pair ABX task: analysis of the classical MFC/PLP pipeline
2013, cited by this paper
Invariant Scattering Convolution Networks
2012, cited by this paper
ADADELTA: An Adaptive Learning Rate Method
2012, influential reference
Stochastic triplet embedding
2012, cited by this paper
Large Scale Online Learning of Image Similarity Through Ranking
2009, cited by this paper
Signature Verification Using A "Siamese" Time Delay Neural Network
1993, cited by this paper
Dynamic programming algorithm optimization for spoken word recognition
1978, cited by this paper

CITED BY

Reducing Label Dependency in Human Activity Recognition with Wearables: From Supervised Learning to Novel Weakly Self-Supervised Approaches
2025, cites this paper
Domain Knowledge Based Weakly Self-Supervised Human Activity Recognition With Wearables
2024, cites this paper
Consistency Based Weakly Self-Supervised Learning for Human Activity Recognition with Wearables
2024, cites this paper
A Siamese Convolutional Neural Network for Identifying Mild Traumatic Brain Injury and Predicting Recovery
2024, cites this paper
Learn-able Evolution Convolutional Siamese Neural Network for Adaptive Driving Style Preference Prediction
2023, cites this paper
Antonymy-Synonymy Discrimination through Repelling Parasiamese Neural Networks
2023, cites this paper
A CTC Triggered Siamese Network with Spatial-Temporal Dropout for Speech Recognition
2022, cites this paper
Query-specific Subtopic Clustering
2022, cites this paper
Subtopic Clustering with a Query-Specific Siamese Similarity Metric
2021, cites this paper
Phoneme-Unit-Specific Time-Delay Neural Network for Speaker Verification
2021, cites this paper
Siamese Neural Networks: An Overview
2021, cites this paper
Feature learning for efficient ASR-free keyword spotting in low-resource languages
2021, cites this paper
Content-based Music Similarity with Triplet Networks
2020, cites this paper
Deep Similarity Learning for Soccer Team Ranking
2020, cites this paper
Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks
2020, cites this paper
The Effectiveness of Unsupervised Subword Modeling With Autoregressive and Cross-Lingual Phone-Aware Networks
2020, cites this paper
Unsupervised speech representation learning
2020, cites this paper
L2 Mispronunciation Verification Based on Acoustic Phone Embedding and Siamese Networks
2020, cites this paper
Wavesplit: End-to-End Speech Separation by Speaker Clustering
2020, cites this paper
Unsupervised Feature Learning for Speech Using Correspondence and Siamese Networks
2020, influential citation
Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge
2020, cites this paper
Weakly Supervised Multi-Task Representation Learning for Human Activity Analysis Using Wearables
2020, cites this paper
Pre-training of Speaker Embeddings for Low-latency Speaker Change Detection in Broadcast News
2019, cites this paper
Siamese Networks for Weakly Supervised Human Activity Recognition
2019, cites this paper
Unsupervised representation learning for anomaly detection on neuroimaging. Application to epilepsy lesion detection on brain MRI. (Apprentissage de représentation non supervisé pour la détection d'anomalies en neuroimagerie. Application à la détection de lésions épileptogènes en Imagerie IRM)
2019, cites this paper
Bottom-Up Unsupervised Word Discovery via Acoustic Units
2019, cites this paper
Low-resource speech translation
2019, cites this paper
Similarity Metric Based on Siamese Neural Networks for Voice Casting
2019, cites this paper
Learning representations of speech from the raw waveform. (Apprentissage de représentations de la parole à partir du signal brut)
2019, cites this paper
Combining Adversarial Training and Disentangled Speech Representation for Robust Zero-Resource Subword Modeling
2019, cites this paper
Audio Word2vec: Sequence-to-Sequence Autoencoding for Unsupervised Learning of Audio Segmentation and Representation
2019, cites this paper
NiHA: A Conscious Agent
2018, cites this paper
Low-Resource Speech-to-Text Translation
2018, cites this paper
Unspeech: Unsupervised Speech Context Embeddings
2018, cites this paper
Exploiting Speaker and Phonetic Diversity of Mismatched Language Resources for Unsupervised Subword Modeling
2018, cites this paper
Deep Siamese Architecture Based Replay Detection for Secure Voice Biometric
2018, cites this paper
Phoneme Based Embedded Segmental K-Means for Unsupervised Term Discovery
2018, cites this paper
Multilingual and Unsupervised Subword Modeling for Zero-Resource Languages
2018, cites this paper
L2 Mispronunciation Verification Based on Acoustic Phone Embedding and Siamese Networks
2018, cites this paper
Content-Style Decomposition: Representation Discovery and Applications
2018, influential citation
An embedded segmental K-means model for unsupervised segmentation and clustering of speech
2017, cites this paper
Towards Learning Semantic Audio Representations from Unlabeled Data
2017, cites this paper
Siamese Autoencoders for Speech Style Extraction and Switching Applied to Voice Identification and Conversion
2017, cites this paper
Content-Based Representations of Audio Using Siamese Neural Networks
2017, cites this paper
Learning Weakly Supervised Multimodal Phoneme Embeddings
2017, cites this paper
Unsupervised Learning of Semantic Audio Representations
2017, cites this paper
Towards speech-to-text translation without speech recognition
2017, cites this paper
A segmental framework for fully-unsupervised large-vocabulary speech recognition
2016, cites this paper
A Survey of Inductive Biases for Factorial Representation-Learning
2016, influential citation