AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale

Published 2018 in arXiv.org

ABSTRACT

AISHELL-1 is by far the largest open-source speech corpus available for Mandarin speech recognition research. It was released with a baseline system containing solid training and testing pipelines for Mandarin ASR. In AISHELL-2, 1000 hours of clean read-speech data from iOS is published, free for academic use. On top of the AISHELL-2 corpus, an improved recipe is developed and released, containing key components for industrial applications, such as Chinese word segmentation, flexible vocabulary expansion, and phone set transformation. The pipelines support various state-of-the-art techniques, such as time-delay neural networks and the lattice-free MMI objective function. In addition, we release dev and test data from other channels (Android and Mic). For the research community, we hope that the AISHELL-2 corpus can be a solid resource for topics like transfer learning and robust ASR. For industry, we hope the AISHELL-2 recipe can be a helpful reference for building meaningful industrial systems and products.

PUBLICATION RECORD

Publication year
2018
Venue
arXiv.org
Publication date
2018-08-31
Fields of study
Computer Science, Engineering
Identifiers
arXiv 1808.10583
Source metadata
Semantic Scholar

REFERENCES

Gated Recurrent Unit Based Acoustic Modeling with Future Context
2018cited by this paper
Front-End Factor Analysis For Speaker Verification
2018cited by this paper
Multitask Learning for Phone Recognition of Underresourced Languages Using Mismatched Transcription
2018cited by this paper
AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline
2017influential reference
Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI
2016cited by this paper
THCHS-30 : A Free Chinese Speech Corpus
2015cited by this paper
A time delay neural network architecture for efficient modeling of long temporal contexts
2015cited by this paper
Microsoft COCO: Common Objects in Context
2014cited by this paper
Front-End Factor Analysis for Speaker Verification
2011cited by this paper
The Kaldi Speech Recognition Toolkit
2011influential reference
ImageNet: A large-scale hierarchical image database
2009cited by this paper
The Phonology of Standard Chinese
2001cited by this paper

CITED BY

LLM-ForcedAligner: A Non-Autoregressive and Accurate LLM-Based Forced Aligner for Multilingual and Long-Form Speech
2026cites this paper
Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects
2026cites this paper
Eureka-Audio: Triggering Audio Intelligence in Compact Language Models
2026cites this paper
DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion
2026cites this paper
ERNIE 5.0 Technical Report
2026cites this paper
SonicBench: Dissecting the Physical Perception Bottleneck in Large Audio Language Models
2026cites this paper
SAQCodec: Semantic-Acoustic Fusion and Adaptive Quantization for Ultra-Low Bitrate Speech Coding
2026cites this paper
VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models
2026cites this paper
MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models
2026cites this paper
Pseudo-Labeling Based Unsupervised Domain Adaptation for LLM-Based ASR
2026cites this paper
TC-BiMamba: Trans-Chunk bidirectionally within BiMamba for unified streaming and non-streaming ASR
2026cites this paper
DashengTokenizer: One layer is enough for unified audio understanding and generation
2026cites this paper
A Novel UTI Plane Stability Calibration and Compensation Method Based on Skull Structural Relationships for Articulatory Data Observation
2026cites this paper
FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration
2025cites this paper
A Survey of Threats Against Voice Authentication and Anti-Spoofing Systems
2025cites this paper
NEXUS-O: An Omni-Perceptive and -Interactive Model for Language, Audio, and Vision
2025cites this paper
Non-Intrusive Automatic Speech Recognition Refinement: A Survey
2025cites this paper
Masked Self-distilled Transducer-based Keyword Spotting with Semi-autoregressive Decoding
2025cites this paper
OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction
2025cites this paper
Audio-FLAN: A Preliminary Release
2025cites this paper
ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark
2025cites this paper
MFA-KWS: Effective Keyword Spotting With Multi-Head Frame-Asynchronous Decoding
2025cites this paper
UniFlow: Unifying Speech Front-End Tasks via Continuous Generative Modeling
2025cites this paper
Assessing the Expressive Language Levels of Autistic Children in Home Intervention
2025cites this paper
Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English
2025cites this paper
Kimi-Audio Technical Report
2025cites this paper
Fun-ASR Technical Report
2025cites this paper
An Evaluation Framework for an Ultrasound Imaging-Based Articulatory Observation Instrument
2025cites this paper
Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System
2025cites this paper
CabinSep: IR-Augmented Mask-Based MVDR for Real-Time In-Car Speech Separation with Distributed Heterogeneous Arrays
2025cites this paper
SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models
2025cites this paper
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
2025cites this paper
Token-Level Contextual Network with Ladder-Shaped Attention for End-to-End ASR
2025cites this paper
DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech
2025influential citation
Open-Set Speaker Identification Through Efficient Few-Shot Tuning With Speaker Reciprocal Points and Unknown Samples
2025cites this paper
DeepMine-multi-TTS: a Persian speech corpus for multi-speaker text-to-speech
2025cites this paper
OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia
2025influential citation
Index-ASR Technical Report
2025cites this paper
Inclusivity of AI Speech in Healthcare: A Decade Look Back
2025cites this paper
Improving Contextual ASR with Enhanced Phrase-Level Representation Based on MCTC Loss
2025influential citation
AnyExperts: On-Demand Expert Allocation for Multimodal Language Models with Mixture of Expert
2025cites this paper
Voice Cloning: Comprehensive Survey
2025cites this paper
LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models
2025cites this paper
Mel-McNet: A Mel-Scale Framework for Online Multichannel Speech Enhancement
2025cites this paper
Efficient Scaling for LLM-based ASR
2025cites this paper
Targeted Adversarial Examples for Attacking End-to-End Chinese Automatic Speech Recognition Systems
2025cites this paper
A review on speech recognition approaches and challenges for Portuguese: exploring the feasibility of fine-tuning large-scale end-to-end models
2025cites this paper
MSR-Codec: A Low-Bitrate Multi-Stream Residual Codec for High-Fidelity Speech Generation with Information Disentanglement
2025cites this paper
Whisper Based Speech Recognition for Emergency Services
2025cites this paper
MiDashengLM: Efficient Audio Understanding with General Audio Captions
2025cites this paper
VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing
2025cites this paper
Direction-Guided Spatial Attention for Multichannel Speech Enhancement
2025influential citation
FunAudio-ASR Technical Report
2025cites this paper
End-to-end Speech Recognition with similar length speech and text
2025cites this paper
Task-Oriented Auxiliary Class Modeling in CTC-Based Speech Keyword Spotting
2025cites this paper
WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction
2025cites this paper
Leveraging Zipformer Model for Effective Language Identification in Code-Switched Child-Directed Speech
2025cites this paper
OSUM-EChat: Enhancing End-to-End Empathetic Spoken Chatbot via Understanding-Driven Spoken Dialogue
2025cites this paper
CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR
2025cites this paper
Knowledge Distillation Method for Pruned RNN-T Models via Pruning Bounds Sharing and Losses Confusion
2025cites this paper
Constructing a Multi-Modal Based Underwater Acoustic Target Recognition Method With a Pre-Trained Language-Audio Model
2025cites this paper
Cross-Learning Fine-Tuning Strategy for Dysarthric Speech Recognition Via CDSD database
2025cites this paper
GLAP: General contrastive audio-text pretraining across domains and languages
2025cites this paper
LLM-based AI-Powered Offline Portable Transcriptor (OPT) : SURYA-TAC: A Tactical Speech-to-Speech Translation System
2025cites this paper
Towards Efficiently Whisper Fine-tuning with Monotonic Alignments
2025cites this paper
Methods of efficient speech tokenization with multilingual semantic distillation
2025cites this paper
Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems
2025cites this paper
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
2025cites this paper
Steer-MoE: Efficient Audio-Language Alignment with a Mixture-of-Experts Steering Module
2025cites this paper
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
2025cites this paper
LongCat-Flash-Omni Technical Report
2025cites this paper
Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM
2025cites this paper
TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation
2025cites this paper
Edge-collaborative multi-channel speaker verification via spatial-temporal graph with ad-hoc microphone arrays
2025cites this paper
Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction
2025cites this paper
Mamba for Streaming ASR Combined with Unimodal Aggregation
2024cites this paper
Restorative Speech Enhancement: A Progressive Approach Using SE and Codec Modules
2024cites this paper
HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models
2024cites this paper
Continual learning and its industrial applications: A selective review
2024cites this paper
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
2024cites this paper
Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?
2024cites this paper
A Quest Through Interconnected Datasets: Lessons From Highly-Cited ICASSP Papers
2024cites this paper
Optimizing Dysarthria Wake-Up Word Spotting: an End-to-End Approach For SLT 2024 LRDWWS Challenge
2024influential citation
Improving End-to-End Speech Recognition Through Conditional Cross-Modal Knowledge Distillation with Language Model
2024cites this paper
Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora
2024cites this paper
Summary on the Chat-Scenario Chinese Lipreading (ChatCLR) Challenge
2024cites this paper
Learning from Back Chunks: Acquiring More Future Knowledge for Streaming ASR Models via Self Distillation
2024cites this paper
OmniBench: Towards The Future of Universal Omni-Language Models
2024cites this paper
Symmetric structure for teacher-student home acoustic scene domain adaptation
2024cites this paper
OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities
2024cites this paper
Multi-Modal Knowledge Transfer for Target Speaker Lipreading with Improved Audio-Visual Pretraining and Cross-Lingual Fine-Tuning
2024cites this paper
Evaluation of an Improved Ultrasonic Imaging Helmet for Observing Articulatory Data
2024cites this paper
A Study of Speech Recognition Techniques for Dysarthria Speeches Based on Digit Recognition
2024cites this paper
Unsupervised Adaptive Speaker Recognition by Coupling-Regularized Optimal Transport
2024cites this paper
Whisper-SV: Adapting Whisper for low-data-resource speaker verification
2024cites this paper
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
2024cites this paper
Romanization Encoding For Multilingual ASR
2024cites this paper
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
2024cites this paper
Semi-Supervised Learning For Code-Switching ASR With Large Language Model Filter
2024cites this paper
Qwen2-Audio Technical Report
2024cites this paper