Low Complexity Speech Enhancement Network Based on Frame-Level Swin Transformer

Weiqi Jiang, Chengli Sun, Feilong Chen, Y. Leng, Qiaosheng Guo, Jiayi Sun, Jiankun Peng

Published 2023 in Electronics

ABSTRACT

In recent years, the Transformer has shown strong performance in speech enhancement by applying multi-head self-attention to capture long-term dependencies effectively. However, the cost of Transformer self-attention grows quadratically with the size of the input speech spectrogram, which makes it computationally expensive for practical use. In this paper, we propose a low-complexity hierarchical frame-level Swin Transformer network (FLSTN) for speech enhancement. FLSTN takes several consecutive frames as a local window and restricts self-attention to that window, reducing the complexity to linear in the spectrogram size. A shifted window mechanism enhances information exchange between adjacent windows, so that window-based local attention acts as global attention in disguise. The hierarchical structure allows FLSTN to learn speech features at different scales. Moreover, we design a band merging layer and a band expanding layer to decrease and increase the spatial resolution of feature maps, respectively. We test FLSTN on both 16 kHz wide-band speech and 48 kHz full-band speech. Experimental results demonstrate that FLSTN handles speech with different bandwidths well. With very few multiply–accumulate operations (MACs), FLSTN not only has a significant advantage in computational complexity but also achieves objective speech quality metrics comparable to those of current state-of-the-art (SOTA) models.
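
The complexity argument in the abstract can be illustrated with a small NumPy sketch. It is not the authors' FLSTN implementation: the function names, the single attention head, the absence of learned projections, and the window and shift sizes are all assumptions chosen for illustration. It counts the MACs of the attention score matrix for global versus windowed attention, and shows how a cyclic shift lets neighbouring windows exchange information. The attention mask that Swin Transformer applies to shifted windows is omitted here for brevity.

```python
import numpy as np


def attention_score_macs(num_frames, dim, window=None):
    """MACs for the QK^T score matrix only (one head, no projections)."""
    if window is None:
        # Global attention: every frame attends to every frame.
        return num_frames * num_frames * dim
    assert num_frames % window == 0, "frames must divide evenly into windows"
    # Local attention: each of T / window windows costs window^2 * dim.
    return (num_frames // window) * window * window * dim


def window_attention(x, window, shift=0):
    """Self-attention restricted to non-overlapping windows of frames.

    x has shape (frames, dim) and is used directly as query, key and value
    (an illustrative simplification). A non-zero shift cyclically rolls the
    frames before partitioning, so the window boundaries move; no mask is
    applied to the wrap-around window (unlike Swin Transformer).
    """
    num_frames, dim = x.shape
    assert num_frames % window == 0, "frames must divide evenly into windows"
    rolled = np.roll(x, -shift, axis=0)
    w = rolled.reshape(num_frames // window, window, dim)
    scores = w @ w.transpose(0, 2, 1) / np.sqrt(dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = (weights @ w).reshape(num_frames, dim)
    return np.roll(out, shift, axis=0)  # undo the cyclic shift


if __name__ == "__main__":
    for frames in (512, 1024):
        print(frames,
              attention_score_macs(frames, 64),
              attention_score_macs(frames, 64, window=8))
```

Doubling the number of frames quadruples the global count but only doubles the windowed count, which is the linear scaling the abstract refers to. Alternating `shift=0` and a non-zero shift between layers is what allows information to cross the fixed window boundaries.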

PUBLICATION RECORD

Publication year
2023
Venue
Electronics
Publication date
2023-03-10
Fields of study
Not labeled
Identifiers
DOI 10.3390/electronics12061330
Source metadata
Semantic Scholar

REFERENCES

FullSubNet+: Channel Attention Fullsubnet with Complex Spectrograms for Speech Enhancement
2022, cited by this paper
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection
2022, cited by this paper
Uformer: A Unet Based Dilated Complex & Real Dual-Path Conformer Network for Simultaneous Speech Enhancement and Dereverberation
2021, cited by this paper
Deepfilternet: A Low Complexity Speech Enhancement Framework for Full-Band Audio Based On Deep Filtering
2021, cited by this paper
DCCRN+: Channel-wise Subband DCCRN with SNR Estimation for Speech Enhancement
2021, cited by this paper
SETransformer: Speech Enhancement Transformer
2021, cited by this paper
S-DCCRN: Super Wide Band DCCRN with Learnable Complex Feature for Speech Enhancement
2021, cited by this paper
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
2021, influential reference
TSTNN: Two-Stage Transformer Based Neural Network for Speech Enhancement in the Time Domain
2021, cited by this paper
Towards Efficient Models for Real-Time Deep Noise Suppression
2021, cited by this paper
Interspeech 2021 Deep Noise Suppression Challenge
2021, cited by this paper
Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression
2020, cited by this paper
Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition
2020, cited by this paper
A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech
2020, cited by this paper
DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement
2020, influential reference
PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network
2019, cited by this paper
Phase-aware Speech Enhancement with Deep Complex U-Net
2019, cited by this paper
Deep Filtering: Signal Extraction and Reconstruction Using Complex Time-Frequency Filters
2019, cited by this paper
TCNN: Temporal Convolutional Neural Network for Real-time Speech Enhancement in the Time Domain
2019, cited by this paper
A Speech Enhancement Neural Network Architecture with SNR-Progressive Multi-Target Learning for Robust Speech Recognition
2019, cited by this paper
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
2018, cited by this paper
Attention is All you Need
2017, cited by this paper
A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement
2017, cited by this paper
Complex Ratio Masking for Monaural Speech Separation
2016, cited by this paper
An Introduction to Convolutional Neural Networks
2015, cited by this paper
A novel speech enhancement method based on constrained low-rank and sparse matrix decomposition
2014, cited by this paper
The voice bank corpus: Design, collection and data analysis of a large regional accent speech database
2013, cited by this paper
The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings
2013, cited by this paper
An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech
2011, cited by this paper
Evaluation of Objective Quality Measures for Speech Enhancement
2008, cited by this paper
Speech Enhancement: Theory and Practice
2007, cited by this paper
Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs
2001, cited by this paper
Speech enhancement based on a priori signal to noise estimation
1996, cited by this paper
Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems
1993, cited by this paper
The Design for the Wall Street Journal-based CSR Corpus
1992, cited by this paper
Suppression of acoustic noise in speech using spectral subtraction
1979, cited by this paper

CITED BY

A robust framework for noisy speech recognition using Frequency-Guided-Swin Transformer
2026, cites this paper
Whisper-Aware Spectro-Transformer U-Net for Emotion-Preserving Multilingual Speech Enhancement
2026, cites this paper
Fast-ULCNet: A fast and ultra low complexity network for single-channel speech enhancement
2026, cites this paper
CTSE-Net: Resource-efficient convolutional and TF-transformer network for speech enhancement
2025, cites this paper
Transformers in speech processing: Overcoming challenges and paving the future
2025, cites this paper
MBTU-SE: A Speech Enhancement Network Integrates Enhanced Taylor Multi-Branch Linear Transformer With U-Net Architecture
2025, cites this paper
DPHT-ANet: Dual-path high-order transformer-style fully attentional network for monaural speech enhancement
2024, cites this paper
A Two-Stage Beamforming and Diffusion-Based Refiner System for 3D Speech Enhancement
2024, cites this paper
Real-Time Audio Noise Reduction and Speech Enhancement Using LadderNet With Hybrid Spectrogram Time-Domain Audio Separation Network
2024, cites this paper
CheapNET: Improving Light-weight speech enhancement network by projected loss function
2023, cites this paper
Transformers in Speech Processing: A Survey
2023, cites this paper