Dual-path Mamba: Short and long-term bidirectional selective structured state space models for speech separation

X Jiang, C Han, N Mesgarani - arXiv preprint arXiv:2403.18257, 2024 - arxiv.org
Transformers have been the most successful architecture for various speech modeling tasks,
including speech separation. However, the self-attention mechanism in transformers with …

ESPnet-Codec: Comprehensive training and evaluation of neural codecs for audio, music, and speech

J Shi, J Tian, Y Wu, J Jung, JQ Yip… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Neural codecs have become crucial to recent speech and audio generation research. In
addition to signal compression capabilities, discrete codecs have also been found to …

MSFNet: Multi-scale fusion network for brain-controlled speaker extraction

C Fan, J Zhang, H Zhang, W Xiang, J Tao, X Li… - Proceedings of the …, 2024 - dl.acm.org
Speaker extraction aims to selectively extract the target speaker from the multi-talker
environment under the guidance of auxiliary reference. Recent studies have shown that the …

TF-Locoformer: Transformer with local modeling by convolution for speech separation and enhancement

K Saijo, G Wichern, FG Germain, Z Pan… - … on Acoustic Signal …, 2024 - ieeexplore.ieee.org
Time-frequency (TF) domain dual-path models achieve high-fidelity speech separation.
While some previous state-of-the-art (SoTA) models rely on RNNs, this reliance means they …

LibriheavyMix: a 20,000-hour dataset for single-channel reverberant multi-talker speech separation, ASR and speaker diarization

Z Jin, Y Yang, M Shi, W Kang, X Yang, Z Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
The evolving speech processing landscape is increasingly focused on complex scenarios
like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions …

Towards audio codec-based speech separation

JQ Yip, S Zhao, D Ng, ES Chng, B Ma - arXiv preprint arXiv:2406.12434, 2024 - arxiv.org
Recent improvements in neural audio codec (NAC) models have generated interest in
adopting pre-trained codecs for a variety of speech processing applications to take …

USEF-TSE: Universal speaker embedding free target speaker extraction

B Zeng, M Li - arXiv preprint arXiv:2409.02615, 2024 - arxiv.org
Target speaker extraction aims to isolate the voice of a specific speaker from mixed speech.
Traditionally, this process has relied on extracting a speaker embedding from a reference …

Separate and reconstruct: Asymmetric encoder-decoder for speech separation

UH Shin, S Lee, T Kim, HM Park - arXiv preprint arXiv:2406.05983, 2024 - arxiv.org
In speech separation, time-domain approaches have successfully replaced the
time-frequency domain with latent sequence feature from a learnable encoder. Conventionally …

Enhanced speech separation through a supervised approach using bidirectional long short-term memory in dual domains

S Basir, MS Hosen, MN Hossain… - Computers and …, 2024 - Elsevier
The process of separating individual sound sources from mono audio is a complex yet
essential endeavor in audio signal processing and analysis. This article presents an …

Early joint learning of emotion information makes multimodal model understand you better

M Ge, M Li, D Tang, P Li, K Liu, S Deng, S Pu… - Proceedings of the 2nd …, 2024 - dl.acm.org
In this paper, we present our solutions for emotion recognition in the sub-challenges of
Multimodal Emotion Recognition Challenge (MER2024). For the tasks MER-SEMI and MER …