USEV: Universal speaker extraction with visual cue

Z Pan, M Ge, H Li - IEEE/ACM Transactions on Audio, Speech …, 2022 - ieeexplore.ieee.org
A speaker extraction algorithm seeks to extract the target speaker's speech from a
multi-talker speech mixture. The prior studies focus mostly on speaker extraction from a highly …

Audio-visual end-to-end multi-channel speech separation, dereverberation and recognition

G Li, J Deng, M Geng, Z Jin, T Wang… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
Accurate recognition of cocktail party speech containing overlapping speakers, noise and
reverberation remains a highly challenging task to date. Motivated by the invariance of …

Unified cross-modal attention: robust audio-visual speech recognition and beyond

J Li, C Li, Y Wu, Y Qian - IEEE/ACM Transactions on Audio …, 2024 - ieeexplore.ieee.org
Audio-Visual Speech Recognition (AVSR) is a promising approach to improving the
accuracy and robustness of speech recognition systems with the assistance of visual cues in …

Mx2M: masked cross-modality modeling in domain adaptation for 3D semantic segmentation

B Zhang, Z Wang, Y Ling, Y Guan, S Zhang… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Existing methods of cross-modal domain adaptation for 3D semantic segmentation predict
results only via 2D-3D complementarity that is obtained by cross-modal feature matching …

Scenario-aware audio-visual TF-GridNet for target speech extraction

Z Pan, G Wichern, Y Masuyama… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Target speech extraction aims to extract, based on a given conditioning cue, a target speech
signal that is corrupted by interfering sources, such as noise or competing speakers …

ImagineNet: Target speaker extraction with intermittent visual cue through embedding inpainting

Z Pan, W Wang, M Borsdorf, H Li - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
The speaker extraction technique seeks to single out the voice of a target speaker from the
interfering voices in a speech mixture. Typically an auxiliary reference of the target speaker …

LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement

A Jain, JS Sanjotra, H Choudhary, K Agrawal… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we propose long short term memory speech enhancement network
(LSTMSE-Net), an audio-visual speech enhancement (AVSE) method. This innovative method …

Efficient audio–visual information fusion using encoding pace synchronization for Audio–Visual Speech Separation

X Xu, W Tu, Y Yang - Information Fusion, 2025 - Elsevier
Contemporary audio–visual speech separation (AVSS) models typically use encoders that
merge audio and visual representations by concatenating them at a specific layer. This …

Deep complex U-Net with Conformer for audio-visual speech enhancement

S Ahmed, CW Chen, W Ren, CJ Li, E Chu… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent studies have increasingly acknowledged the advantages of incorporating visual data
into speech enhancement (SE) systems. In this paper, we introduce a novel audio-visual SE …

MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues

J Li, K Zhang, S Wang, KA Lee, H Li - arXiv preprint arXiv:2412.08247, 2024 - arxiv.org
Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate the speech of a specific
target speaker from an audio mixture using time-synchronized visual cues. In real-world …