An overview of deep-learning-based audio-visual speech enhancement and separation

D Michelsanti, ZH Tan, SX Zhang, Y Xu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …

Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification

A Moufidi, D Rousseau, P Rasti - Sensors, 2023 - mdpi.com
Multimodal deep learning, in the context of biometrics, encounters significant challenges
due to the dependence on long speech utterances and RGB images, which are often …

A Comprehensive Review of Recent Advances in Deep Neural Networks for Lipreading with Sign Language Recognition

N Rathipriya, N Maheswari - IEEE Access, 2024 - ieeexplore.ieee.org
Lip reading is a form of “listening” to people that happens visually. It's also referred to as
“Speech reading.” This is done by observing the speaker's face and listening to the spoken …

Multimodal integration for large-vocabulary audio-visual speech recognition

W Yu, S Zeiler, D Kolossa - 2020 28th European Signal …, 2021 - ieeexplore.ieee.org
For many small- and medium-vocabulary tasks, audio-visual speech recognition can
significantly improve the recognition rates compared to audio-only systems. However, there …

Audiovisual speaker tracking using nonlinear dynamical systems with dynamic stream weights

C Schymura, D Kolossa - IEEE/ACM Transactions on Audio …, 2020 - ieeexplore.ieee.org
Data fusion plays an important role in many technical applications that require efficient
processing of multimodal sensory observations. A prominent example is audiovisual signal …

A dynamic stream weight backprop Kalman filter for audiovisual speaker tracking

C Schymura, T Ochiai, M Delcroix… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
Audiovisual speaker tracking is an application that has been tackled by a wide range of
classical approaches based on Gaussian filters, most notably the well-known Kalman filter …

Extending linear dynamical systems with dynamic stream weights for audiovisual speaker localization

C Schymura, T Isenberg… - 2018 16th International …, 2018 - ieeexplore.ieee.org
An important aspect of audiovisual speaker localization is the appropriate fusion of acoustic
and visual observations based on their time-varying reliability. In this study, a framework …

Learning dynamic stream weights for linear dynamical systems using natural evolution strategies

C Schymura, D Kolossa - ICASSP 2019-2019 IEEE …, 2019 - ieeexplore.ieee.org
Multimodal data fusion is an important aspect of many object localization and tracking
frameworks that rely on sensory observations from different sources. A prominent example is …

Data fusion for audiovisual speaker localization: Extending dynamic stream weights to the spatial domain

J Wissing, B Boenninghoff, D Kolossa… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech
recognition or speaker diarization. Both applications benefit from a known speaker position …

Machine Learning-Based Multimodal integration for Short Utterance-Based Biometrics Identification and Engagement Detection

A Moufidi - 2024 - theses.hal.science
The rapid advancement and democratization of technology have led to an abundance of
sensors. Consequently, the integration of these diverse modalities presents an advantage …