An overview of deep-learning-based audio-visual speech enhancement and separation
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …
Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification
Multimodal deep learning, in the context of biometrics, encounters significant challenges
due to the dependence on long speech utterances and RGB images, which are often …
A Comprehensive Review of Recent Advances in Deep Neural Networks for Lipreading with Sign Language Recognition
Lip reading is a form of “listening” to people that happens visually. It's also referred to as
“Speech reading.” This is done by observing the speaker's face and listening to the spoken …
Multimodal integration for large-vocabulary audio-visual speech recognition
For many small- and medium-vocabulary tasks, audio-visual speech recognition can
significantly improve the recognition rates compared to audio-only systems. However, there …
Audiovisual speaker tracking using nonlinear dynamical systems with dynamic stream weights
Data fusion plays an important role in many technical applications that require efficient
processing of multimodal sensory observations. A prominent example is audiovisual signal …
A dynamic stream weight backprop Kalman filter for audiovisual speaker tracking
Audiovisual speaker tracking is an application that has been tackled by a wide range of
classical approaches based on Gaussian filters, most notably the well-known Kalman filter …
Extending linear dynamical systems with dynamic stream weights for audiovisual speaker localization
C Schymura, T Isenberg… - 2018 16th International …, 2018 - ieeexplore.ieee.org
An important aspect of audiovisual speaker localization is the appropriate fusion of acoustic
and visual observations based on their time-varying reliability. In this study, a framework …
Learning dynamic stream weights for linear dynamical systems using natural evolution strategies
Multimodal data fusion is an important aspect of many object localization and tracking
frameworks that rely on sensory observations from different sources. A prominent example is …
Data fusion for audiovisual speaker localization: Extending dynamic stream weights to the spatial domain
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech
recognition or speaker diarization. Both applications benefit from a known speaker position …
Machine Learning-Based Multimodal integration for Short Utterance-Based Biometrics Identification and Engagement Detection
A Moufidi - 2024 - theses.hal.science
The rapid advancement and democratization of technology have led to an abundance of
sensors. Consequently, the integration of these diverse modalities presents an advantage …