Audio-visual speech and gesture recognition by sensors of mobile devices

D Ryumin, D Ivanko, E Ryumina - Sensors, 2023 - mdpi.com
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable
speech recognition, particularly when audio is corrupted by noise. Additional visual …

Formalizing multimedia recommendation through multimodal deep learning

D Malitesta, G Cornacchia, C Pomo, FA Merra… - ACM Transactions on …, 2024 - dl.acm.org
Recommender systems (RSs) provide customers with a personalized navigation experience
within the vast catalogs of products and services offered on popular online platforms …

Training strategies to handle missing modalities for audio-visual expression recognition

S Parthasarathy, S Sundaram - … of the 2020 International Conference on …, 2020 - dl.acm.org
Automatic audio-visual expression recognition can play an important role in communication
services such as tele-health, VOIP calls and human-machine interaction. Accuracy of audio …

LocalTrans: A multiscale local transformer network for cross-resolution homography estimation

R Shao, G Wu, Y Zhou, Y Fu… - Proceedings of the …, 2021 - openaccess.thecvf.com
Cross-resolution image alignment is a key problem in multiscale gigapixel photography,
which requires to estimate homography matrix using images with large resolution gap …

MMLatch: Bottom-up top-down fusion for multimodal sentiment analysis

G Paraskevopoulos, E Georgiou… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Current deep learning approaches for multimodal fusion rely on bottom-up fusion of high
and mid-level latent modality representations (late/mid fusion) or low level sensory inputs …

Robust audiovisual emotion recognition: Aligning modalities, capturing temporal information, and handling missing features

L Goncalves, C Busso - IEEE Transactions on Affective …, 2022 - ieeexplore.ieee.org
Emotion recognition using audiovisual features is a challenging task for human-machine
interaction systems. Under ideal conditions (perfect illumination, clean speech signals, and …

AVFormer: Injecting vision into frozen speech models for zero-shot AV-ASR

PH Seo, A Nagrani, C Schmid - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a
speech recognition system by incorporating visual information. Training fully supervised …

Self-supervised learning with cross-modal transformers for emotion recognition

A Khare, S Parthasarathy… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org
Emotion recognition is a challenging task due to limited availability of in-the-wild labeled
datasets. Self-supervised learning has shown improvements on tasks with limited labeled …

Auditory attention detection via cross-modal attention

S Cai, P Li, E Su, L Xie - Frontiers in Neuroscience, 2021 - frontiersin.org
Humans show a remarkable perceptual ability to select the speech stream of interest among
multiple competing speakers. Previous studies demonstrated that auditory attention …

ASR-aware end-to-end neural diarization

A Khare, E Han, Y Yang… - ICASSP 2022-2022 IEEE …, 2022 - ieeexplore.ieee.org
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both
acoustic input and features derived from an automatic speech recognition (ASR) model. Two …