- Academic Search

Audio-visual speech and gesture recognition by sensors of mobile devices

D Ryumin, D Ivanko, E Ryumina - Sensors, 2023 - mdpi.com

Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable
speech recognition, particularly when audio is corrupted by noise. Additional visual …

Cited by 73

[PDF] acm.org

Formalizing multimedia recommendation through multimodal deep learning

D Malitesta, G Cornacchia, C Pomo, FA Merra… - ACM Transactions on …, 2024 - dl.acm.org

Recommender systems (RSs) provide customers with a personalized navigation experience
within the vast catalogs of products and services offered on popular online platforms …

Cited by 14

[PDF] arxiv.org

Training strategies to handle missing modalities for audio-visual expression recognition

S Parthasarathy, S Sundaram - … of the 2020 International Conference on …, 2020 - dl.acm.org

Automatic audio-visual expression recognition can play an important role in communication
services such as tele-health, VOIP calls and human-machine interaction. Accuracy of audio …

Cited by 80

[PDF] thecvf.com

Localtrans: A multiscale local transformer network for cross-resolution homography estimation

R Shao, G Wu, Y Zhou, Y Fu… - Proceedings of the …, 2021 - openaccess.thecvf.com

Cross-resolution image alignment is a key problem in multiscale gigapixel photography,
which requires to estimate homography matrix using images with large resolution gap …

Cited by 50

[PDF] arxiv.org

Mmlatch: Bottom-up top-down fusion for multimodal sentiment analysis

G Paraskevopoulos, E Georgiou… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org

Current deep learning approaches for multimodal fusion rely on bottom-up fusion of high
and mid-level latent modality representations (late/mid fusion) or low level sensory inputs …

Cited by 43

[PDF] ieee.org

Robust audiovisual emotion recognition: Aligning modalities, capturing temporal information, and handling missing features

L Goncalves, C Busso - IEEE Transactions on Affective …, 2022 - ieeexplore.ieee.org

Emotion recognition using audiovisual features is a challenging task for human-machine
interaction systems. Under ideal conditions (perfect illumination, clean speech signals, and …

Cited by 26

[PDF] thecvf.com

Avformer: Injecting vision into frozen speech models for zero-shot av-asr

PH Seo, A Nagrani, C Schmid - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a
speech recognition system by incorporating visual information. Training fully supervised …

Cited by 10

[PDF] arxiv.org

Self-supervised learning with cross-modal transformers for emotion recognition

A Khare, S Parthasarathy… - 2021 IEEE spoken …, 2021 - ieeexplore.ieee.org

Emotion recognition is a challenging task due to limited availability of in-the-wild labeled
datasets. Self-supervised learning has shown improvements on tasks with limited labeled …

Cited by 55

[PDF] frontiersin.org

Auditory attention detection via cross-modal attention

S Cai, P Li, E Su, L Xie - Frontiers in neuroscience, 2021 - frontiersin.org

Humans show a remarkable perceptual ability to select the speech stream of interest among
multiple competing speakers. Previous studies demonstrated that auditory attention …

Cited by 27

[PDF] arxiv.org

ASR-aware end-to-end neural diarization

A Khare, E Han, Y Yang… - ICASSP 2022-2022 IEEE …, 2022 - ieeexplore.ieee.org

We present a Conformer-based end-to-end neural diarization (EEND) model that uses both
acoustic input and features derived from an automatic speech recognition (ASR) model. Two …

Cited by 23
