Prompting the hidden talent of web-scale speech models for zero-shot task generalization

P Peng, B Yan, S Watanabe, D Harwath - arXiv preprint arXiv:2305.11095, 2023 - arxiv.org
We investigate the emergent abilities of the recently proposed web-scale speech model
Whisper, by adapting it to unseen tasks with prompt engineering. We selected three tasks …

An outlook into the future of egocentric vision

C Plizzari, G Goletto, A Furnari, S Bansal… - International Journal of …, 2024 - Springer
What will the future be? We wonder! In this survey, we explore the gap between current
research in egocentric vision and the ever-anticipated future, where wearable computing …

AVFormer: Injecting vision into frozen speech models for zero-shot AV-ASR

PH Seo, A Nagrani, C Schmid - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a
speech recognition system by incorporating visual information. Training fully supervised …

SlideSpeech: A Large Scale Slide-Enriched Audio-Visual Corpus

H Wang, F Yu, X Shi, Y Wang… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Multi-modal automatic speech recognition (ASR) techniques aim to leverage additional
modalities to improve the performance of speech recognition systems. While existing …

Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond

J Li, C Li, Y Wu, Y Qian - IEEE/ACM Transactions on Audio …, 2024 - ieeexplore.ieee.org
Audio-Visual Speech Recognition (AVSR) is a promising approach to improving the
accuracy and robustness of speech recognition systems with the assistance of visual cues in …

Character-aware audio-visual subtitling in context

J Huh, A Zisserman - … of the Asian Conference on Computer …, 2024 - openaccess.thecvf.com
This paper presents an improved framework for character-aware audio-visual subtitling in
TV shows. Our approach integrates speech recognition, speaker diarisation, and character …

SynesLM: A unified approach for audio-visual speech recognition and translation via language model and synthetic data

Y Lu, J Song, X Chang, H Bian, S Maiti… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we present SynesLM, a unified model that can perform three multimodal
language understanding tasks: audio-visual automatic speech recognition (AV-ASR) and …

Robust Audiovisual Speech Recognition Models with Mixture-of-Experts

Y Wu, Y Peng, Y Lu, X Chang, R Song… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Visual signals can enhance audiovisual speech recognition accuracy by providing
additional contextual information. Given the complexity of visual signals, an audiovisual …

AVATAR submission to the Ego4D AV Transcription Challenge

PH Seo, A Nagrani, C Schmid - arXiv preprint arXiv:2211.09966, 2022 - arxiv.org
In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech
Transcription Challenge 2022. Our pipeline is based on AVATAR, a state-of-the-art encoder …

Multi-Modal Learning for Video Understanding

V Gabeur - 2022 - theses.hal.science
With the ever-increasing consumption of audio-visual media on the internet, video
understanding has become an important problem in order to provide users with the right …