Prompting the hidden talent of web-scale speech models for zero-shot task generalization
We investigate the emergent abilities of the recently proposed web-scale speech model
Whisper, by adapting it to unseen tasks with prompt engineering. We selected three tasks …
An outlook into the future of egocentric vision
What will the future be? We wonder! In this survey, we explore the gap between current
research in egocentric vision and the ever-anticipated future, where wearable computing …
AVFormer: Injecting vision into frozen speech models for zero-shot AV-ASR
Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a
speech recognition system by incorporating visual information. Training fully supervised …
SlideSpeech: A Large Scale Slide-Enriched Audio-Visual Corpus
Multi-Modal automatic speech recognition (ASR) techniques aim to leverage additional
modalities to improve the performance of speech recognition systems. While existing …
Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond
Audio-Visual Speech Recognition (AVSR) is a promising approach to improving the
accuracy and robustness of speech recognition systems with the assistance of visual cues in …
Character-aware audio-visual subtitling in context
This paper presents an improved framework for character-aware audio-visual subtitling in
TV shows. Our approach integrates speech recognition, speaker diarisation, and character …
SynesLM: A unified approach for audio-visual speech recognition and translation via language model and synthetic data
In this work, we present SynesLM, a unified model which can perform three multimodal
language understanding tasks: audio-visual automatic speech recognition (AV-ASR) and …
Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
Visual signals can enhance audiovisual speech recognition accuracy by providing
additional contextual information. Given the complexity of visual signals, an audiovisual …
AVATAR submission to the Ego4D AV Transcription Challenge
In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech
Transcription Challenge 2022. Our pipeline is based on AVATAR, a state-of-the-art encoder …
Multi-Modal Learning for Video Understanding
V Gabeur - 2022 - theses.hal.science
With the ever-increasing consumption of audio-visual media on the internet, video
understanding has become an important problem in order to provide users with the right …