Prompting the hidden talent of web-scale speech models for zero-shot task generalization

P Peng, B Yan, S Watanabe, D Harwath - arXiv preprint arXiv:2305.11095, 2023 - arxiv.org
We investigate the emergent abilities of the recently proposed web-scale speech model
Whisper, by adapting it to unseen tasks with prompt engineering. We selected three tasks …

An outlook into the future of egocentric vision

C Plizzari, G Goletto, A Furnari, S Bansal… - International Journal of …, 2024 - Springer
What will the future be? We wonder! In this survey, we explore the gap between current
research in egocentric vision and the ever-anticipated future, where wearable computing …

AVFormer: Injecting vision into frozen speech models for zero-shot AV-ASR

PH Seo, A Nagrani, C Schmid - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a
speech recognition system by incorporating visual information. Training fully supervised …

SlideSpeech: A Large Scale Slide-Enriched Audio-Visual Corpus

H Wang, F Yu, X Shi, Y Wang… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Multi-modal automatic speech recognition (ASR) techniques aim to leverage additional
modalities to improve the performance of speech recognition systems. While existing …

Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond

J Li, C Li, Y Wu, Y Qian - IEEE/ACM Transactions on Audio …, 2024 - ieeexplore.ieee.org
Audio-Visual Speech Recognition (AVSR) is a promising approach to improving the
accuracy and robustness of speech recognition systems with the assistance of visual cues in …

Character-aware audio-visual subtitling in context

J Huh, A Zisserman - … of the Asian Conference on Computer …, 2024 - openaccess.thecvf.com
This paper presents an improved framework for character-aware audio-visual subtitling in
TV shows. Our approach integrates speech recognition, speaker diarisation, and character …

SynesLM: A unified approach for audio-visual speech recognition and translation via language model and synthetic data

Y Lu, J Song, X Chang, H Bian, S Maiti… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we present SynesLM, a unified model that can perform three multimodal
language understanding tasks: audio-visual automatic speech recognition (AV-ASR) and …

Robust Audiovisual Speech Recognition Models with Mixture-of-Experts

Y Wu, Y Peng, Y Lu, X Chang, R Song… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Visual signals can enhance audiovisual speech recognition accuracy by providing
additional contextual information. Given the complexity of visual signals, an audiovisual …

AVATAR submission to the Ego4D AV Transcription Challenge

PH Seo, A Nagrani, C Schmid - arXiv preprint arXiv:2211.09966, 2022 - arxiv.org
In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech
Transcription Challenge 2022. Our pipeline is based on AVATAR, a state-of-the-art encoder …

Multi-Modal Learning for Video Understanding

V Gabeur - 2022 - theses.hal.science
With the ever-increasing consumption of audio-visual media on the internet, video
understanding has become an important problem in order to provide users with the right …