- Academic Search

AutoAD II: The sequel - who, when, and what in movie audio description

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com

Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …

Cited by: 39

VidChapters-7M: Video chapters at scale

A Yang, A Nagrani, I Laptev, J Sivic… - Advances in Neural …, 2024 - proceedings.neurips.cc

Segmenting untrimmed videos into chapters enables users to quickly navigate to the
information of their interest. This important topic has been understudied due to the lack of …

Cited by: 28

AutoAD: Movie description in context

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com

The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …

Cited by: 58

CinePile: A long video question answering dataset and benchmark

R Rawal, K Saifullah, M Farré, R Basri… - ar…

… powerful foundation
models, driven by large-scale multimodal datasets. However, for audio representation …

Cited by: 22

Automatic Speech Recognition: A survey of deep learning techniques and approaches

H Ahlawat, N Aggarwal, D Gupta - International Journal of Cognitive …, 2025 - Elsevier

Significant research has been conducted during the last decade on the application of
machine learning for speech processing, particularly speech recognition. However, in recent …

Cited by: 1

MM-Narrator: Narrating long-form videos with multimodal in-context learning

C Zhang, K Lin, Z Yang, J Wang, L Li… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context
learning for the generation of audio descriptions (AD). Unlike previous methods that …

Cited by: 19

Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation

H He, Z Shang, C Wang, X Li, Y Gu… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org

Recent advancements in speech generation models have been significantly driven by the
use of large-scale training data. However, producing highly spontaneous, human-like …

Cited by: 21

PG-Video-LLaVA: Pixel grounding large video-language models

S Munasinghe, R Thushara, M Maaz… - arXiv preprint arXiv …, 2023 - arxiv.org

Extending image-based Large Multimodal Models (LMM) to videos is challenging due to the
inherent complexity of video data. The recent approaches extending image-based LMM to …

Cited by: 28

AutoAD-Zero: A training-free framework for zero-shot audio description

J Xie, T Han, M Bain, A Nagrani… - Proceedings of the …, 2024 - openaccess.thecvf.com

Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a
training-free manner. We use the power of off-the-shelf Video Language Models (VLMs) and …

Cited by: 5
