AutoAD II: The sequel - who, when, and what in movie audio description

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …

VidChapters-7M: Video chapters at scale

A Yang, A Nagrani, I Laptev, J Sivic… - Advances in Neural …, 2024 - proceedings.neurips.cc
Segmenting untrimmed videos into chapters enables users to quickly navigate to the
information of their interest. This important topic has been understudied due to the lack of …

AutoAD: Movie description in context

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …

CinePile: A long video question answering dataset and benchmark

R Rawal, K Saifullah, M Farré, R Basri… - ar…
… powerful foundation models, driven by large-scale multimodal datasets. However, for audio representation …

Automatic Speech Recognition: A survey of deep learning techniques and approaches

H Ahlawat, N Aggarwal, D Gupta - International Journal of Cognitive …, 2025 - Elsevier
Significant research has been conducted during the last decade on the application of
machine learning for speech processing, particularly speech recognition. However, in recent …

MM-Narrator: Narrating long-form videos with multimodal in-context learning

C Zhang, K Lin, Z Yang, J Wang, L Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context
learning for the generation of audio descriptions (AD). Unlike previous methods that …

Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation

H He, Z Shang, C Wang, X Li, Y Gu… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Recent advancements in speech generation models have been significantly driven by the
use of large-scale training data. However, producing highly spontaneous, human-like …

PG-Video-LLaVA: Pixel grounding large video-language models

S Munasinghe, R Thushara, M Maaz… - arxiv preprint arxiv …, 2023 - arxiv.org
Extending image-based Large Multimodal Models (LMM) to videos is challenging due to the
inherent complexity of video data. The recent approaches extending image-based LMM to …

AutoAD-Zero: A training-free framework for zero-shot audio description

J Xie, T Han, M Bain, A Nagrani… - Proceedings of the …, 2024 - openaccess.thecvf.com
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a
training-free manner. We use the power of off-the-shelf Video Language Models (VLMs) and …