AutoAD II: The sequel - who, when, and what in movie audio description
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …
VidChapters-7M: Video chapters at scale
Segmenting untrimmed videos into chapters enables users to quickly navigate to the
information of their interest. This important topic has been understudied due to the lack of …
AutoAD: Movie description in context
The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …
CinePile: A long video question answering dataset and benchmark
R Rawal, K Saifullah, M Farré, R Basri… - ar** powerful foundation
models, driven by large-scale multimodal datasets. However, for audio representation …
Automatic Speech Recognition: A survey of deep learning techniques and approaches
H Ahlawat, N Aggarwal, D Gupta - International Journal of Cognitive …, 2025 - Elsevier
Significant research has been conducted during the last decade on the application of
machine learning for speech processing, particularly speech recognition. However, in recent …
MM-Narrator: Narrating long-form videos with multimodal in-context learning
We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context
learning for the generation of audio descriptions (AD). Unlike previous methods that …
Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation
Recent advancements in speech generation models have been significantly driven by the
use of large-scale training data. However, producing highly spontaneous, human-like …
PG-Video-LLaVA: Pixel grounding large video-language models
Extending image-based Large Multimodal Models (LMM) to videos is challenging due to the
inherent complexity of video data. The recent approaches extending image-based LMM to …
AutoAD-Zero: A training-free framework for zero-shot audio description
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a
training-free manner. We use the power of off-the-shelf Video Language Models (VLMs) and …