Video description: A survey of methods, datasets, and evaluation metrics
Video description is the automatic generation of natural language sentences that describe
the contents of a given video. It has applications in human-robot interaction, helping the …
Multimodal research in vision and language: A review of current and emerging trends
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …
Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …
End-to-end generative pretraining for multimodal video captioning
Recent video and language pretraining frameworks lack the ability to generate sentences.
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …
End-to-end dense video captioning with parallel decoding
Dense video captioning aims to generate multiple associated captions with their temporal
locations from the video. Previous methods follow a sophisticated "localize-then-describe" …
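The entry above frames dense video captioning as producing several captions paired with their temporal locations. As a minimal sketch of that output structure (hypothetical field names, not taken from the paper's interface):

```python
from dataclasses import dataclass

@dataclass
class DenseCaption:
    """One temporally localized caption for a video segment."""
    start_sec: float   # segment start time in seconds
    end_sec: float     # segment end time in seconds
    sentence: str      # natural language description of the segment

# Hypothetical output for a cooking video: a dense captioner returns
# several (interval, sentence) pairs covering different events.
predictions = [
    DenseCaption(0.0, 12.5, "A person chops vegetables on a cutting board."),
    DenseCaption(12.5, 30.0, "The vegetables are added to a pan and stirred."),
]

for p in predictions:
    print(f"[{p.start_sec:.1f}s - {p.end_sec:.1f}s] {p.sentence}")
```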
AutoAD II: The sequel - who, when, and what in movie audio description
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …
Self-supervised video representation learning by pace prediction
This paper addresses the problem of self-supervised video representation learning from a
new perspective: by video pace prediction. It stems from the observation that human visual …
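As a rough illustration of the pace-prediction pretext task named in the title above (a hedged sketch with assumed parameter names, not the paper's exact sampling scheme): clips can be subsampled from a video at different frame strides, and a network is then trained to classify the pace of each clip.

```python
import numpy as np

def sample_clip_with_pace(video_frames, clip_len=16, paces=(1, 2, 4)):
    """Subsample a clip at a randomly chosen pace (frame stride).

    Returns the clip and the index of the chosen pace, which serves as
    the self-supervised classification label. Parameter names are
    assumptions for illustration; the original sampling details may differ.
    """
    pace_label = np.random.randint(len(paces))
    stride = paces[pace_label]
    max_start = len(video_frames) - clip_len * stride
    start = np.random.randint(max(1, max_start))
    indices = start + stride * np.arange(clip_len)
    return video_frames[indices], pace_label

# Toy example: a "video" of 300 frames, each a 112x112 RGB image.
video = np.zeros((300, 112, 112, 3), dtype=np.uint8)
clip, label = sample_clip_with_pace(video)
print(clip.shape, "pace class:", label)   # (16, 112, 112, 3) and a class in {0, 1, 2}
```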
MirrorGAN: Learning text-to-image generation by redescription
Generating an image from a given text description has two goals: visual realism and
semantic consistency. Although significant progress has been made in generating high …
Attention, please! A survey of neural attention models in deep learning
A de Santana Correia, EL Colombini - Artificial Intelligence Review, 2022 - Springer
In humans, Attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …
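To make the "select, modulate" phrasing above concrete, the sketch below implements plain scaled dot-product attention in NumPy, the building block that most of the surveyed neural attention models vary; it is an illustrative example, not code from the survey.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight the value vectors V by how well queries Q match keys K.

    Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v).
    Returns the attended output (n_queries, d_v) and the attention weights.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax: selects among competing sources
    return weights @ V, weights

# Toy example: 2 queries attending over 4 key/value pairs.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 16))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)   # (2, 16) (2, 4)
```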
VTimeLLM: Empower LLM to grasp video moments
Large language models (LLMs) have shown remarkable text understanding capabilities
which have been extended as Video LLMs to handle video data for comprehending visual …