Autoad: Movie description in context

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …

AAP-MIT: Attentive atrous pyramid network and memory incorporated transformer for multisentence video description

J Prudviraj, MI Reddy, C Vishnu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Generating multi-sentence descriptions for video is considered to be the most complex task
in computer vision and natural language understanding due to the intricate nature of video …

Trends in integration of vision and language research: A survey of tasks, datasets, and methods

A Mogadala, M Kalimuthu, D Klakow - Journal of Artificial Intelligence …, 2021 - jair.org
Abstract Interest in Artificial Intelligence (AI) and its applications has seen unprecedented
growth in the last few years. This success can be partly attributed to the advancements made …

Lmeye: An interactive perception network for large language models

Y Li, B Hu, X Chen, L Ma, Y Xu… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Current efficient approaches to building Multimodal Large Language Models (MLLMs)
mainly incorporate visual information into LLMs with a simple visual mapping network such …

Compute to tell the tale: Goal-driven narrative generation

Y Wong, S Fan, Y Guo, Z Xu, K Stephen… - Proceedings of the 30th …, 2022 - dl.acm.org
Man is by nature a social animal. One important facet of human evolution is through
narrative imagination, be it fictional or factual, and to tell the tale to other individuals. The …

Unified adaptive relevance distinguishable attention network for image-text matching

K Zhang, Z Mao, AA Liu, Y Zhang - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Image-text matching, as a fundamental cross-modal task, bridges the gap between vision
and language. The core is to accurately learn semantic alignment to find relevant shared …

What makes a good story and how can we measure it? a comprehensive survey of story evaluation

D Yang, Q Jin - arxiv preprint arxiv:2408.14622, 2024 - arxiv.org
With the development of artificial intelligence, particularly the success of Large Language
Models (LLMs), the quantity and quality of automatically generated stories have significantly …

Shot2story20k: A new benchmark for comprehensive understanding of multi-shot videos

M Han, L Yang, X Chang, H Wang - arxiv preprint arxiv:2312.10300, 2023 - arxiv.org
A short clip of video may contain progression of multiple events and an interesting story line.
A human needs to capture both the event in every shot and associate them together to …

Image retrieval from contextual descriptions

B Krojer, V Adlakha, V Vineet, Y Goyal, E Ponti… - arxiv preprint arxiv …, 2022 - arxiv.org
The ability to integrate context, including perceptual and temporal cues, plays a pivotal role
in grounding the meaning of a linguistic utterance. In order to measure to what extent current …

Image difference captioning with instance-level fine-grained feature representation

Q Huang, Y Liang, J Wei, Y Cai, H Liang… - IEEE transactions on …, 2021 - ieeexplore.ieee.org
The task of image difference captioning aims at locating changed objects in similar image
pairs and describing the difference with natural language. The key challenges of this task …