Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

Frozen in time: A joint video and image encoder for end-to-end retrieval

M Bain, A Nagrani, G Varol… - Proceedings of the …, 2021 - openaccess.thecvf.com
Our objective in this work is video-text retrieval, in particular a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …

AutoAD II: The sequel - who, when, and what in movie audio description

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …

Long-form video-language pre-training with multimodal temporal contrastive learning

Y Sun, H Xue, R Song, B Liu… - Advances in neural …, 2022 - proceedings.neurips.cc
Large-scale video-language pre-training has shown significant improvement in
video-language understanding tasks. Previous studies of video-language pretraining mainly focus …

AutoAD: Movie description in context

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …

Towards long-form video understanding

CY Wu, P Krahenbuhl - … of the IEEE/CVF Conference on …, 2021 - openaccess.thecvf.com
Our world offers a never-ending stream of visual stimuli, yet today's vision systems only
accurately recognize patterns within a few seconds. These systems understand the present …

Learning audio-video modalities from image captions

A Nagrani, PH Seo, B Seybold, A Hauth… - … on Computer Vision, 2022 - Springer
There has been a recent explosion of large-scale image-text datasets, as images with
alt-text captions can be easily obtained online. Obtaining large-scale, high quality data for video …

Long movie clip classification with state-space video models

MM Islam, G Bertasius - European Conference on Computer Vision, 2022 - Springer
Most modern video recognition models are designed to operate on short video clips (e.g.,
5–10 s in length). Thus, it is challenging to apply such models to long movie understanding …

Computational media intelligence: Human-centered machine analysis of media

K Somandepalli, T Guha, VR Martinez… - Proceedings of the …, 2021 - ieeexplore.ieee.org
Media is created by humans for humans to tell stories. There exists a natural and imminent
need for creating human-centered media analytics to illuminate the stories being told and to …

A CLIP-hitchhiker's guide to long video retrieval

M Bain, A Nagrani, G Varol, A Zisserman - arXiv preprint arXiv:2205.08508, 2022 - arxiv.org
Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent
works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP …