Knowledge graphs meet multi-modal learning: A comprehensive survey
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …
Frozen in time: A joint video and image encoder for end-to-end retrieval
Our objective in this work is video-text retrieval, in particular a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …
AutoAD II: The sequel - who, when, and what in movie audio description
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …
Long-form video-language pre-training with multimodal temporal contrastive learning
Large-scale video-language pre-training has shown significant improvement in video-
language understanding tasks. Previous studies of video-language pretraining mainly focus …
AutoAD: Movie description in context
The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …
Towards long-form video understanding
Our world offers a never-ending stream of visual stimuli, yet today's vision systems only
accurately recognize patterns within a few seconds. These systems understand the present …
Learning audio-video modalities from image captions
There has been a recent explosion of large-scale image-text datasets, as images with alt-
text captions can be easily obtained online. Obtaining large-scale, high quality data for video …
Long movie clip classification with state-space video models
Most modern video recognition models are designed to operate on short video clips (e.g., 5–
10 s in length). Thus, it is challenging to apply such models to long movie understanding …
Computational media intelligence: Human-centered machine analysis of media
Media is created by humans for humans to tell stories. There exists a natural and imminent
need for creating human-centered media analytics to illuminate the stories being told and to …
A CLIP-hitchhiker's guide to long video retrieval
Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent
works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP …