Knowledge graphs meet multi-modal learning: A comprehensive survey
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …
semantic web community's exploration into multi-modal dimensions unlocking new avenues …
A survey on multimodal large language models
Multimodal Large Language Model (MLLM) recently has been a new rising research
hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform …
hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform …
Egoschema: A diagnostic benchmark for very long-form video language understanding
We introduce EgoSchema, a very long-form video question-answering dataset, and
benchmark to evaluate long video understanding capabilities of modern vision and …
benchmark to evaluate long video understanding capabilities of modern vision and …
Llama-vid: An image is worth 2 tokens in large language models
In this work, we present a novel method to tackle the token generation challenge in Vision
Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current …
Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current …
Moviechat: From dense token to sparse memory for long video understanding
Recently integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …
Mvbench: A comprehensive multi-modal video understanding benchmark
With the rapid development of Multi-modal Large Language Models (MLLMs) a number of
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …
Autoad ii: The sequel-who, when, and what in movie audio description
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …
time intervals, for the benefit of visually impaired audiences. For movies, this presents …
Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
content, the demand for proficient video understanding tools has intensified markedly. Given …
Distribution-balanced loss for multi-label classification in long-tailed datasets
We present a new loss function called Distribution-Balanced Loss for the multi-label
recognition problems that exhibit long-tailed class distributions. Compared to conventional …
recognition problems that exhibit long-tailed class distributions. Compared to conventional …
Towards long-form video understanding
Our world offers a never-ending stream of visual stimuli, yet today's vision systems only
accurately recognize patterns within a few seconds. These systems understand the present …
accurately recognize patterns within a few seconds. These systems understand the present …