Long context transfer from language to vision
Video sequences offer valuable temporal information, but existing large multimodal models
(LMMs) fall short in understanding extremely long videos. Many works address this by …
VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation
Recent years have witnessed great advances in video generation. However, the
development of automatic video metrics is lagging significantly behind. None of the existing …
MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …
FullAnno: A data engine for enhancing image comprehension of MLLMs
Multimodal Large Language Models (MLLMs) have shown promise in a broad range of
vision-language tasks with their strong reasoning and generalization capabilities. However …
Caption-Aware Multimodal Relation Extraction with Mutual Information Maximization
Z Zhang, W Zhang, Y Li, T Bai - Proceedings of the 32nd ACM …, 2024 - dl.acm.org
Multimodal Relation Extraction (MRE) has seen great improvements. However, modern
MRE models are easily affected by irrelevant objects during multimodal alignment, which are …
From seconds to hours: Reviewing multimodal large language models on comprehensive long video understanding
The integration of Large Language Models (LLMs) with visual encoders has recently shown
promising performance in visual understanding tasks, leveraging their inherent capability to …
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Large Multimodal Models (LMMs) have achieved significant breakthroughs in various vision-
language and vision-centric tasks based on auto-regressive modeling. However, these …