Long context transfer from language to vision

P Zhang, K Zhang, B Li, G Zeng, J Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Video sequences offer valuable temporal information, but existing large multimodal models
(LMMs) fall short in understanding extremely long videos. Many works address this by …

VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation

X He, D Jiang, G Zhang, M Ku, A Soni, S Siu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years have witnessed great advances in video generation. However, the
development of automatic video metrics lags significantly behind. None of the existing …

MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning

H Zhang, M Gao, Z Gan, P Dufter, N Wenzel… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …

FullAnno: A data engine for enhancing image comprehension of MLLMs

J Hao, Y Zhao, S Chen, Y Sun, Q Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have shown promise in a broad range of
vision-language tasks with their strong reasoning and generalization capabilities. However …

Caption-Aware Multimodal Relation Extraction with Mutual Information Maximization

Z Zhang, W Zhang, Y Li, T Bai - Proceedings of the 32nd ACM …, 2024 - dl.acm.org
Multimodal Relation Extraction (MRE) has achieved great improvements. However, modern
MRE models are easily affected by irrelevant objects during multimodal alignment, which are …

From seconds to hours: Reviewing multimodal large language models on comprehensive long video understanding

H Zou, T Luo, G Xie, F Lv, G Wang, J Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The integration of Large Language Models (LLMs) with visual encoders has recently shown
promising performance in visual understanding tasks, leveraging their inherent capability to …

Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models

Y Zhan, H Zhao, Y Zhu, F Yang, M Tang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Multimodal Models (LMMs) have achieved significant breakthroughs in various vision-
language and vision-centric tasks based on auto-regressive modeling. However, these …