mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities
in executing instructions for a variety of single-image tasks. Despite this progress, significant …
A survey on evaluation of multimodal large language models
J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …
Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance
Multi-modal large language models (MLLMs) have demonstrated impressive performance in
vision-language tasks across a wide range of domains. However, the large model scale and …
Enhancing the reasoning ability of multimodal large language models via mixed preference optimization
Existing open-source multimodal large language models (MLLMs) generally follow a
training process involving pre-training and supervised fine-tuning. However, these models …
A survey on multimodal benchmarks: In the era of large ai models
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …
Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
Large Multimodal Models (LMMs) have made significant strides in visual question-
answering for single images. Recent advancements like long-context LMMs have allowed …
WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
Multimodal document understanding is a challenging task to process and comprehend large
amounts of textual and visual information. Recent advances in Large Language Models …
Enhancing LLM trading performance with fact-subjectivity aware reasoning
While many studies prove more advanced LLMs perform better on tasks such as math and
coding, we notice that in cryptocurrency trading, stronger LLMs work worse than weaker …
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
Vision-Language Models (VLMs) have shown promising capabilities in handling various
multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving …
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
Current large multimodal models (LMMs) face significant challenges in processing and
comprehending long-duration or high-resolution videos, which is mainly due to the lack of …