mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models

J Ye, H Xu, H Liu, A Hu, M Yan, Q Qian, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities
in executing instructions for a variety of single-image tasks. Despite this progress, significant …

A survey on evaluation of multimodal large language models

J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …

Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Z Gao, Z Chen, E Cui, Y Ren, W Wang, J Zhu, H Tian… - Visual Intelligence, 2024 - Springer
Multi-modal large language models (MLLMs) have demonstrated impressive performance in
vision-language tasks across a wide range of domains. However, the large model scale and …

Enhancing the reasoning ability of multimodal large language models via mixed preference optimization

W Wang, Z Chen, W Wang, Y Cao, Y Liu, Z Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing open-source multimodal large language models (MLLMs) generally follow a
training process involving pre-training and supervised fine-tuning. However, these models …

A survey on multimodal benchmarks: In the era of large AI models

L Li, G Chen, H Shi, J Xiao, L Chen - arXiv preprint arXiv:2409.18142, 2024 - arxiv.org
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …

Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark

TH Wu, G Biamby, J Quenum, R Gupta… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Multimodal Models (LMMs) have made significant strides in visual question-answering
for single images. Recent advancements like long-context LMMs have allowed …

WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

X **e, H Yan, L Yin, Y Liu, J Ding, M Liao, Y Liu… - arxiv preprint arxiv …, 2024 - arxiv.org
Multimodal document understanding is a challenging task to process and comprehend large
amounts of textual and visual information. Recent advances in Large Language Models …

Enhancing LLM trading performance with fact-subjectivity aware reasoning

Q Wang, Y Gao, Z Tang, B Luo, B He - arXiv preprint arXiv:2410.12464, 2024 - arxiv.org
While many studies show that more advanced LLMs perform better on tasks such as math and
coding, we notice that in cryptocurrency trading, stronger LLMs work worse than weaker …

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

J Ge, Z Chen, J Lin, J Zhu, X Liu, J Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models (VLMs) have shown promising capabilities in handling various
multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving …

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

W Ren, H Yang, J Min, C Wei, W Chen - arXiv preprint arXiv:2412.00927, 2024 - arxiv.org
Current large multimodal models (LMMs) face significant challenges in processing and
comprehending long-duration or high-resolution videos, which is mainly due to the lack of …