INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

Y Ma, Z Wang, X Sun, W Lin, Q Zhou, J Ji… - arXiv preprint arXiv …, 2024 - arxiv.org
With advancements in data availability and computing resources, Multimodal Large
Language Models (MLLMs) have showcased capabilities across various fields. However …

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Z Wang, X Zhu, X Yang, G Luo, H Li, C Tian… - arXiv preprint arXiv …, 2025 - arxiv.org
Image pyramids are widely adopted in top-performing methods to obtain multi-scale features
for precise visual perception and understanding. However, current image pyramids use the …

ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

Q Jiang, Y Yang, Y Xiong, Y Chen, Z Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org
Perception and understanding are two pillars of computer vision. While multimodal large
language models (MLLM) have demonstrated remarkable visual understanding capabilities …

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

W Peng, L Meng, Y Chen, Y Xie, Y Liu, T Gui… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Multimodal Models (LMMs) have made significant breakthroughs with the
advancement of instruction tuning. However, while existing models can understand images …

HumanVLM: Foundation for Human-Scene Vision-Language Model

D Dai, X Long, L Yutang, Z Yuanhui, S Xia - arXiv preprint arXiv …, 2024 - arxiv.org
Human-scene vision-language tasks are increasingly prevalent in diverse social
applications, yet recent advancements predominantly rely on models specifically tailored to …

Expand VSR Benchmark for VLLM to Expertize in Spatial Rules

P Xie, L Sun, B Liu, D Wang, X Zhang, C Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
Distinguishing spatial relations is a basic part of human cognition which requires fine-
grained perception on cross-instance. Although benchmarks like MME, MMBench and …

Diagnostics-LLaVA: A Visual Language Model for Domain-Specific Diagnostics of Equipment

A Kumar, M Alam, A Farahat… - … Conference of the …, 2024 - papers.phmsociety.org
The recent advancements in the area of large language models (LLMs) have opened
horizons for conversational assistant-based intelligent models capable of interpreting …

Zoomer: Enhancing MLLM Performance with Adaptive Image Focus Optimization

J Qian, C Wang, Y Yang, C Zhang, H Jiang, X Luo… - openreview.net
Recent advancements in multimodal large language models (MLLMs) have broadened the
scope of vision-language tasks, excelling in applications like image captioning and …