INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
With advancements in data availability and computing resources, Multimodal Large
Language Models (MLLMs) have showcased capabilities across various fields. However …
Language Models (MLLMs) have showcased capabilities across various fields. However …
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
Image pyramids are widely adopted in top-performing methods to obtain multi-scale features
for precise visual perception and understanding. However, current image pyramids use the …
for precise visual perception and understanding. However, current image pyramids use the …
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Perception and understanding are two pillars of computer vision. While multimodal large
language models (MLLM) have demonstrated remarkable visual understanding capabilities …
language models (MLLM) have demonstrated remarkable visual understanding capabilities …
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
Large Multimodal Models (LMMs) have made significant breakthroughs with the
advancement of instruction tuning. However, while existing models can understand images …
advancement of instruction tuning. However, while existing models can understand images …
HumanVLM: Foundation for Human-Scene Vision-Language Model
Human-scene vision-language tasks are increasingly prevalent in diverse social
applications, yet recent advancements predominantly rely on models specifically tailored to …
applications, yet recent advancements predominantly rely on models specifically tailored to …
Expand VSR Benchmark for VLLM to Expertize in Spatial Rules
P **e, L Sun, B Liu, D Wang, X Zhang, C Sun… - arxiv preprint arxiv …, 2024 - arxiv.org
Distinguishing spatial relations is a basic part of human cognition which requires fine-
grained perception on cross-instance. Although benchmarks like MME, MMBench and …
grained perception on cross-instance. Although benchmarks like MME, MMBench and …
Diagnostics-LLaVA: A Visual Language Model for Domain-Specific Diagnostics of Equipment
The recent advancements in the area of Large language models (LLMs) has opened
horizons for conversational assistant-based intelligent models capable of interpreting …
horizons for conversational assistant-based intelligent models capable of interpreting …
Zoomer: Enhancing MLLM Performance with Adaptive Image Focus Optimization
Recent advancements in multimodal large language models (MLLMs) have broadened the
scope of vision-language tasks, excelling in applications like image captioning and …
scope of vision-language tasks, excelling in applications like image captioning and …