Google Scholar

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

Y Ma, Z Wang, X Sun, W Lin, Q Zhou, J Ji… - arXiv preprint arXiv …, 2024 - arxiv.org

With advancements in data availability and computing resources, Multimodal Large
Language Models (MLLMs) have showcased capabilities across various fields. However …

Cited by 1

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Z Wang, X Zhu, X Yang, G Luo, H Li, C Tian… - arXiv preprint arXiv …, 2025 - arxiv.org

Image pyramids are widely adopted in top-performing methods to obtain multi-scale features
for precise visual perception and understanding. However, current image pyramids use the …

ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

Q Jiang, Y Yang, Y Xiong, Y Chen, Z Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org

Perception and understanding are two pillars of computer vision. While multimodal large
language models (MLLM) have demonstrated remarkable visual understanding capabilities …

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

W Peng, L Meng, Y Chen, Y Xie, Y Liu, T Gui… - arXiv preprint arXiv …, 2024 - arxiv.org

Large Multimodal Models (LMMs) have made significant breakthroughs with the
advancement of instruction tuning. However, while existing models can understand images …

HumanVLM: Foundation for Human-Scene Vision-Language Model

D Dai, X Long, L Yutang, Z Yuanhui, S Xia - arXiv preprint arXiv …, 2024 - arxiv.org

Human-scene vision-language tasks are increasingly prevalent in diverse social
applications, yet recent advancements predominantly rely on models specifically tailored to …

Expand VSR Benchmark for VLLM to Expertize in Spatial Rules

P Xie, L Sun, B Liu, D Wang, X Zhang, C Sun… - arXiv preprint arXiv …, 2024 - arxiv.org

Distinguishing spatial relations is a basic part of human cognition which requires fine-grained
perception on cross-instance. Although benchmarks like MME, MMBench and …

Diagnostics-LLaVA: A Visual Language Model for Domain-Specific Diagnostics of Equipment

A Kumar, M Alam, A Farahat… - … Conference of the …, 2024 - papers.phmsociety.org

The recent advancements in the area of Large language models (LLMs) has opened
horizons for conversational assistant-based intelligent models capable of interpreting …

Zoomer: Enhancing MLLM Performance with Adaptive Image Focus Optimization

J Qian, C Wang, Y Yang, C Zhang, H Jiang, X Luo… - openreview.net

Recent advancements in multimodal large language models (MLLMs) have broadened the
scope of vision-language tasks, excelling in applications like image captioning and …

Saved to My library

Mg-llava: Towards multi-granularity visual instruction tuning

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model