The revolution of multimodal large language models: a survey
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …
Controlmllm: Training-free visual prompt learning for multimodal large language models
In this work, we propose a training-free method to inject visual prompts into Multimodal
Large Language Models (MLLMs) through learnable latent variable optimization. We …
Investigating and mitigating the multimodal hallucination snowballing in large vision-language models
Though advanced in understanding visual information with human languages, Large Vision-
Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is …
[PDF] Artemis: Towards referential understanding in complex videos
J Qiu, Y Zhang, X Tang, L Xie, T Ma… - The Thirty-eighth …, 2024 - proceedings.neurips.cc
Videos carry rich visual information including object description, action, interaction, etc., but
the existing multimodal large language models (MLLMs) fall short in referential …
Videorefer suite: Advancing spatial-temporal object understanding with video llm
Video Large Language Models (Video LLMs) have recently exhibited remarkable
capabilities in general video understanding. However, they mainly focus on holistic …
ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension
T Ma, L **e, Y Tian, B Yang, Q Ye - arxiv preprint arxiv:2406.11327, 2024 - arxiv.org
Aligning vision and language concepts at a finer level remains an essential topic of
multimodal large language models (MLLMs), particularly for tasks such as referring and …
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
We present Omni-RGPT, a multimodal large language model designed to facilitate region-
level comprehension for both images and videos. To achieve consistent region …
Personalized Large Vision-Language Models
Personalization has gained significant attention in image generation yet remains
underexplored for large vision-language models (LVLMs). Beyond generic ones, with …
Visual Large Language Models for Generalized and Specialized Applications
Vision-language models (VLMs) have emerged as a powerful tool for learning a unified
embedding space for vision and language. Inspired by large language models, which have …
Exploring Spatial Language Grounding Through Referring Expressions
A Tumu, P Kordjamshidi - arXiv preprint arXiv:2502.04359, 2025 - arxiv.org
Spatial Reasoning is an important component of human cognition and is an area in which
the latest Vision-language models (VLMs) show signs of difficulty. The current analysis …