The revolution of multimodal large language models: a survey
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …
Controlmllm: Training-free visual prompt learning for multimodal large language models
In this work, we propose a training-free method to inject visual prompts into Multimodal
Large Language Models (MLLMs) through learnable latent variable optimization. We …
Investigating and mitigating the multimodal hallucination snowballing in large vision-language models
Though advanced in understanding visual information with human languages, Large Vision-
Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is …
[PDF] Artemis: Towards referential understanding in complex videos
J Qiu, Y Zhang, X Tang, L Xie, T Ma… - The Thirty-eighth …, 2024 - proceedings.neurips.cc
Videos carry rich visual information including object description, action, interaction, etc., but
the existing multimodal large language models (MLLMs) fall short in referential …
Videorefer suite: Advancing spatial-temporal object understanding with video llm
Video Large Language Models (Video LLMs) have recently exhibited remarkable
capabilities in general video understanding. However, they mainly focus on holistic …
ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension
T Ma, L **e, Y Tian, B Yang, Q Ye - arxiv preprint arxiv:2406.11327, 2024 - arxiv.org
Aligning vision and language concepts at a finer level remains an essential topic of
multimodal large language models (MLLMs), particularly for tasks such as referring and …
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
We present Omni-RGPT, a multimodal large language model designed to facilitate region-
level comprehension for both images and videos. To achieve consistent region …
Personalized Large Vision-Language Models
Personalization has gained significant attention in image generation yet remains
underexplored for large vision-language models (LVLMs). Beyond generic ones, with …
Visual Large Language Models for Generalized and Specialized Applications
Vision-language models (VLMs) have emerged as a powerful tool for learning a unified
embedding space for vision and language. Inspired by large language models, which have …
Exploring Spatial Language Grounding Through Referring Expressions
A Tumu, P Kordjamshidi - arXiv preprint arXiv:2502.04359, 2025 - arxiv.org
Spatial Reasoning is an important component of human cognition and is an area in which
the latest Vision-language models (VLMs) show signs of difficulty. The current analysis …