The revolution of multimodal large language models: a survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

OMG-LLaVA: Bridging image-level, object-level, pixel-level reasoning and understanding

T Zhang, X Li, H Fei, H Yuan, S Wu… - Advances in …, 2025 - proceedings.neurips.cc
Current universal segmentation methods demonstrate strong capabilities in pixel-level
image and video understanding. However, they lack reasoning abilities and cannot be …

GSVA: Generalized segmentation via multimodal large language models

Z Xia, D Han, Y Han, X Pan, S Song… - Proceedings of the …, 2024 - openaccess.thecvf.com
Generalized Referring Expression Segmentation (GRES) extends the scope of
classic RES to refer to multiple objects in one expression or identify the empty targets absent …

Ferret-v2: An improved baseline for referring and grounding with large language models

H Zhang, H You, P Dufter, B Zhang, C Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
While Ferret seamlessly integrates regional understanding into the Large Language Model
(LLM) to facilitate its referring and grounding capability, it poses certain limitations …

SPIN: Hierarchical segmentation with subpart granularity in natural images

J Myers-Dean, J Reynolds, B Price, Y Fan… - European Conference on …, 2024 - Springer
Hierarchical segmentation entails creating segmentations at varying levels of granularity.
We introduce the first hierarchical semantic segmentation dataset with subpart annotations …

LaSagnA: Language-based segmentation assistant for complex queries

C Wei, H Tan, Y Zhong, Y Yang, L Ma - arXiv preprint arXiv:2404.08506, 2024 - arxiv.org
Recent advancements have empowered Large Language Models for Vision (vLLMs) to
generate detailed perceptual outcomes, including bounding boxes and masks. Nonetheless …

Selective" Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning

T Srinivasan, J Hessel, T Gupta, BY Lin, Y Choi… - arXiv preprint arXiv …, 2024 - arxiv.org
Selective prediction minimizes incorrect predictions from vision-language models (VLMs) by
allowing them to abstain from answering when uncertain. However, when deploying a vision …
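
As background on the mechanism this snippet names: selective prediction, in its generic form, is a confidence-thresholded abstention rule. The Python sketch below assumes a hypothetical VLM interface that returns an answer together with a confidence score; the interface and the threshold value are illustrative, not this paper's specific method.

def selective_predict(model, image, question, threshold=0.8):
    # `model` is assumed to return (answer, confidence); both the
    # interface and the 0.8 threshold are illustrative choices.
    answer, confidence = model(image, question)
    if confidence < threshold:
        return None  # abstain rather than risk a wrong answer
    return answer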

Reasoning to Attend: Try to Understand How <SEG> Token Works

R Qian, X Yin, D Dou - arXiv preprint arXiv:2412.17741, 2024 - arxiv.org
Current Large Multimodal Models (LMMs) with visual grounding typically rely on the
<SEG> token as a text prompt to jointly optimize the vision-language model (e.g., …
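
For readers unfamiliar with the mechanism under study: in LISA-style grounded LMMs, the hidden state at the <SEG> token position is projected into a mask decoder's prompt space. The sketch below illustrates that idea under assumed module names and dimensions; it is not this paper's implementation.

import torch
import torch.nn as nn

class SegTokenHead(nn.Module):
    # Projects the LLM hidden state at each <SEG> position into a mask
    # decoder's prompt space (a LISA-style design; dimensions assumed).
    def __init__(self, llm_dim: int = 4096, decoder_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(llm_dim, decoder_dim)

    def forward(self, hidden_states, input_ids, seg_token_id):
        # hidden_states: (batch, seq_len, llm_dim) from the LLM's last layer.
        batch_idx, seq_idx = (input_ids == seg_token_id).nonzero(as_tuple=True)
        seg_embeds = hidden_states[batch_idx, seq_idx]  # (n_seg, llm_dim)
        # These prompt embeddings would go to a SAM-like mask decoder
        # together with image features to produce the final masks.
        return self.proj(seg_embeds)  # (n_seg, decoder_dim)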

EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM

Q Nguyen, T Vu, TT Nguyen, Y Wen… - arXiv preprint arXiv …, 2024 - arxiv.org
Image editing technologies transform, adjust, remove, or otherwise alter
images. Recent research has significantly improved the capabilities of image editing tools …

SegLLM: Multi-round Reasoning Segmentation

XD Wang, S Zhang, S Li, K Kallidromitis, K Li… - arXiv preprint arXiv …, 2024 - arxiv.org
We present SegLLM, a novel multi-round interactive reasoning segmentation model that
enhances LLM-based segmentation by exploiting conversational memory of both visual and …
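
The multi-round setup can be pictured as feeding each round's query and predicted mask back in as context for the next round. The loop below is a hypothetical sketch with a stub segmenter; SegLLM's actual architecture is not reproduced here.

from typing import List, Tuple

def segment(image, query: str, history: List[Tuple[str, str]]) -> str:
    # Stub segmenter: a real model would condition on `history`
    # (earlier queries and masks); here we only record how many
    # rounds have been seen. Purely illustrative.
    return f"mask<{query!r}|rounds={len(history)}>"

history: List[Tuple[str, str]] = []
for query in ["segment the dog", "now segment only its tail"]:
    mask = segment("example.jpg", query, history)
    history.append((query, mask))  # conversational memory across rounds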