Transformer-based visual segmentation: A survey

X Li, H Ding, H Yuan, W Zhang, J Pang… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Visual segmentation seeks to partition images, video frames, or point clouds into multiple
segments or groups. This technique has numerous real-world applications, such as …

The (r)evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

OMG-LLaVA: Bridging image-level, object-level, pixel-level reasoning and understanding

T Zhang, X Li, H Fei, H Yuan, S Wu, S Ji… - arXiv preprint arXiv …, 2024 - arxiv.org
Current universal segmentation methods demonstrate strong capabilities in pixel-level
image and video understanding. However, they lack reasoning abilities and cannot be …

MG-LLaVA: Towards multi-granularity visual instruction tuning

X Zhao, X Li, H Duan, H Huang, Y Li, K Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal large language models (MLLMs) have made significant strides in various visual
understanding tasks. However, the majority of these models are constrained to process low …

Auto cherry-picker: Learning from high-quality generative data driven by language

Y Chen, X Li, Y Li, Y Zeng, J Wu, X Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion-based models have shown great potential in generating high-quality images with
various layouts, which can benefit downstream perception tasks. However, a fully automatic …

TSCnet: A text-driven semantic-level controllable framework for customized low-light image enhancement

M Zhang, J Yin, P Zeng, Y Shen, S Lu, X Wang - Neurocomputing, 2025 - Elsevier
Deep learning-based image enhancement methods show significant advantages in
reducing noise and improving visibility in low-light conditions. These methods are typically …

LLAVADI: What Matters For Multimodal Large Language Models Distillation

S Xu, X Li, H Yuan, L Qi, Y Tong, MH Yang - arXiv preprint arXiv …, 2024 - arxiv.org
The recent surge in Multimodal Large Language Models (MLLMs) has showcased their
remarkable potential for achieving generalized intelligence by integrating visual …

Visual Large Language Models for Generalized and Specialized Applications

Y Li, Z Lai, W Bao, Z Tan, A Dao, K Sui, J Shen… - arXiv preprint arXiv …, 2025 - arxiv.org
Visual-language models (VLMs) have emerged as a powerful tool for learning a unified
embedding space for vision and language. Inspired by large language models, which have …