Turbo3D: Ultra-fast Text-to-3D Generation

H Hu, T Yin, F Luan, Y Hu, H Tan, Z Xu, S Bi… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Turbo3D, an ultra-fast text-to-3D system capable of generating high-quality
Gaussian splatting assets in under one second. Turbo3D employs a rapid 4-step, 4-view …

Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers

C Mitra, B Huang, T Chai, Z Lin, A Arbelle… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide
variety of vision-language (VL) tasks such as image captioning or visual question …

ICONS: Influence Consensus for Vision-Language Data Selection

X Wu, M **a, R Shao, Z Deng, PW Koh… - arxiv preprint arxiv …, 2024 - arxiv.org
Visual Instruction Tuning typically requires a large amount of vision-language training data.
This data often contains redundant information that increases computational costs without …

VLM²-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues

J Zhang, D Yao, R Pi, PP Liang - arXiv preprint arXiv:2502.12084, 2025 - arxiv.org
Visually linking matching cues is a crucial ability in daily life, such as identifying the same
person in multiple photos based on their cues, even without knowing who they are. Despite …

Probing Visual Language Priors in VLMs

T Luo, A Cao, G Lee, J Johnson, H Lee - arXiv preprint arXiv:2501.00569, 2024 - arxiv.org
Despite recent advances in Vision-Language Models (VLMs), many still over-rely on visual
language priors present in their training data rather than true visual reasoning. To examine …

NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?

J Li, J Mo, MD Vo, A Sugimoto, H Nakayama - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have made notable advances in visual
understanding, yet their abilities to recognize objects modified by specific attributes remain …

vVLM: Exploring Visual Reasoning in VLMs against Language Priors

T Luo, A Cao, G Lee, J Johnson, H Lee - openreview.net
The intersection of vision and language presents challenges, as vision language models
(VLMs) may exploit language biases, reducing their reliance on visual input. To examine …

Boosting Multimodal LLMs via Visual Token Supervision

Z Bao, M Liu, A Ramchandani, M Wang, F Juefei-Xu… - zpbao.github.io
Multimodal large language models (MLLMs) have shown impressive performance on tasks
requiring integrated visual and textual understanding. A key factor in their success is the …