Turbo3D: Ultra-fast Text-to-3D Generation

H Hu, T Yin, F Luan, Y Hu, H Tan, Z Xu, S Bi… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Turbo3D, an ultra-fast text-to-3D system capable of generating high-quality
Gaussian splatting assets in under one second. Turbo3D employs a rapid 4-step, 4-view …

Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers

C Mitra, B Huang, T Chai, Z Lin, A Arbelle… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide
variety of vision-language (VL) tasks such as image captioning or visual question …

ICONS: Influence Consensus for Vision-Language Data Selection

X Wu, M **a, R Shao, Z Deng, PW Koh… - arxiv preprint arxiv …, 2024 - arxiv.org
Visual Instruction Tuning typically requires a large amount of vision-language training data.
This data often contains redundant information that increases computational costs without …

VLM²-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues

J Zhang, D Yao, R Pi, PP Liang - arXiv preprint arXiv:2502.12084, 2025 - arxiv.org
Visually linking matching cues is a crucial ability in daily life, such as identifying the same
person in multiple photos based on their cues, even without knowing who they are. Despite …

Probing Visual Language Priors in VLMs

T Luo, A Cao, G Lee, J Johnson, H Lee - arXiv preprint arXiv:2501.00569, 2024 - arxiv.org
Despite recent advances in Vision-Language Models (VLMs), many still over-rely on visual
language priors present in their training data rather than true visual reasoning. To examine …

NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?

J Li, J Mo, MD Vo, A Sugimoto, H Nakayama - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have made notable advances in visual
understanding, yet their abilities to recognize objects modified by specific attributes remain …

vVLM: Exploring Visual Reasoning in VLMs against Language Priors

T Luo, A Cao, G Lee, J Johnson, H Lee - openreview.net
The intersection of vision and language presents challenges, as vision language models
(VLMs) may exploit language biases, reducing their reliance on visual input. To examine …

Boosting Multimodal LLMs via Visual Token Supervision

Z Bao, M Liu, A Ramchandani, M Wang, F Juefei-Xu… - zpbao.github.io
Multimodal large language models (MLLMs) have shown impressive performance on tasks
requiring integrated visual and textual understanding. A key factor in their success is the …