Octopus: Embodied vision-language programmer from environmental feedback

J Yang, Y Dong, S Liu, B Li, Z Wang, H Tan… - … on Computer Vision, 2024 - Springer
Large vision-language models (VLMs) have achieved substantial progress in multimodal
perception and reasoning. When integrated into an embodied agent, existing embodied …

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

S Xie, L Kong, Y Dong, C Sima, W Zhang… - arXiv preprint arXiv …, 2025 - arxiv.org
Recent advancements in Vision-Language Models (VLMs) have sparked interest in their use
for autonomous driving, particularly in generating interpretable driving decisions through …

VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning

X Wu, Y Ding, B Li, P Lu, D Yin, KW Chang… - arXiv preprint arXiv …, 2024 - arxiv.org
The ability of large vision-language models (LVLMs) to critique and correct their reasoning is
an essential building block towards their self-improvement. However, a systematic analysis …