- Academic Search

Sharegpt4video: Improving video understanding and generation with better captions

L Chen, X Wei, J Li, X Dong, P Zhang… - Advances in …, 2025 - proceedings.neurips.cc

Abstract We present the ShareGPT4Video series, aiming to facilitate the video
understanding of large video-language models (LVLMs) and the video generation of text-to …

Cited by 101

Are we on the right way for evaluating large vision-language models?

L Chen, J Li, X Dong, P Zhang, Y Zang, Z Chen… - arxiv preprint arxiv …, 2024 - arxiv.org

Large vision-language models (LVLMs) have recently achieved rapid progress, sparking
numerous studies to evaluate their multi-modal capabilities. However, we dig into current …

Cited by 174

Mova: Adapting mixture of vision experts to multimodal context

Z Zong, B Ma, D Shen, G Song, H Shao, D Jiang… - arxiv preprint arxiv …, 2024 - arxiv.org

As the key component in multimodal large language models (MLLMs), the ability of the
visual encoder greatly affects MLLM's understanding on diverse image content. Although …

Cited by 36

Calibrated self-rewarding vision language models

Y Zhou, Z Fan, D Cheng, S Yang, Z Chen, C Cui… - arxiv preprint arxiv …, 2024 - arxiv.org

Large Vision-Language Models (LVLMs) have made substantial progress by integrating
pre-trained large language models (LLMs) and vision models through instruction tuning. Despite …

Cited by 32

Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment

D Chen, R Chen, S Pu, Z Liu, Y Wu, C Chen… - arxiv preprint arxiv …, 2024 - arxiv.org

Many real-world user queries (e.g., "How do to make egg fried rice?") could benefit from
systems capable of generating responses with both textual steps with accompanying …

Cited by 4

FaceXBench: Evaluating Multimodal LLMs on Face Understanding

K Narayan, V VS, VM Patel - arxiv preprint arxiv:2501.10360, 2025 - arxiv.org

Multimodal Large Language Models (MLLMs) demonstrate impressive problem-solving
abilities across a wide range of tasks and domains. However, their capacity for face …

MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

K Zhu, P Xia, Y Li, H Zhu, S Wang, H Yao - arxiv preprint arxiv …, 2024 - arxiv.org

The advancement of Large Vision-Language Models (LVLMs) has propelled their
application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality …

Cited by 1

Mmie: Massive multimodal interleaved comprehension benchmark for large vision-language models

Sharegpt4video: Improving video understanding and generation with better captions
