DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge

W Wang, L Ding, L Shen, Y Luo, H Hu… - Proceedings of the 32nd …, 2024 - dl.acm.org
Multimodal Sentiment Analysis (MSA) focuses on leveraging multimodal signals for
understanding human sentiment. Most of the existing works rely on superficial information …

FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models

P Li, Z Gao, B Zhang, T Yuan, Y Wu, M Harandi… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision language models (VLMs) have achieved impressive progress in diverse applications,
becoming a prevalent research direction. In this paper, we build FIRE, a feedback …

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

F Zhu, Z Liu, XY Ng, H Wu, W Wang, F Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have achieved remarkable performance in many
vision-language tasks, yet their capabilities in fine-grained visual understanding remain …

VisionArena: 230K Real World User-VLM Conversations with Preference Labels

C Chou, L Dunlap, K Mashita, K Mandal… - arXiv preprint arXiv …, 2024 - arxiv.org
With the growing adoption and capabilities of vision-language models (VLMs) comes the
need for benchmarks that capture authentic user-VLM interactions. In response, we create …

Enhancing Perception Capabilities of Multimodal LLMs with Training-free Fusion

Z Chen, J Hu, Z Deng, Y Wang, B Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision
encoders with language models. Existing methods to enhance the visual perception of …

HumanVLM: Foundation for Human-Scene Vision-Language Model

D Dai, X Long, L Yutang, Z Yuanhui, S Xia - arXiv preprint arXiv …, 2024 - arxiv.org
Human-scene vision-language tasks are increasingly prevalent in diverse social
applications, yet recent advancements predominantly rely on models specifically tailored to …

How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?

S Lee, G Kim, J Kim, H Lee, H Chang, SH Park… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language adaptation (VL adaptation) transforms Large Language Models (LLMs)
into Large Vision-Language Models (LVLMs) for multimodal tasks, but this process often …

UHDF: Hallucination Detection Using Open Source Models Beyond Close Source Models Methods

D Liu, B Xu, Z Zhao, B Xu, M Yang - CCF International Conference on …, 2024 - Springer
With the emergence of multimodal large models, the problem of hallucination has been
plaguing their development and deployment. How to reliably detect the presence of …

Large Language Models: Testing Their Capabilities to Understand and Explain Spatial Concepts (Short Paper)

M Hojati, R Feick - … on Spatial Information Theory (COSIT 2024), 2024 - drops.dagstuhl.de
Interest in applying Large Language Models (LLMs), which use natural language
processing (NLP) to provide human-like responses to text-based questions, to geospatial …