DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge

W Wang, L Ding, L Shen, Y Luo, H Hu… - Proceedings of the 32nd …, 2024 - dl.acm.org
Multimodal Sentiment Analysis (MSA) focuses on leveraging multimodal signals for
understanding human sentiment. Most of the existing works rely on superficial information …

FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models

P Li, Z Gao, B Zhang, T Yuan, Y Wu, M Harandi… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision language models (VLMs) have achieved impressive progress in diverse applications,
becoming a prevalent research direction. In this paper, we build FIRE, a feedback …

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

F Zhu, Z Liu, XY Ng, H Wu, W Wang, F Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have achieved remarkable performance in many
vision-language tasks, yet their capabilities in fine-grained visual understanding remain …

VisionArena: 230K Real World User-VLM Conversations with Preference Labels

C Chou, L Dunlap, K Mashita, K Mandal… - arXiv preprint arXiv …, 2024 - arxiv.org
With the growing adoption and capabilities of vision-language models (VLMs) comes the
need for benchmarks that capture authentic user-VLM interactions. In response, we create …

Enhancing Perception Capabilities of Multimodal LLMs with Training-free Fusion

Z Chen, J Hu, Z Deng, Y Wang, B Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision
encoders with language models. Existing methods to enhance the visual perception of …

HumanVLM: Foundation for Human-Scene Vision-Language Model

D Dai, X Long, L Yutang, Z Yuanhui, S Xia - arXiv preprint arXiv …, 2024 - arxiv.org
Human-scene vision-language tasks are increasingly prevalent in diverse social
applications, yet recent advancements predominantly rely on models specifically tailored to …

How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?

S Lee, G Kim, J Kim, H Lee, H Chang, SH Park… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language adaptation (VL adaptation) transforms Large Language Models (LLMs)
into Large Vision-Language Models (LVLMs) for multimodal tasks, but this process often …

UHDF: Hallucination Detection Using Open Source Models Beyond Close Source Models Methods

D Liu, B Xu, Z Zhao, B Xu, M Yang - CCF International Conference on …, 2024 - Springer
With the emergence of multimodal large models, the problem of hallucination has been
plaguing their development and deployment. How to reliably detect the presence of …

Large Language Models: Testing Their Capabilities to Understand and Explain Spatial Concepts (Short Paper)

M Hojati, R Feick - … on Spatial Information Theory (COSIT 2024), 2024 - drops.dagstuhl.de
Interest in applying Large Language Models (LLMs), which use natural language
processing (NLP) to provide human-like responses to text-based questions, to geospatial …