DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …
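As a rough illustration of the Mixture-of-Experts technique named in the title, here is a minimal sketch of an MoE feed-forward layer with top-k token routing. The expert count, routing top-k, and dimensions are illustrative assumptions, not DeepSeek-VL2's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal Mixture-of-Experts layer: each token is routed to its top-k experts."""

    def __init__(self, dim=512, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # token -> per-expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: (tokens, dim)
        scores = self.router(x)                          # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)             # normalize routing weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens sent to expert e at slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out
```

Only the selected experts run for each token, which is how MoE models grow parameter count without a proportional increase in per-token compute.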
WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge
Multimodal Sentiment Analysis (MSA) focuses on leveraging multimodal signals for
understanding human sentiment. Most existing works rely on superficial information …
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models
Vision-language models (VLMs) have achieved impressive progress in diverse applications,
becoming a prevalent research direction. In this paper, we build FIRE, a feedback …
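For intuition, here is a hedged sketch of the generic feedback-refinement loop a benchmark like this evaluates: a model answers, receives feedback, and revises. The functions `vlm_generate` and `get_feedback` are hypothetical stubs standing in for a real VLM and a real feedback source; they are not FIRE's API.

```python
def vlm_generate(image, question, history=None):
    # Hypothetical stand-in for a VLM call; a real system would run inference here.
    return "revised answer" if history else "initial answer"

def get_feedback(image, question, answer):
    # Hypothetical stand-in for teacher/user feedback; None means the answer is accepted.
    return None

def refine_with_feedback(image, question, max_rounds=3):
    answer = vlm_generate(image, question)               # first attempt
    for _ in range(max_rounds):
        feedback = get_feedback(image, question, answer)
        if feedback is None:                             # no corrections left
            break
        # Revise the answer conditioned on the dialogue so far.
        answer = vlm_generate(image, question, history=(answer, feedback))
    return answer
```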
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
Large Vision-Language Models (LVLMs) have achieved remarkable performance in many
vision-language tasks, yet their capabilities in fine-grained visual understanding remain …
VisionArena: 230K Real World User-VLM Conversations with Preference Labels
With the growing adoption and capabilities of vision-language models (VLMs) comes the
need for benchmarks that capture authentic user-VLM interactions. In response, we create …
Enhancing Perception Capabilities of Multimodal LLMs with Training-free Fusion
Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision
encoders with language models. Existing methods to enhance the visual perception of …
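The snippet's first sentence describes the standard MLLM recipe: a vision encoder's features are mapped into the language model's embedding space, typically by a learned projector, and prepended to the text tokens. A minimal sketch of that general pattern follows; the projector shape and dimensions are assumptions, not this paper's fusion method.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Projects vision-encoder features into the LLM embedding space."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # A two-layer MLP projector, a common alignment choice (dimensions assumed).
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, vision_feats, text_embeds):
        # vision_feats: (batch, num_patches, vision_dim) from the vision encoder
        # text_embeds:  (batch, seq_len, llm_dim) from the LLM's embedding table
        image_tokens = self.proj(vision_feats)                 # map into LLM space
        return torch.cat([image_tokens, text_embeds], dim=1)  # prepend image tokens
```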
Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment
While diffusion models are powerful in generating high-quality, diverse synthetic data for
object-centric tasks, existing methods struggle with scene-aware tasks such as Visual …
HumanVLM: Foundation for Human-Scene Vision-Language Model
Human-scene vision-language tasks are increasingly prevalent in diverse social
applications, yet recent advancements predominantly rely on models specifically tailored to …
How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?
Vision-Language adaptation (VL adaptation) transforms Large Language Models (LLMs)
into Large Vision-Language Models (LVLMs) for multimodal tasks, but this process often …
Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs
Understanding time from visual representations is a fundamental cognitive skill, yet it
remains a challenge for multimodal large language models (MLLMs). In this work, we …