Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Janus: Decoupling visual encoding for unified multimodal understanding and generation

C Wu, X Chen, Z Wu, Y Ma, X Liu, Z Pan, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …

Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

A survey on multimodal benchmarks: In the era of large AI models

L Li, G Chen, H Shi, J Xiao, L Chen - arXiv preprint arXiv:2409.18142, 2024 - arxiv.org
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …

Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

Y Ma, X Liu, X Chen, W Liu, C Wu, Z Wu, Z Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
We present JanusFlow, a powerful framework that unifies image understanding and
generation in a single model. JanusFlow introduces a minimalist architecture that integrates …

Questioning, Answering, and Captioning for Zero-Shot Detailed Image Caption

DT Luu, VT Le, DM Vo - Proceedings of the Asian …, 2024 - openaccess.thecvf.com
End-to-end pre-trained large vision language models (VLMs) have made unprecedented
progress in image captioning. Nonetheless, they struggle to generate detailed captions …

VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use

Z Zhang, R Rossi, T Yu, F Dernoncourt… - arXiv preprint arXiv …, 2024 - arxiv.org
While vision-language models (VLMs) have demonstrated remarkable performance across
various tasks combining textual and visual information, they continue to struggle with fine …

Ocean-OCR: Towards General OCR Application via a Vision-Language Model

S Chen, X Guo, Y Li, T Zhang, M Lin, D Kuang… - arXiv preprint arXiv …, 2025 - arxiv.org
Multimodal large language models (MLLMs) have shown impressive capabilities across
various domains, excelling in processing and understanding information from multiple …

Baichuan-Omni-1.5 Technical Report

Y Li, J Liu, T Zhang, S Chen, T Li, Z Li, L Liu… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal
understanding capabilities but also provides end-to-end audio generation capabilities. To …

Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models

Z Li, G Chen, S Liu, S Wang, V VS, Y Ji, S Lan… - arXiv preprint arXiv …, 2025 - arxiv.org
Recently, promising progress has been made by open-source vision-language models
(VLMs) in bringing their capabilities closer to those of proprietary frontier models. However …