- Academic Search

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer

In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Cited by: 356

Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arxiv preprint arxiv …, 2024 - arxiv.org

While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Cited by: 72

Show-o: One single transformer to unify multimodal understanding and generation

J Xie, W Mao, Z Bai, DJ Zhang, W Wang, KQ Lin… - arxiv preprint arxiv …, 2024 - arxiv.org

We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …

Cited by: 77

Shapellm: Universal 3d object understanding for embodied interaction

Z Qi, R Dong, S Zhang, H Geng, C Han, Z Ge… - … on Computer Vision, 2024 - Springer

This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM)
designed for embodied interaction, exploring a universal 3D object understanding with 3D …

Cited by: 40

Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining

D Liu, S Zhao, L Zhuo, W Lin, Y Qiao, H Li… - arxiv preprint arxiv …, 2024 - arxiv.org

We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …

Cited by: 28

Densefusion-1m: Merging vision experts for comprehensive multimodal perception

X Li, F Zhang, H Diao, Y Wang, X Wang… - arxiv preprint arxiv …, 2024 - arxiv.org

Existing Multimodal Large Language Models (MLLMs) increasingly emphasize complex
understanding of various visual elements, including multiple objects, text information, and …

Cited by: 17

Brain-inspired Artificial Intelligence: A Comprehensive Review

J Ren, F Xia - arxiv preprint arxiv:2408.14811, 2024 - arxiv.org

Current artificial intelligence (AI) models often focus on enhancing performance through
meticulous parameter tuning and optimization techniques. However, the fundamental design …

Cited by: 4

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arxiv preprint arxiv …, 2024 - arxiv.org

Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

Cited by: 2

Dreambench++: A human-aligned benchmark for personalized image generation

Y Peng, Y Cui, H Tang, Z Qi, R Dong, J Bai… - arxiv preprint arxiv …, 2024 - arxiv.org

Personalized image generation holds great promise in assisting humans in everyday work
and life due to its impressive function in creatively generating personalized content …

Cited by: 14

Harmonizing visual text comprehension and generation

Z Zhao, J Tang, B Wu, C Lin, S Wei, H Liu… - arxiv preprint arxiv …, 2024 - arxiv.org

In this work, we present TextHarmony, a unified and versatile multimodal generative model
proficient in comprehending and generating visual text. Simultaneously generating images …

Cited by: 11

Emu: Generative pretraining in multimodality

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites
