Google Scholar

Janus: Decoupling visual encoding for unified multimodal understanding and generation

C Wu, X Chen, Z Wu, Y Ma, X Liu, Z Pan, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …

Cited by 28

Benchmark evaluations, applications, and challenges of large vision language models: A survey

Z Li, X Wu, H Du, H Nghiem, G Shi - arXiv preprint arXiv:2501.02189, 2025 - arxiv.org

Multimodal Vision Language Models (VLMs) have emerged as a transformative technology
at the intersection of computer vision and natural language processing, enabling machines …

Cited by 5

Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Z Gao, Z Chen, E Cui, Y Ren, W Wang, J Zhu, H Tian… - Visual Intelligence, 2024 - Springer

Multi-modal large language models (MLLMs) have demonstrated impressive performance in
vision-language tasks across a wide range of domains. However, the large model scale and …

Cited by 9

Enhancing the reasoning ability of multimodal large language models via mixed preference optimization

W Wang, Z Chen, W Wang, Y Cao, Y Liu, Z Gao… - arXiv preprint arXiv …, 2024 - arxiv.org

Existing open-source multimodal large language models (MLLMs) generally follow a
training process involving pre-training and supervised fine-tuning. However, these models …

Cited by 10

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

J Han, J Liu, Y Jiang, B Yan, Y Zhang, Z Yuan… - arXiv preprint arXiv …, 2024 - arxiv.org

We present Infinity, a Bitwise Visual AutoRegressive Modeling capable of generating
high-resolution, photorealistic images following language instruction. Infinity redefines visual …

Cited by 9

In-context LoRA for diffusion transformers

L Huang, W Wang, ZF Wu, Y Shi, H Dou… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent research (arXiv:2410.15027) has explored the use of diffusion transformers (DiTs) for
task-agnostic image generation by simply concatenating attention tokens across images …

Cited by 6

TokenFlow: Unified image tokenizer for multimodal understanding and generation

L Qu, H Zhang, Y Liu, X Wang, Y Jiang, Y Gao… - arXiv preprint arXiv …, 2024 - arxiv.org

We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap
between multimodal understanding and generation. Prior research attempts to employ a …

Cited by 4

MetaMorph: Multimodal understanding and generation via instruction tuning

S Tong, D Fan, J Zhu, Y Xiong, X Chen, K Sinha… - arXiv preprint arXiv …, 2024 - arxiv.org

In this work, we propose Visual-Predictive Instruction Tuning (VPiT), a simple and effective
extension to visual instruction tuning that enables a pretrained LLM to quickly morph into an …

Cited by 6

Janus-Pro: Unified multimodal understanding and generation with data and model scaling

X Chen, Z Wu, X Liu, Z Pan, W Liu, Z Xie, X Yu… - arXiv preprint arXiv …, 2025 - arxiv.org

In this work, we introduce Janus-Pro, an advanced version of the previous work Janus.
Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training …

Cited by 8

JanusFlow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

Y Ma, X Liu, X Chen, W Liu, C Wu, Z Wu, Z Pan… - arXiv preprint arXiv …, 2024 - arxiv.org

We present JanusFlow, a powerful framework that unifies image understanding and
generation in a single model. JanusFlow introduces a minimalist architecture that integrates …

Cited by 4
