MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

A Survey of Multimodal Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

VILA: On pre-training for visual language models

J Lin, H Yin, W Ping, P Molchanov… - Proceedings of the …, 2024 - openaccess.thecvf.com
Visual language models (VLMs) rapidly progressed with the recent success of large
language models. There have been growing efforts on visual instruction tuning to extend the …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Show-o: One single transformer to unify multimodal understanding and generation

J Xie, W Mao, Z Bai, DJ Zhang, W Wang, KQ Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …

VILA-U: a unified foundation model integrating visual understanding and generation

Y Wu, Z Zhang, J Chen, H Tang, D Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding
and generation. Traditional visual language models (VLMs) use separate modules for …

World model on million-length video and language with RingAttention

H Liu, W Yan, M Zaharia, P Abbeel - arXiv e-prints, 2024 - ui.adsabs.harvard.edu
Current language models fall short in understanding aspects of the world not easily
described in words, and struggle with complex, long-form tasks. Video sequences offer …

Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities

E Yang, L Shen, G Guo, X Wang, X Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Model merging is an efficient empowerment technique in the machine learning community
that requires neither the collection of raw training data nor expensive …

SEED-Story: Multimodal long story generation with large language model

S Yang, Y Ge, Y Li, Y Chen, Y Ge, Y Shan… - arXiv preprint arXiv …, 2024 - arxiv.org
With the remarkable advancements in image generation and open-form text generation, the
creation of interleaved image-text content has become an increasingly intriguing field …

Retrieving multimodal information for augmented generation: A survey

R Zhao, H Chen, W Wang, F Jiao, XL Do, C Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
As Large Language Models (LLMs) have become popular, an important trend has emerged of
using multimodality to augment LLMs' generation ability, which enables LLMs to better …