MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

A Survey of Multimodal Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

VILA: On pre-training for visual language models

J Lin, H Yin, W Ping, P Molchanov… - Proceedings of the …, 2024 - openaccess.thecvf.com
Visual language models (VLMs) rapidly progressed with the recent success of large
language models. There have been growing efforts on visual instruction tuning to extend the …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Show-o: One single transformer to unify multimodal understanding and generation

J Xie, W Mao, Z Bai, DJ Zhang, W Wang, KQ Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …

VILA-U: a unified foundation model integrating visual understanding and generation

Y Wu, Z Zhang, J Chen, H Tang, D Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding
and generation. Traditional visual language models (VLMs) use separate modules for …

World model on million-length video and language with RingAttention

H Liu, W Yan, M Zaharia, P Abbeel - arXiv e-prints, 2024 - ui.adsabs.harvard.edu
Current language models fall short in understanding aspects of the world not easily
described in words, and struggle with complex, long-form tasks. Video sequences offer …

Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities

E Yang, L Shen, G Guo, X Wang, X Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Model merging is an efficient empowerment technique in the machine learning community
that requires neither the collection of raw training data nor expensive …

SEED-Story: Multimodal long story generation with large language model

S Yang, Y Ge, Y Li, Y Chen, Y Ge, Y Shan… - arXiv preprint arXiv …, 2024 - arxiv.org
With the remarkable advancements in image generation and open-form text generation, the
creation of interleaved image-text content has become an increasingly intriguing field …

Retrieving multimodal information for augmented generation: A survey

R Zhao, H Chen, W Wang, F Jiao, XL Do, C Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
As Large Language Models (LLMs) have become popular, an important trend has emerged of
using multimodality to augment LLMs' generation ability, which enables LLMs to better …