Mm-llms: Recent advances in multimodal large language models
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …
A Survey of Multimodel Large Language Models
Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …
including vision, the technology of large language models is evolving from a single modality …
Vila: On pre-training for visual language models
Visual language models (VLMs) rapidly progressed with the recent success of large
language models. There have been growing efforts on visual instruction tuning to extend the …
language models. There have been growing efforts on visual instruction tuning to extend the …
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action
We present Unified-IO 2 a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text images audio and/or videos as input and can …
novel instructions. Unified-IO 2 can use text images audio and/or videos as input and can …
Show-o: One single transformer to unify multimodal understanding and generation
We present a unified transformer, ie, Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …
Vila-u: a unified foundation model integrating visual understanding and generation
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding
and generation. Traditional visual language models (VLMs) use separate modules for …
and generation. Traditional visual language models (VLMs) use separate modules for …
World model on million-length video and language with ringattention
Current language models fall short in understanding aspects of the world not easily
described in words, and struggle with complex, long-form tasks. Video sequences offer …
described in words, and struggle with complex, long-form tasks. Video sequences offer …
Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities
Model merging is an efficient empowerment technique in the machine learning community
that does not require the collection of raw training data and does not require expensive …
that does not require the collection of raw training data and does not require expensive …
Seed-story: Multimodal long story generation with large language model
With the remarkable advancements in image generation and open-form text generation, the
creation of interleaved image-text content has become an increasingly intriguing field …
creation of interleaved image-text content has become an increasingly intriguing field …
Retrieving multimodal information for augmented generation: A survey
As Large Language Models (LLMs) become popular, there emerged an important trend of
using multimodality to augment the LLMs' generation ability, which enables LLMs to better …
using multimodality to augment the LLMs' generation ability, which enables LLMs to better …