MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Generative AI and process systems engineering: The next frontier

B Decardi-Nelson, AS Alshehri, A Ajagekar… - Computers & Chemical …, 2024 - Elsevier
This review article explores how emerging generative artificial intelligence (GenAI) models,
such as large language models (LLMs), can enhance solution methodologies within process …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This monograph presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on the transition …

Visual autoregressive modeling: Scalable image generation via next-scale prediction

K Tian, Y Jiang, Z Yuan, B Peng… - Advances in neural …, 2025 - proceedings.neurips.cc
We present Visual AutoRegressive modeling (VAR), a new generation paradigm that
redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" …
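The coarse-to-fine idea named in the abstract can be pictured as autoregression over token maps of increasing resolution rather than over individual tokens in raster order. The sketch below is only an illustrative outline under assumed names and sizes (predict_next_scale, the scale schedule, and the codebook size are hypothetical, and a random stub stands in for the transformer); it is not the authors' implementation.

# Toy sketch of coarse-to-fine "next-scale" autoregression (illustrative only).
import numpy as np

scales = [1, 2, 4, 8, 16]          # side lengths of successive token maps (assumed schedule)
vocab_size = 4096                  # size of a hypothetical discrete codebook

def predict_next_scale(context_maps, side, rng):
    # Stand-in for a model that conditions on all coarser token maps
    # and emits a (side x side) map of discrete token ids.
    return rng.integers(0, vocab_size, size=(side, side))

rng = np.random.default_rng(0)
context = []                        # token maps generated so far, coarse to fine
for side in scales:
    next_map = predict_next_scale(context, side, rng)
    context.append(next_map)        # each new scale conditions on all previous ones

print([m.shape for m in context])   # [(1, 1), (2, 2), (4, 4), (8, 8), (16, 16)]

The point of the sketch is the loop structure: each step emits an entire higher-resolution token map conditioned on the coarser maps already generated, in contrast to next-token prediction over a flattened raster sequence.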

Generative multimodal models are in-context learners

Q Sun, Y Cui, X Zhang, F Zhang, Q Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Humans can easily solve multimodal tasks in context with only a few demonstrations or
simple instructions, which current multimodal systems largely struggle to imitate. In this work …

Ferret: Refer and ground anything anywhere at any granularity

H You, H Zhang, Z Gan, X Du, B Zhang, Z Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and accurately …

Unified-IO 2: Scaling autoregressive multimodal models with vision, language, audio, and action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Emu: Generative pretraining in multimodality

Q Sun, Q Yu, Y Cui, F Zhang, X Zhang, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We present Emu, a Transformer-based multimodal foundation model, which can seamlessly
generate images and texts in multimodal context. This omnivore model can take in any …

An image is worth 32 tokens for reconstruction and generation

Q Yu, M Weber, X Deng, X Shen… - Advances in Neural …, 2025 - proceedings.neurips.cc
Recent advancements in generative models have highlighted the crucial role of image
tokenization in the efficient synthesis of high-resolution images. Tokenization, which …
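The role tokenization plays here can be illustrated with a minimal vector-quantization sketch: encoder outputs are snapped to the nearest codebook entries, so an image is represented by a short sequence of discrete token ids (32 in this toy, echoing the title). The codebook size, latent dimensionality, and nearest-neighbour lookup are assumptions for illustration, not the paper's tokenizer.

# Toy sketch of image tokenization via nearest-codebook vector quantization (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 16))        # 1024 codes, 16-dim each (assumed sizes)
latents = rng.normal(size=(32, 16))           # e.g. 32 encoder latents per image

# Nearest-neighbour assignment: each latent becomes one discrete token id.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
token_ids = dists.argmin(axis=1)              # shape (32,): the image as 32 tokens

# "Detokenization" looks the codes back up; a learned decoder would map
# these code embeddings back to pixels.
recovered = codebook[token_ids]
print(token_ids.shape, recovered.shape)       # (32,) (32, 16)

A generative model then only has to predict these short discrete sequences, which is what makes compact tokenizers attractive for high-resolution synthesis.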

DreamLLM: Synergistic multimodal comprehension and creation

R Dong, C Han, Y Peng, Z Qi, Z Ge, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents DreamLLM, a learning framework that first achieves versatile
Multimodal Large Language Models (MLLMs) empowered with frequently overlooked …