The revolution of multimodal large language models: a survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on …

MM1: methods, analysis and insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - … on Computer Vision, 2024 - Springer
In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

A survey of multimodal-guided image editing with text-to-image diffusion models

X Shuai, H Ding, X Ma, R Tu, YG Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Image editing aims to edit the given synthetic or real image to meet the specific requirements
from users. It has been widely studied in recent years as a promising and challenging field of …

Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks

J Wu, M Zhong, S Xing, Z Lai, Z Liu… - Advances in …, 2025 - proceedings.neurips.cc
We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that
unifies visual perception, understanding, and generation within a single framework. Unlike …

Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms

L Yang, Z Yu, C Meng, M Xu, S Ermon… - Forty-first International …, 2024 - openreview.net
Diffusion models have exhibited exceptional performance in text-to-image generation and
editing. However, existing methods often face challenges when handling complex text …

Smartedit: Exploring complex instruction-based image editing with multimodal large language models

Y Huang, L Xie, X Wang, Z Yuan… - Proceedings of the …, 2024 - openaccess.thecvf.com
Current instruction-based image editing methods such as InstructPix2Pix often fail to
produce satisfactory results in complex scenarios due to their dependence on the simple …

Diffusion model-based image editing: A survey

Y Huang, J Huang, Y Liu, M Yan, J Lv, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Denoising diffusion models have emerged as a powerful tool for various image generation
and editing tasks, facilitating the synthesis of visual content in an unconditional or input …

Towards semantic equivalence of tokenization in multimodal llm

S Wu, H Fei, X Li, J Ji, H Zhang, TS Chua… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in
processing vision-language tasks. One crux of MLLMs lies in vision tokenization …

Genartist: Multimodal llm as an agent for unified image generation and editing

Z Wang, A Li, Z Li, X Liu - Advances in Neural Information …, 2025 - proceedings.neurips.cc
Despite the success achieved by existing image generation and editing methods, current
models still struggle with complex problems including intricate text prompts, and the …