A Survey of Multimodal Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

What matters when building vision-language models?

H Laurençon, L Tronchon, M Cord… - Advances in Neural …, 2025 - proceedings.neurips.cc
The growing interest in vision-language models (VLMs) has been driven by improvements in
large language models and vision transformers. Despite the abundance of literature on this …

SeeClick: Harnessing GUI grounding for advanced visual GUI agents

K Cheng, Q Sun, Y Chu, F Xu, Y Li, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Graphical User Interface (GUI) agents are designed to automate complex tasks on digital
devices, such as smartphones and desktops. Most existing GUI agents interact with the …

Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Empowering biomedical discovery with AI agents

S Gao, A Fang, Y Huang, V Giunchiglia, A Noori… - Cell, 2024 - cell.com
We envision "AI scientists" as systems capable of skeptical learning and reasoning that
empower biomedical research through collaborative agents that integrate AI models and …

VIEScore: Towards explainable metrics for conditional image synthesis evaluation

M Ku, D Jiang, C Wei, X Yue, W Chen - arXiv preprint arXiv:2312.14867, 2023 - arxiv.org
In the rapidly advancing field of conditional image generation research, challenges such as
limited explainability lie in effectively evaluating the performance and capabilities of various …

Building and better understanding vision-language models: insights and future directions

H Laurençon, A Marafioti, V Sanh… - … on Responsibly Building …, 2024 - openreview.net
The field of vision-language models (VLMs), which take images and texts as inputs and
output texts, is rapidly evolving and has yet to reach consensus on several key aspects of …

From concept to manufacturing: Evaluating vision-language models for engineering design

C Picard, KM Edwards, AC Doris, B Man… - arXiv preprint arXiv …, 2023 - arxiv.org
Engineering design is undergoing a transformative shift with the advent of AI, marking a new
era in how we approach product, system, and service planning. Large language models …

TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation

J Wen, Y Zhu, J Li, M Zhu, K Wu, Z Xu, N Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor
control and instruction comprehension through end-to-end learning processes. However …