MME-Survey: A comprehensive survey on evaluation of multimodal LLMs

C Fu, YF Zhang, S Yin, B Li, X Fang, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …

A survey on evaluation of multimodal large language models

J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …

Foundations and recent trends in multimodal mobile agents: A survey

B Wu, Y Li, M Fang, Z Song, Z Zhang, Y Wei… - arXiv preprint arXiv …, 2024 - arxiv.org
Mobile agents are essential for automating tasks in complex and dynamic mobile
environments. As foundation models evolve, the demands for agents that can adapt in real …

MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning

H Zhang, M Gao, Z Gan, P Dufter, N Wenzel… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …

Navigating the digital world as humans do: Universal visual grounding for GUI agents

B Gou, R Wang, B Zheng, Y Xie, C Chang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) are transforming the capabilities of graphical
user interface (GUI) agents, facilitating their transition from controlled simulations to …

MIA-Bench: Towards better instruction following evaluation of multimodal LLMs

Y Qian, H Ye, JP Fauconnier, P Grasch, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large
language models (MLLMs) on their ability to strictly adhere to complex instructions. Our …

AMEX: Android multi-annotation expo dataset for mobile GUI agents

Y Chai, S Huang, Y Niu, H Xiao, L Liu, D Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
AI agents have drawn increasing attention, mostly for their ability to perceive environments,
understand tasks, and autonomously achieve goals. To advance research on AI agents in …

Ferret-UI 2: Mastering universal user interface understanding across platforms

Z Li, K You, H Zhang, D Feng, H Agrawal, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building a generalist model for user interface (UI) understanding is challenging due to
various foundational issues, such as platform diversity, resolution variation, and data …

Muffin or Chihuahua? Challenging multimodal large language models with multipanel VQA

Y Fan, J Gu, K Zhou, Q Yan, S Jiang… - Proceedings of the …, 2024 - aclanthology.org
Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily
lives. These images, characterized by their composition of multiple subfigures in distinct …

ControlMLLM: Training-free visual prompt learning for multimodal large language models

M Wu, X Cai, J Ji, J Li, O Huang, G Luo, H Fei… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we propose a training-free method to inject visual referring into Multimodal
Large Language Models (MLLMs) through learnable visual token optimization. We observe …