MME-Survey: A comprehensive survey on evaluation of multimodal LLMs
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …
A survey on evaluation of multimodal large language models
J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …
Foundations and recent trends in multimodal mobile agents: A survey
Mobile agents are essential for automating tasks in complex and dynamic mobile
environments. As foundation models evolve, the demands for agents that can adapt in real …
MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …
Navigating the digital world as humans do: Universal visual grounding for GUI agents
Multimodal large language models (MLLMs) are transforming the capabilities of graphical
user interface (GUI) agents, facilitating their transition from controlled simulations to …
MIA-Bench: Towards better instruction-following evaluation of multimodal LLMs
We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large
language models (MLLMs) on their ability to strictly adhere to complex instructions. Our …
AMEX: Android multi-annotation expo dataset for mobile GUI agents
AI agents have drawn increasing attention, largely for their ability to perceive environments,
understand tasks, and autonomously achieve goals. To advance research on AI agents in …
Ferret-UI 2: Mastering universal user interface understanding across platforms
Building a generalist model for user interface (UI) understanding is challenging due to
various foundational issues, such as platform diversity, resolution variation, and data …
Muffin or Chihuahua? Challenging multimodal large language models with multipanel VQA
Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily
lives. These images, characterized by their composition of multiple subfigures in distinct …
ControlMLLM: Training-free visual prompt learning for multimodal large language models
In this work, we propose a training-free method to inject visual referring into Multimodal
Large Language Models (MLLMs) through learnable visual token optimization. We observe …