MME-Survey: A comprehensive survey on evaluation of multimodal LLMs
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …
A survey on evaluation of multimodal large language models
J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …
Foundations and recent trends in multimodal mobile agents: A survey
Mobile agents are essential for automating tasks in complex and dynamic mobile
environments. As foundation models evolve, the demands for agents that can adapt in real …
MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …
Navigating the digital world as humans do: Universal visual grounding for GUI agents
Multimodal large language models (MLLMs) are transforming the capabilities of graphical
user interface (GUI) agents, facilitating their transition from controlled simulations to …
MIA-Bench: Towards better instruction-following evaluation of multimodal LLMs
We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large
language models (MLLMs) on their ability to strictly adhere to complex instructions. Our …
AMEX: Android multi-annotation expo dataset for mobile GUI agents
AI agents have drawn increasing attention, largely for their ability to perceive environments,
understand tasks, and autonomously achieve goals. To advance research on AI agents in …
Ferret-UI 2: Mastering universal user interface understanding across platforms
Building a generalist model for user interface (UI) understanding is challenging due to
various foundational issues, such as platform diversity, resolution variation, and data …
Muffin or Chihuahua? Challenging multimodal large language models with multipanel VQA
Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily
lives. These images, characterized by their composition of multiple subfigures in distinct …
ControlMLLM: Training-free visual prompt learning for multimodal large language models
In this work, we propose a training-free method to inject visual referring into Multimodal
Large Language Models (MLLMs) through learnable visual token optimization. We observe …