A Survey of Multimodal Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

A survey on large language model based autonomous agents

L Wang, C Ma, X Feng, Z Zhang, H Yang… - Frontiers of Computer …, 2024 - Springer
Autonomous agents have long been a research focus in academic and industry
communities. Previous research often focuses on training agents with limited knowledge …

VisualWebArena: Evaluating multimodal agents on realistic visual web tasks

JY Koh, R Lo, L Jang, V Duvvur, MC Lim… - arXiv preprint arXiv …, 2024 - arxiv.org
Autonomous agents capable of planning, reasoning, and executing actions on the web offer
a promising avenue for automating computer tasks. However, the majority of existing …

LL3DA: Visual interactive instruction tuning for omni-3D understanding, reasoning, and planning

S Chen, X Chen, C Zhang, M Li, G Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Recent progress in Large Multimodal Models (LMM) has opened up great
possibilities for various applications in the field of human-machine interactions. However …

SeeClick: Harnessing GUI grounding for advanced visual GUI agents

K Cheng, Q Sun, Y Chu, F Xu, Y Li, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Graphical User Interface (GUI) agents are designed to automate complex tasks on digital
devices, such as smartphones and desktops. Most existing GUI agents interact with the …

SPHINX-X: Scaling data and parameters for a family of multi-modal large language models

D Liu, R Zhang, L Qiu, S Huang, W Lin, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series
developed upon SPHINX. To improve the architecture and training efficiency, we modify the …

TextMonkey: An OCR-free large multimodal model for understanding document

Y Liu, B Yang, Q Liu, Z Li, Z Ma, S Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our
approach introduces enhancement across several dimensions: By adopting Shifted Window …

MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) show significant strides in general-purpose
multimodal applications such as visual dialogue and embodied navigation. However …

Promises and challenges of generative artificial intelligence for human learning

L Yan, S Greiff, Z Teuber, D Gašević - Nature Human Behaviour, 2024 - nature.com
Generative artificial intelligence (GenAI) holds the potential to transform the delivery,
cultivation and evaluation of human learning. Here the authors examine the integration of …

You only look at screens: Multimodal chain-of-action agents

Z Zhang, A Zhang - arXiv preprint arXiv:2309.11436, 2023 - arxiv.org
Autonomous graphical user interface (GUI) agents aim to facilitate task automation by
interacting with the user interface without manual intervention. Recent studies have …