A Survey of Multimodal Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

A survey on large language model based autonomous agents

L Wang, C Ma, X Feng, Z Zhang, H Yang… - Frontiers of Computer …, 2024 - Springer
Autonomous agents have long been a research focus in academic and industry
communities. Previous research often focuses on training agents with limited knowledge …

VisualWebArena: Evaluating multimodal agents on realistic visual web tasks

JY Koh, R Lo, L Jang, V Duvvur, MC Lim… - arXiv preprint arXiv …, 2024 - arxiv.org
Autonomous agents capable of planning, reasoning, and executing actions on the web offer
a promising avenue for automating computer tasks. However, the majority of existing …

LL3DA: Visual interactive instruction tuning for omni-3D understanding, reasoning, and planning

S Chen, X Chen, C Zhang, M Li, G Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Recent progress in Large Multimodal Models (LMM) has opened up great
possibilities for various applications in the field of human-machine interactions. However …

SeeClick: Harnessing GUI grounding for advanced visual GUI agents

K Cheng, Q Sun, Y Chu, F Xu, Y Li, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Graphical User Interface (GUI) agents are designed to automate complex tasks on digital
devices, such as smartphones and desktops. Most existing GUI agents interact with the …

SPHINX-X: Scaling data and parameters for a family of multi-modal large language models

D Liu, R Zhang, L Qiu, S Huang, W Lin, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series
developed upon SPHINX. To improve the architecture and training efficiency, we modify the …

TextMonkey: An OCR-free large multimodal model for understanding document

Y Liu, B Yang, Q Liu, Z Li, Z Ma, S Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our
approach introduces enhancement across several dimensions: By adopting Shifted Window …

MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) show significant strides in general-purpose
multimodal applications such as visual dialogue and embodied navigation. However …

Promises and challenges of generative artificial intelligence for human learning

L Yan, S Greiff, Z Teuber, D Gašević - Nature Human Behaviour, 2024 - nature.com
Generative artificial intelligence (GenAI) holds the potential to transform the delivery,
cultivation and evaluation of human learning. Here the authors examine the integration of …

You only look at screens: Multimodal chain-of-action agents

Z Zhang, A Zhang - arXiv preprint arXiv:2309.11436, 2023 - arxiv.org
Autonomous graphical user interface (GUI) agents aim to facilitate task automation by
interacting with the user interface without manual intervention. Recent studies have …