Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models

Z Wang, S Cai, A Liu, Y Jin, J Hou… - IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024 - ieeexplore.ieee.org
Achieving human-like planning and control with multimodal observations in an open world is
a key milestone for more functional generalist agents. Existing approaches can handle …

Computational experiments meet large language model based agents: A survey and perspective

Q Ma, X Xue, D Zhou, X Yu, D Liu, X Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Computational experiments have emerged as a valuable method for studying complex
systems, involving the algorithmization of counterfactuals. However, accurately representing …

Rocket-1: Mastering open-world interaction with visual-temporal context prompting

S Cai, Z Wang, K Lian, Z Mu, X Ma, A Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to
embodied decision-making in open-world environments presents challenges. One critical …

Creative agents: Empowering agents with imagination for creative tasks

C Zhang, P Cai, Y Fu, H Yuan, Z Lu - arXiv preprint arXiv:2312.02519, 2023 - arxiv.org
We study building embodied agents for open-ended creative tasks. While existing methods
build instruction-following agents that can perform diverse open-ended tasks, none of them …

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Z Wang, S Cai, Z Mu, H Lin, C Zhang, X Liu, Q Li… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-
world instruction-following agents in Minecraft. Compared to prior works that either emit …

Odyssey: Empowering Minecraft Agents with Open-World Skills

S Liu, Y Li, K Zhang, Z Cui, W Fang, Y Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies have delved into constructing generalist agents for open-world environments
like Minecraft. Despite the encouraging results, existing efforts mainly focus on solving basic …

Groot-1.5: Learning to follow multi-modal instructions from weak supervision

S Cai, B Zhang, Z Wang, X Ma, A Liu… - Multi-modal Foundation …, 2024 - openreview.net
This paper studies the problem of learning an agent policy that can follow various forms of
instructions. Specifically, we focus on multi-modal instructions: the policy is expected to …

MageBench: Bridging Large Multimodal Models to Agents

M Zhang, Q Dai, Y Yang, J Bao, D Chen, K Qiu… - arXiv preprint arXiv …, 2024 - arxiv.org
LMMs have shown impressive visual understanding capabilities, with the potential to be
applied in agents, which demand strong reasoning and planning abilities. Nevertheless …