ShowUI: One Vision-Language-Action Model for Generalist GUI Agent

KQ Lin, L Li, D Gao, Z Yang, Z Bai, W Lei… - … 2024 Workshop on …, 2024 - openreview.net
Graphical User Interface (GUI) automation holds significant promise for enhancing human
productivity by assisting with digital tasks. While recent Large Language Models (LLMs) and …

Visual prompting in multimodal large language models: A survey

J Wu, Z Zhang, Y **
T Ma, Z Wang, J Zhou, M Wang, J Liang - arXiv preprint arXiv:2411.12286, 2024 - arxiv.org
Inferring affordable (i.e., graspable) parts of arbitrary objects based on human specifications
is essential for robots advancing toward open-vocabulary manipulation. Current grasp …

Improving Vision-Language-Action Models via Chain-of-Affordance

J Li, Y Zhu, Z Tang, J Wen, M Zhu, X Liu, C Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Robot foundation models, particularly Vision-Language-Action (VLA) models, have
garnered significant attention for their ability to enhance robot policy learning, greatly …

Objects and Actions: Learning Representations for Open-World Robotics

W Yuan - 2024 - search.proquest.com
Advancing robotics involves enabling systems to generalize across diverse and unseen
environments, known as "the open world." Traditional approaches rely on state estimators …

Understanding Depth and Height Perception in Large Visual-Language Models

S Azad, Y Jain, R Garg, YS Rawat, V Vineet - openreview.net
Geometric understanding—including depth and height perception—is fundamental to
intelligence and crucial for navigating our environment. Despite the impressive capabilities …