Aligning cyber space with physical world: A comprehensive survey on embodied AI

Y Liu, W Chen, Y Bai, X Liang, G Li, W Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General
Intelligence (AGI) and serves as a foundation for various applications that bridge cyberspace …

A survey on text-guided 3D visual grounding: elements, recent advances, and future directions

D Liu, Y Liu, W Huang, W Hu - arXiv preprint arXiv:2406.05785, 2024 - arxiv.org
Text-guided 3D visual grounding (T-3DVG), which aims to locate a specific object that
semantically corresponds to a language query from a complicated 3D scene, has drawn …

OakInk2: A dataset of bimanual hands-object manipulation in complex task completion

X Zhan, L Yang, Y Zhao, K Mao, H Xu… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present OAKINK2, a dataset of bimanual object manipulation tasks for complex daily
activities. In pursuit of constructing the complex tasks into a structured representation …

MeshXL: Neural coordinate field for generative 3D foundation models

S Chen, X Chen, A Pang, X Zeng… - Advances in …, 2025 - proceedings.neurips.cc
The polygon mesh representation of 3D data exhibits great flexibility, fast rendering speed,
and storage efficiency, which is widely preferred in various applications. However, given its …

Lexicon3D: Probing visual foundation models for complex 3D scene understanding

Y Man, S Zheng, Z Bao, M Hebert… - Advances in Neural …, 2025 - proceedings.neurips.cc
Complex 3D scene understanding has gained increasing attention, with scene encoding
strategies built on top of visual foundation models playing a crucial role in this success …

LLaRA: Supercharging robot learning data for vision-language policy

X Li, C Mata, J Park, K Kahatapitiya, YS Jang… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process
state information as visual-textual prompts and respond with policy decisions in text. We …

Pandora: Towards general world model with natural language actions and video states

J Xiang, G Liu, Y Gu, Q Gao, Y Ning, Y Zha… - arXiv preprint arXiv …, 2024 - arxiv.org
World models simulate future states of the world in response to different actions. They
facilitate interactive content creation and provide a foundation for grounded, long-horizon …

HumanVLA: Towards vision-language directed object rearrangement by physical humanoid

X Xu, Y Zhang, YL Li, L Han… - Advances in Neural …, 2025 - proceedings.neurips.cc
Physical Human-Scene Interaction (HSI) plays a crucial role in numerous
applications. However, existing HSI techniques are limited to specific object dynamics and …

ChatCam: Empowering camera control through conversational AI

X Liu, YW Tai, CK Tang - Advances in Neural Information …, 2025 - proceedings.neurips.cc
Cinematographers adeptly capture the essence of the world, crafting compelling visual
narratives through intricate camera movements. Witnessing the strides made by large …