- Academic Search

A survey on text-guided 3D visual grounding: elements, recent advances, and future directions

D Liu, Y Liu, W Huang, W Hu - arXiv preprint arXiv:2406.05785, 2024 - arxiv.org

Text-guided 3D visual grounding (T-3DVG), which aims to locate a specific object that
semantically corresponds to a language query from a complicated 3D scene, has drawn …

OakInk2: A dataset of bimanual hands-object manipulation in complex task completion

X Zhan, L Yang, Y Zhao, K Mao, H Xu… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present OAKINK2, a dataset of bimanual object manipulation tasks for complex daily
activities. In pursuit of constructing the complex tasks into a structured representation …

MeshXL: Neural coordinate field for generative 3D foundation models

S Chen, X Chen, A Pang, X Zeng… - Advances in …, 2025 - proceedings.neurips.cc

The polygon mesh representation of 3D data exhibits great flexibility, fast rendering speed,
and storage efficiency, which is widely preferred in various applications. However, given its …

Lexicon3D: Probing visual foundation models for complex 3D scene understanding

Y Man, S Zheng, Z Bao, M Hebert… - Advances in Neural …, 2025 - proceedings.neurips.cc

Complex 3D scene understanding has gained increasing attention, with scene encoding
strategies built on top of visual foundation models playing a crucial role in this success …

When LLMs step into the 3D world: A survey and meta-analysis of 3D tasks via multi-modal large language models

X Ma, Y Bhalgat, B Smart, S Chen, X Li, J Ding… - arXiv preprint arXiv …, 2024 - arxiv.org

As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs)
has seen rapid progress, offering unprecedented capabilities for understanding and …

LLaRA: Supercharging robot learning data for vision-language policy

X Li, C Mata, J Park, K Kahatapitiya, YS Jang… - arXiv preprint arXiv …, 2024 - arxiv.org

LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process
state information as visual-textual prompts and respond with policy decisions in text. We …

Pandora: Towards general world model with natural language actions and video states

J Xiang, G Liu, Y Gu, Q Gao, Y Ning, Y Zha… - arXiv preprint arXiv …, 2024 - arxiv.org

World models simulate future states of the world in response to different actions. They
facilitate interactive content creation and provide a foundation for grounded, long-horizon …

HumanVLA: Towards vision-language directed object rearrangement by physical humanoid

X Xu, Y Zhang, YL Li, L Han… - Advances in Neural …, 2025 - proceedings.neurips.cc

Physical Human-Scene Interaction (HSI) plays a crucial role in numerous
applications. However, existing HSI techniques are limited to specific object dynamics and …