PIVOT: Iterative visual prompting elicits actionable knowledge for VLMs

S Nasiriany, F Xia, W Yu, T Xiao, J Liang… - arxiv preprint arxiv …, 2024 - arxiv.org
Vision language models (VLMs) have shown impressive capabilities across a variety of
tasks, from logical reasoning to visual understanding. This opens the door to richer …

When LLMs step into the 3D world: A survey and meta-analysis of 3D tasks via multi-modal large language models

X Ma, Y Bhalgat, B Smart, S Chen, X Li, J Ding… - arxiv preprint arxiv …, 2024 - arxiv.org
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs)
has seen rapid progress, offering unprecedented capabilities for understanding and …

Coarse correspondence elicit 3D spacetime understanding in multimodal language model

B Liu, Y Dong, Y Wang, Y Rao, Y Tang, WC Ma… - arxiv preprint arxiv …, 2024 - arxiv.org
Multimodal language models (MLLMs) are increasingly being implemented in real-world
environments, necessitating their ability to interpret 3D spaces and comprehend temporal …

Visual prompting in multimodal large language models: A survey

J Wu, Z Zhang, Y Xia, X Li, Z Xia, A Chang, T Yu… - arxiv preprint arxiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) equip pre-trained large-language models
(LLMs) with visual capabilities. While textual prompting in LLMs has been widely studied …

GenSim2: Scaling robot data generation with multi-modal and reasoning LLMs

P Hua, M Liu, A Macaluso, Y Lin, W Zhang… - arxiv preprint arxiv …, 2024 - arxiv.org
Robotic simulation today remains challenging to scale up due to the human efforts required
to create diverse simulation tasks and scenes. Simulation-trained policies also face …

Visual Preference Inference: An Image Sequence-Based Preference Reasoning in Tabletop Object Manipulation

J Lee, S Park, Y Kwon, J Lee, M Ahn… - 2024 IEEE/RSJ …, 2024 - ieeexplore.ieee.org
In robotic object manipulation, human preferences can often be influenced by the visual
attributes of objects, such as color and shape. These properties play a crucial role in …

Coarse Correspondences Boost 3D Spacetime Understanding in Multimodal Language Model

B Liu, Y Dong, Y Wang, Z Ma, Y Tang, L Tang, Y Rao… - openreview.net
Multimodal language models (MLLMs) are increasingly being applied in real-world
environments, necessitating their ability to interpret 3D spaces and comprehend temporal …