A survey on integration of large language models with intelligent robots

Y Kim, D Kim, J Choi, J Park, N Oh, D Park - Intelligent Service Robotics, 2024 - Springer
In recent years, the integration of large language models (LLMs) has revolutionized the field
of robotics, enabling robots to communicate, understand, and reason with human-like …

LLM-Grounder: Open-vocabulary 3D visual grounding with large language model as an agent

J Yang, X Chen, S Qian, N Madaan… - … on Robotics and …, 2024 - ieeexplore.ieee.org
3D visual grounding is a critical skill for household robots, enabling them to navigate,
manipulate objects, and answer questions based on their environment. While existing …

Multi-object hallucination in vision language models

X Chen, Z Ma, X Zhang, S Xu, S Qian… - Advances in …, 2025 - proceedings.neurips.cc
Large vision language models (LVLMs) often suffer from object hallucination, producing
objects not present in the given images. While current benchmarks for object hallucination …

One-shot open affordance learning with foundation models

G Li, D Sun, L Sevilla-Lara… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained
with just one example per base object category but is expected to identify novel objects and …

LLaRA: Supercharging robot learning data for vision-language policy

X Li, C Mata, J Park, K Kahatapitiya, YS Jang… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process
state information as visual-textual prompts and respond with policy decisions in text. We …
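As a purely illustrative sketch of the visual-textual prompting described in this snippet (not the paper's actual pipeline or API), the loop below encodes an instruction plus an image as a prompt, queries a stand-in VLM, and parses the textual policy decision; the query_vlm helper, prompt wording, and JSON action schema are all assumptions made for illustration.

    import json

    def query_vlm(image, prompt):
        # Hypothetical placeholder for a real VLM call; returns a canned
        # response here so the sketch runs end to end.
        return '{"action": "move_to", "target": [0.3, 0.1]}'

    def vlm_policy_step(image, instruction):
        # Encode the robot's state as a visual-textual prompt; the model
        # replies with a policy decision expressed in text.
        prompt = (
            f"Instruction: {instruction}\n"
            "Given the attached camera image, respond with a JSON action, "
            'e.g. {"action": "move_to", "target": [x, y]}.'
        )
        reply = query_vlm(image, prompt)
        return json.loads(reply)  # parse the textual decision into a structured action

    print(vlm_policy_step(image=None, instruction="pick up the red block"))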

Learning precise affordances from egocentric videos for robotic manipulation

G Li, N Tsagkas, J Song, R Mon-Williams… - arXiv preprint arXiv …, 2024 - arxiv.org
Affordance, defined as the potential actions that an object offers, is crucial for robotic
manipulation tasks. A deep understanding of affordance can lead to more intelligent AI …

ManipVQA: Injecting robotic affordance and physically grounded information into multi-modal large language models

S Huang, I Ponomarenko, Z Jiang, X Li, X Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
The integration of Multimodal Large Language Models (MLLMs) with robotic systems has
significantly enhanced the ability of robots to interpret and act upon natural language …

What does CLIP know about peeling a banana?

C Cuttano, G Rosi, G Trivigno… - Proceedings of the …, 2024 - openaccess.thecvf.com
Humans show an innate capability to identify tools to support specific actions. The
association between object parts and the actions they facilitate is usually named …

VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation

H Chen, Y Ni, W Huang, Y Liu, SH Jeong… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision Transformers (ViTs) have emerged as the backbone of many segmentation models,
consistently achieving state-of-the-art (SOTA) performance. However, their success comes …

UniHOI: Learning Fast, Dense and Generalizable 4D Reconstruction for Egocentric Hand Object Interaction Videos

C Yuan, G Chen, L Yi, Y Gao - arXiv preprint arXiv:2411.09145, 2024 - arxiv.org
Egocentric Hand Object Interaction (HOI) videos provide valuable insights into human
interactions with the physical world, attracting growing interest from the computer vision and …