A survey on integration of large language models with intelligent robots
In recent years, the integration of large language models (LLMs) has revolutionized the field
of robotics, enabling robots to communicate, understand, and reason with human-like …
LLM-Grounder: Open-vocabulary 3D visual grounding with large language model as an agent
3D visual grounding is a critical skill for household robots, enabling them to navigate,
manipulate objects, and answer questions based on their environment. While existing …
Multi-object hallucination in vision language models
X Chen, Z Ma, X Zhang, S Xu, S Qian… - Advances in …, 2025 - proceedings.neurips.cc
Large vision language models (LVLMs) often suffer from object hallucination, producing
objects not present in the given images. While current benchmarks for object hallucination …
One-shot open affordance learning with foundation models
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained
with just one example per base object category but is expected to identify novel objects and …
LLaRA: Supercharging robot learning data for vision-language policy
LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process
state information as visual-textual prompts and respond with policy decisions in text. We …
Learning precise affordances from egocentric videos for robotic manipulation
Affordance, defined as the potential actions that an object offers, is crucial for robotic
manipulation tasks. A deep understanding of affordance can lead to more intelligent AI …
ManipVQA: Injecting robotic affordance and physically grounded information into multi-modal large language models
The integration of Multimodal Large Language Models (MLLMs) with robotic systems has
significantly enhanced the ability of robots to interpret and act upon natural language …
What does CLIP know about peeling a banana?
Humans show an innate capability to identify tools to support specific actions. The
association between object parts and the actions they facilitate is usually named …
VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation
Vision Transformers (ViTs) have emerged as the backbone of many segmentation models,
consistently achieving state-of-the-art (SOTA) performance. However, their success comes …
UniHOI: Learning Fast, Dense and Generalizable 4D Reconstruction for Egocentric Hand Object Interaction Videos
Egocentric Hand Object Interaction (HOI) videos provide valuable insights into human
interactions with the physical world, attracting growing interest from the computer vision and …