LLM-Planner: Few-shot grounded planning for embodied agents with large language models
This study focuses on using large language models (LLMs) as a planner for embodied
agents that can follow natural language instructions to complete complex tasks in a visually …
LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action
Goal-conditioned policies for robotic navigation can be trained on large, unannotated
datasets, providing for good generalization to real-world settings. However, particularly in …
Habitat 2.0: Training home assistants to rearrange their habitat
We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in
interactive 3D environments and complex physics-enabled scenarios. We make …
How much can CLIP benefit vision-and-language tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …
Language to rewards for robotic skill synthesis
Large language models (LLMs) have demonstrated exciting progress in acquiring diverse
new capabilities through in-context learning, ranging from logical reasoning to code-writing …
Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI
We present the Habitat-Matterport 3D (HM3D) dataset. HM3D is a large-scale dataset of
1,000 building-scale 3D reconstructions from a diverse set of real-world locations. Each …
A survey on multimodal large language models for autonomous driving
With the emergence of Large Language Models (LLMs) and Vision Foundation Models
(VFMs), multimodal AI systems benefiting from large models have the potential to equally …
History aware multimodal transformer for vision-and-language navigation
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow
instructions and navigate in real scenes. To remember previously visited locations and …
Embodied navigation with multi-modal information: A survey from tasks to methodology
Embodied AI aims to create agents that complete complex tasks by interacting with the
environment. A key problem in this field is embodied navigation, which understands multi …
Conversational information seeking
Conversational information seeking (CIS) is concerned with a sequence of interactions
between one or more users and an information system. Interactions in CIS are primarily …