Towards open vocabulary learning: A survey
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …
EarthGPT: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain
Multimodal large language models (MLLMs) have demonstrated remarkable success in
vision and visual-language tasks within the natural image domain. Owing to the significant …
MMRo: Are multimodal LLMs eligible as the brain for in-home robotics?
It is fundamentally challenging for robots to serve as useful assistants in human
environments because this requires addressing a spectrum of sub-problems across robotics …
3D-GRES: Generalized 3D referring expression segmentation
3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific
instance within a 3D space based on a natural language description. However, current …
DINO-X: A unified vision model for open-world object detection and understanding
In this paper, we introduce DINO-X, which is a unified object-centric vision model developed
by IDEA Research with the best open-world object detection performance to date. DINO-X …
Learning visual grounding from generative vision and language model
Visual grounding tasks aim to localize image regions based on natural language references.
In this work, we explore whether generative VLMs predominantly trained on image-text data …
Auto cherry-picker: Learning from high-quality generative data driven by language
Diffusion-based models have shown great potential in generating high-quality images with
various layouts, which can benefit downstream perception tasks. However, a fully automatic …
RoboCup@Home 2024 OPL winner NimbRo: Anthropomorphic service robots using foundation models for perception and planning
We present the approaches and contributions of the winning team NimbRo@Home at the
RoboCup@Home 2024 competition in the Open Platform League held in Eindhoven, NL …
CamoEnv: Transferable and environment-consistent adversarial camouflage in autonomous driving
Adversarial camouflage has garnered significant attention in the security literature on
autonomous driving. The ability to adapt to various angles makes adversarial camouflage …
DynamicEarth: How far are we from open-vocabulary change detection?
Monitoring Earth's evolving land covers requires methods capable of detecting changes
across a wide range of categories and contexts. Existing change detection methods are …