Towards open vocabulary learning: A survey

J Wu, X Li, S Xu, H Yuan, H Ding… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …

EarthGPT: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain

W Zhang, M Cai, T Zhang, Y Zhuang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Multimodal large language models (MLLMs) have demonstrated remarkable success in
vision and visual-language tasks within the natural image domain. Owing to the significant …

MMRo: Are multimodal LLMs eligible as the brain for in-home robotics?

J Li, Y Zhu, Z Xu, J Gu, M Zhu, X Liu, N Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
It is fundamentally challenging for robots to serve as useful assistants in human
environments because this requires addressing a spectrum of sub-problems across robotics …

3D-GRES: Generalized 3D referring expression segmentation

C Wu, Y Liu, J Ji, Y Ma, H Wang, G Luo… - Proceedings of the …, 2024 - dl.acm.org
3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific
instance within a 3D space based on a natural language description. However, current …

DINO-X: A unified vision model for open-world object detection and understanding

T Ren, Y Chen, Q Jiang, Z Zeng, Y Xiong, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce DINO-X, which is a unified object-centric vision model developed
by IDEA Research with the best open-world object detection performance to date. DINO-X …

Learning visual grounding from generative vision and language model

S Wang, D Kim, A Taalimi, C Sun, W Kuo - arXiv preprint arXiv:2407.14563, 2024 - arxiv.org
Visual grounding tasks aim to localize image regions based on natural language references.
In this work, we explore whether generative VLMs predominantly trained on image-text data …

Auto cherry-picker: Learning from high-quality generative data driven by language

Y Chen, X Li, Y Li, Y Zeng, J Wu, X Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion-based models have shown great potential in generating high-quality images with
various layouts, which can benefit downstream perception tasks. However, a fully automatic …

RoboCup@Home 2024 OPL winner NimbRo: Anthropomorphic service robots using foundation models for perception and planning

R Memmesheimer, J Nogga, B Pätzold… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the approaches and contributions of the winning team NimbRo@Home at the
RoboCup@Home 2024 competition in the Open Platform League held in Eindhoven, NL …

CamoEnv: Transferable and environment-consistent adversarial camouflage in autonomous driving

Z Zhu, X Yang, H Su, S Zheng - Pattern Recognition Letters, 2025 - Elsevier
Adversarial camouflage has garnered significant attention in the security literature on
autonomous driving. The ability to adapt to various angles makes adversarial camouflage …

DynamicEarth: How Far are We from Open-Vocabulary Change Detection?

K Li, X Cao, Y Deng, C Pang, Z Xin, D Meng… - arXiv preprint arXiv …, 2025 - arxiv.org
Monitoring Earth's evolving land covers requires methods capable of detecting changes
across a wide range of categories and contexts. Existing change detection methods are …