Exploring multi-modal contextual knowledge for open-vocabulary object detection

Y Xu, M Zhang, X Yang, C Xu - IEEE Transactions on Image …, 2024 - ieeexplore.ieee.org
We explore multi-modal contextual knowledge learned through multi-modal masked
language modeling to provide explicit localization guidance for novel classes in open …

Transferable Unintentional Action Localization with Language-guided Intention Translation

J Xu, Y Rao, J Zhou, J Lu - IEEE Transactions on Pattern …, 2025 - ieeexplore.ieee.org
Unintentional action localization (UAL) is a challenging task that requires reasoning about
action intention clues to detect the temporal locations of unintentional action occurrences in …

Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection

Y Cao, Y Zeng, H Xu, D Xu - arXiv preprint arXiv:2406.00830, 2024 - arxiv.org
Open-vocabulary 3D Object Detection (OV-3DDet) addresses the detection of objects from
an arbitrary list of novel categories in 3D scenes, which remains a very challenging problem …

AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding

H Guo, W Fan, B Wei, J Zhu, J Tian, C Yi… - arXiv preprint arXiv …, 2024 - arxiv.org
Embodied reference understanding is crucial for intelligent agents to predict referents based
on human intention through gesture signals and language descriptions. This paper …

CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection

Q Chen, W **, J Ge, M Liu, Y Yan, J Jiang, L Yu… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent research on universal object detection aims to introduce language into a SoTA closed-
set detector and then generalize open-set concepts by constructing large-scale (text …

HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection

Y Ma, M Liu, C Zhu, XC Yin - arXiv preprint arXiv:2409.16136, 2024 - arxiv.org

Open-vocabulary object detection (OVD) models are considered Large Multi-modal
Models (LMMs) due to their extensive training data and large number of parameters …