Exploring multi-modal contextual knowledge for open-vocabulary object detection
We explore multi-modal contextual knowledge learned through multi-modal masked
language modeling to provide explicit localization guidance for novel classes in open …
Transferable Unintentional Action Localization with Language-guided Intention Translation
Unintentional action localization (UAL) is a challenging task that requires reasoning about
action intention clues to detect the temporal locations of unintentional action occurrences in …
Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection
Open-vocabulary 3D Object Detection (OV-3DDet) addresses the detection of objects from
an arbitrary list of novel categories in 3D scenes, which remains a very challenging problem …
AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding
Embodied reference understanding is crucial for intelligent agents to predict referents based
on human intention through gesture signals and language descriptions. This paper …
CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection
Q Chen, W **, J Ge, M Liu, Y Yan, J Jiang, L Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent research on universal object detection aims to introduce language in a SoTA closed-
set detector and then generalize the open-set concepts by constructing large-scale (text …
HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection
Open-vocabulary object detection (OVD) models are considered to be Large Multi-modal
Models (LMM), due to their extensive training data and a large number of parameters …