GLaMM: Pixel grounding large multimodal model
Large Multimodal Models (LMMs) extend Large Language Models to the vision
domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual …
Towards open vocabulary learning: A survey
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …
A survey on open-vocabulary detection and segmentation: Past, present, and future
As the most fundamental scene understanding tasks, object detection and segmentation
have made tremendous progress in the deep learning era. Due to the expensive manual …
Multi-modal queried object detection in the wild
We introduce MQ-Det, an efficient architecture and pre-training strategy designed to utilize both
textual descriptions with open-set generalization and visual exemplars with rich description …
Learning background prompts to discover implicit knowledge for open vocabulary object detection
Open vocabulary object detection (OVD) aims to learn an optimal object detector capable
of recognizing objects from both base and novel categories. Recent advances leverage …
PromptKD: Unsupervised prompt distillation for vision-language models
Prompt learning has emerged as a valuable technique in enhancing vision-language
models (VLMs) such as CLIP for downstream tasks in specific domains. Existing work mainly …
Towards end-to-end embodied decision making via multi-modal large language model: Explorations with GPT4-Vision and beyond
In this study, we explore the potential of Multimodal Large Language Models (MLLMs) in
improving embodied decision-making processes for agents. While Large Language Models …
Improving zero-shot generalization of learned prompts via unsupervised knowledge distillation
Vision-Language Models (VLMs) demonstrate remarkable zero-shot generalization
to unseen tasks, but fall short of the performance of supervised methods in generalizing to …
Simple image-level classification improves open-vocabulary object detection
Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a given set
of base categories on which the detection model is trained. Recent OVOD methods focus on …
OV-VG: A benchmark for open-vocabulary visual grounding
Open-vocabulary learning has emerged as a cutting-edge research area, particularly in light
of the widespread adoption of vision-based foundational models. Its primary objective is to …