GLaMM: Pixel grounding large multimodal model

H Rasheed, M Maaz, S Shaji… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Large Multimodal Models (LMMs) extend Large Language Models to the vision
domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual …

Towards open vocabulary learning: A survey

J Wu, X Li, S Xu, H Yuan, H Ding… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …

A survey on open-vocabulary detection and segmentation: Past, present, and future

C Zhu, L Chen - IEEE Transactions on Pattern Analysis and …, 2024 - ieeexplore.ieee.org
As the most fundamental scene understanding tasks, object detection and segmentation
have made tremendous progress in the deep learning era. Due to the expensive manual …

Multi-modal queried object detection in the wild

Y Xu, M Zhang, C Fu, P Chen… - Advances in Neural …, 2023 - proceedings.neurips.cc
We introduce MQ-Det, an efficient architecture and pre-training strategy design to utilize both
textual description with open-set generalization and visual exemplars with rich description …

Learning background prompts to discover implicit knowledge for open vocabulary object detection

J Li, J Zhang, J Li, G Li, S Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Open vocabulary object detection (OVD) aims at seeking an optimal object detector capable
of recognizing objects from both base and novel categories. Recent advances leverage …

PromptKD: Unsupervised prompt distillation for vision-language models

Z Li, X Li, X Fu, X Zhang, W Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Prompt learning has emerged as a valuable technique in enhancing vision-language
models (VLMs) such as CLIP for downstream tasks in specific domains. Existing work mainly …

Towards end-to-end embodied decision making via multi-modal large language model: Explorations with GPT4-Vision and beyond

L Chen, Y Zhang, S Ren, H Zhao, Z Cai… - arXiv preprint arXiv …, 2023 - arxiv.org
In this study, we explore the potential of Multimodal Large Language Models (MLLMs) in
improving embodied decision-making processes for agents. While Large Language Models …

Improving zero-shot generalization of learned prompts via unsupervised knowledge distillation

M Mistretta, A Baldrati, M Bertini… - European Conference on …, 2024 - Springer
Abstract Vision-Language Models (VLMs) demonstrate remarkable zero-shot generalization
to unseen tasks, but fall short of the performance of supervised methods in generalizing to …

Simple image-level classification improves open-vocabulary object detection

R Fang, G Pang, X Bai - Proceedings of the AAAI Conference on …, 2024 - ojs.aaai.org
Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a given set
of base categories on which the detection model is trained. Recent OVOD methods focus on …

OV-VG: A benchmark for open-vocabulary visual grounding

C Wang, W Feng, X Li, G Cheng, S Lyu, B Liu, L Chen… - Neurocomputing, 2024 - Elsevier
Open-vocabulary learning has emerged as a cutting-edge research area, particularly in light
of the widespread adoption of vision-based foundational models. Its primary objective is to …