BRAVE: Broadening the visual encoding of vision-language models

OF Kar, A Tonioni, P Poklukar, A Kulshrestha… - … on Computer Vision, 2024 - Springer
Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …

A survey on open-vocabulary detection and segmentation: Past, present, and future

C Zhu, L Chen - IEEE Transactions on Pattern Analysis and …, 2024 - ieeexplore.ieee.org
As the most fundamental scene understanding tasks, object detection and segmentation
have made tremendous progress in the deep learning era. Due to the expensive manual …

ProxyCLIP: Proxy attention improves CLIP for open-vocabulary segmentation

M Lan, C Chen, Y Ke, X Wang, L Feng… - European Conference on …, 2024 - Springer
Open-vocabulary semantic segmentation requires models to effectively integrate visual
representations with open-vocabulary semantic labels. While Contrastive Language-Image …

Improving medical multi-modal contrastive learning with expert annotations

Y Kumar, P Marttinen - European Conference on Computer Vision, 2024 - Springer
We introduce eCLIP, an enhanced version of the CLIP model that integrates expert
annotations in the form of radiologist eye-gaze heatmaps. It tackles key challenges in …

SemiVL: semi-supervised semantic segmentation with vision-language guidance

L Hoyer, DJ Tan, MF Naeem, L Van Gool… - European Conference on …, 2024 - Springer
In semi-supervised semantic segmentation, a model is trained with a limited number of
labeled images along with a large corpus of unlabeled images to reduce the high annotation …

Contrastive localized language-image pre-training

HY Chen, Z Lai, H Zhang, X Wang, M Eichner… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training
vision encoders to generate image/text representations facilitating various applications …

Image segmentation in foundation model era: A survey

T Zhou, F Zhang, B Chang, W Wang, Y Yuan… - arXiv preprint arXiv …, 2024 - arxiv.org
Image segmentation is a long-standing challenge in computer vision, studied continuously
over several decades, as evidenced by seminal algorithms such as N-Cut, FCN, and …

UniMed-CLIP: Towards a unified image-text pretraining paradigm for diverse medical imaging modalities

MU Khattak, S Kunhimon, M Naseer, S Khan… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models (VLMs) trained via contrastive learning have achieved notable
success in natural image tasks. However, their application in the medical domain remains …

Active data curation effectively distills large-scale multimodal models

V Udandarao, N Parthasarathy, MF Naeem… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into
smaller ones. Prior works have explored ever more complex KD strategies involving different …

Human Pose Descriptions and Subject-Focused Attention for Improved Zero-Shot Transfer in Human-Centric Classification Tasks

MSU Khan, MF Naeem, F Tombari, L Van Gool… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a novel LLM-based pipeline for creating contextual descriptions of human body
poses in images using only auxiliary attributes. This approach facilitates the creation of the …