Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Semantic image segmentation: Two decades of research
Semantic image segmentation (SiS) plays a fundamental role in a broad variety of computer
vision applications, providing key information for the global understanding of an image. This …
Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection
In this paper, we develop an open-set object detector, called Grounding DINO, by marrying
Transformer-based detector DINO with grounded pre-training, which can detect arbitrary …
Unleashing text-to-image diffusion models for visual perception
Diffusion models (DMs) have become the new trend of generative models and have
demonstrated a powerful ability of conditional synthesis. Among those, text-to-image …
Scalable 3D captioning with pretrained models
We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects.
This approach utilizes pretrained models from image captioning, image-text alignment, and …
GRES: Generalized referring expression segmentation
Referring Expression Segmentation (RES) aims to generate a segmentation mask
for the object described by a given language expression. Existing classic RES datasets and …
Tip-Adapter: Training-free adaption of CLIP for few-shot classification
Contrastive Vision-Language Pre-training, known as CLIP, has provided a new
paradigm for learning visual representations using large-scale image-text pairs. It shows …
Scaling open-vocabulary image segmentation with image-level labels
We design an open-vocabulary image segmentation model to organize an image into
meaningful regions indicated by arbitrary texts. Recent works (CLIP and ALIGN), despite …
PointCLIP: Point cloud understanding by CLIP
Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training
(CLIP) have shown inspirational performance on 2D visual recognition, which learns to …
What does CLIP know about a red circle? Visual prompt engineering for VLMs
A Shtedritski, C Rupprecht… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Large-scale Vision-Language Models, such as CLIP, learn powerful image-text
representations that have found numerous applications, from zero-shot classification to text …