- Academic Search

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org

Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Save Cite Cited by 133 Related articles All 2 versions Free GPT-4

[Free GPT-4]

[PDF] nowpublishers.com

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Save Cite Cited by 198 Related articles All 7 versions Free GPT-4 Library Search View as HTML

[Free GPT-4]

[PDF] arxiv.org

Minigpt-4: Enhancing vision-language understanding with advanced large language models

D Zhu, J Chen, X Shen, X Li, M Elhoseiny - arxiv preprint arxiv …, 2023 - arxiv.org

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly
generating websites from handwritten text and identifying humorous elements within …

Save Cite Cited by 2419 Related articles All 7 versions Free GPT-4 View as HTML

[Free GPT-4]

[PDF] neurips.cc

Laion-5b: An open large-scale dataset for training next generation image-text models

C Schuhmann, R Beaumont, R Vencu… - Advances in …, 2022 - proceedings.neurips.cc

Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of
training on large amounts of noisy image-text data, without relying on expensive accurate …

Save Cite Cited by 3014 Related articles All 12 versions Free GPT-4 View as HTML

[Free GPT-4]

[PDF] 3dvar.com

[PDF][PDF] Scaling autoregressive models for content-rich text-to-image generation

J Yu, Y Xu, JY Koh, T Luong, G Baid, Z Wang… - arxiv preprint arxiv …, 2022 - 3dvar.com

Abstract We present the Pathways [1] Autoregressive Text-to-Image (Parti) model, which
generates high-fidelity photorealistic images and supports content-rich synthesis involving …

Save Cite Cited by 1092 Related articles All 5 versions Free GPT-4 View as HTML

[Free GPT-4]

[PDF] thecvf.com

Flava: A foundational language and vision alignment model

A Singh, R Hu, V Goswami… - Proceedings of the …, 2022 - openaccess.thecvf.com

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …

Save Cite Cited by 723 Related articles All 6 versions Free GPT-4 View as HTML

[Free GPT-4]

[PDF] thecvf.com

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

S Changpinyo, P Sharma, N Ding… - Proceedings of the …, 2021 - openaccess.thecvf.com

The availability of large-scale image captioning and visual question answering datasets has
contributed significantly to recent successes in vision-and-language pre-training. However …

Save Cite Cited by 1067 Related articles All 9 versions Free GPT-4 View as HTML

[Free GPT-4]

[PDF] neurips.cc

Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip

Q Yu, J He, X Deng, X Shen… - Advances in Neural …, 2023 - proceedings.neurips.cc

Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing
objects from an open set of categories in diverse environments. One way to address this …

Save Cite Cited by 126 Related articles All 5 versions Free GPT-4 View as HTML

[Free GPT-4]

[PDF] thecvf.com

Winoground: Probing vision and language models for visio-linguistic compositionality

T Thrush, R Jiang, M Bartolo, A Singh… - Proceedings of the …, 2022 - openaccess.thecvf.com

We present a novel task and dataset for evaluating the ability of vision and language models
to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two …

Save Cite Cited by 389 Related articles All 6 versions Free GPT-4 View as HTML

[Free GPT-4]

[PDF] arxiv.org

Vision-language models for vision tasks: A survey

J Zhang, J Huang, S **, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org

Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks
(DNNs) training, and they usually train a DNN for each single visual recognition task …

Save Cite Cited by 413 Related articles All 9 versions Free GPT-4

Create alert

Cite

Advanced search

Saved to My library

Connecting vision and language with localized narratives

Foundation Models Defining a New Era in Vision: a Survey and Outlook

Vision-language pre-training: Basics, recent advances, and future trends

Minigpt-4: Enhancing vision-language understanding with advanced large language models

Laion-5b: An open large-scale dataset for training next generation image-text models

[PDF][PDF] Scaling autoregressive models for content-rich text-to-image generation

Flava: A foundational language and vision alignment model

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip

Winoground: Probing vision and language models for visio-linguistic compositionality

Vision-language models for vision tasks: A survey