- Academic Search

Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org

Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Cited by 134

[PDF] nowpublishers.com

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

[PDF] arxiv.org

ShareGPT4V: Improving large multi-modal models with better captions

L Chen, J Li, X Dong, P Zhang, C He, J Wang… - … on Computer Vision, 2024 - Springer

Modality alignment serves as the cornerstone for large multi-modal models (LMMs).
However, the impact of different attributes (e.g., data type, quality, and scale) of training data …

Cited by 443

[PDF] thecvf.com

Image as a foreign language: BEiT pretraining for vision and vision-language tasks

W Wang, H Bao, L Dong, J Bjorck… - Proceedings of the …, 2023 - openaccess.thecvf.com

A big convergence of language, vision, and multimodal pretraining is emerging. In this work,
we introduce a general-purpose multimodal foundation model BEiT-3, which achieves …

Cited by 449

[PDF] thecvf.com

EVA: Exploring the limits of masked visual representation learning at scale

Y Fang, W Wang, B Xie, Q Sun, L Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com

We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …

Cited by 697

[PDF] thecvf.com

InternImage: Exploring large-scale vision foundation models with deformable convolutions

W Wang, J Dai, Z Chen, Z Huang, Z Li… - Proceedings of the …, 2023 - openaccess.thecvf.com

Compared to the great progress of large-scale vision transformers (ViTs) in recent years,
large-scale models based on convolutional neural networks (CNNs) are still in an early …

Cited by 791

[PDF] thecvf.com

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily available at scale. The Vid2Seq …

Cited by 238

[PDF] arxiv.org

CogVLM: Visual expert for pretrained language models

W Wang, Q Lv, W Yu, W Hong, J Qi, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

We introduce CogVLM, a powerful open-source visual language foundation model. Different
from the popular shallow alignment method which maps image features into the input space …

Cited by 551

[PDF] thecvf.com

Generalized decoding for pixel, image, and language

X Zou, ZY Dou, J Yang, Z Gan, L Li… - Proceedings of the …, 2023 - openaccess.thecvf.com

We present X-Decoder, a generalized decoding model that can predict pixel-level
segmentation and language tokens seamlessly. X-Decoder takes as input two types of …

Cited by 250

[PDF] thecvf.com

A simple framework for open-vocabulary segmentation and detection

H Zhang, F Li, X Zou, S Liu, C Li… - Proceedings of the …, 2023 - openaccess.thecvf.com

In this work, we present OpenSeeD, a simple Open-vocabulary Segmentation and Detection
framework that learns from different segmentation and detection datasets. To bridge the gap …

Cited by 159

GLIPv2: Unifying localization and vision-language understanding

Foundation Models Defining a New Era in Vision: a Survey and Outlook
