- Academic Search

Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org

Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Cited by 139

[PDF] nowpublishers.com

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 199

[PDF] thecvf.com

mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration

Q Ye, H Xu, J Ye, M Yan, A Hu, H Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

Multi-modal Large Language Models (MLLMs) have demonstrated impressive
instruction abilities across various open-ended tasks. However, previous methods have …

Cited by 340

[PDF] thecvf.com

Scaling language-image pre-training via masking

Y Li, H Fan, R Hu… - Proceedings of the …, 2023 - openaccess.thecvf.com

We present Fast Language-Image Pre-training (FLIP), a simple and more efficient
method for training CLIP. Our method randomly masks out and removes a large portion of …

Cited by 316

[PDF] arxiv.org

Contrastive audio-visual masked autoencoder

Y Gong, A Rouditchenko, AH Liu, D Harwath… - arxiv preprint arxiv …, 2022 - arxiv.org

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single
modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio …

Cited by 143

VeCLIP: Improving CLIP training via visual-enriched captions

Z Lai, H Zhang, B Zhang, W Wu, H Bai… - … on Computer Vision, 2024 - Springer

Large-scale web-crawled datasets are fundamental for the success of pre-training vision-
language models, such as CLIP. However, the inherent noise and potential irrelevance of …

Cited by 18

[PDF] frontiersin.org

Vision-language models for medical report generation and visual question answering: A review

I Hartsock, G Rasool - Frontiers in Artificial Intelligence, 2024 - frontiersin.org

Medical vision-language models (VLMs) combine computer vision (CV) and natural
language processing (NLP) to analyze visual and textual medical data. Our paper reviews …

Cited by 43

[PDF] thecvf.com

Hard patches mining for masked image modeling

H Wang, K Song, J Fan, Y Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com

Masked image modeling (MIM) has attracted much research attention due to its promising
potential for learning scalable visual representations. In typical approaches, models usually …

Cited by 60

[PDF] thecvf.com

Context-aware alignment and mutual masking for 3D-language pre-training

Z Jin, M Hayat, Y Yang, Y Guo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

3D visual language reasoning plays an important role in effective human-computer
interaction. The current approaches for 3D visual reasoning are task-specific, and lack pre …

Cited by 45

[PDF] arxiv.org

Deep learning for cross-domain data fusion in urban computing: Taxonomy, advances, and outlook

X Zou, Y Yan, X Hao, Y Hu, H Wen, E Liu, J Zhang… - Information …, 2025 - Elsevier

As cities continue to burgeon, Urban Computing emerges as a pivotal discipline for
sustainable development by harnessing the power of cross-domain data fusion from diverse …

Cited by 24

Masked vision and language modeling for multi-modal representation learning

Foundation Models Defining a New Era in Vision: a Survey and Outlook