MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

PKA Vasu, H Pouransari, F Faghri… - Proceedings of the …, 2024 - openaccess.thecvf.com
Contrastive pre-training of image-text foundation models such as CLIP demonstrated
excellent zero-shot performance and improved robustness on a wide range of downstream …
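
For context on the contrastive objective referenced above: CLIP-style pre-training trains an image encoder and a text encoder so that matching image-text pairs score higher than all mismatched pairs in a batch, via a symmetric InfoNCE loss. The following minimal PyTorch sketch is illustrative only and is not taken from the cited paper; the function name, the temperature value, and the assumption that both encoders output (batch, dim) embeddings are placeholders.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
        # L2-normalize so the dot product equals cosine similarity.
        image_embeds = F.normalize(image_embeds, dim=-1)
        text_embeds = F.normalize(text_embeds, dim=-1)
        # Pairwise similarity logits between every image and every text, scaled by temperature.
        logits = image_embeds @ text_embeds.t() / temperature
        # The i-th image matches the i-th text, so the correct targets lie on the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
        loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
        return (loss_i2t + loss_t2i) / 2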

Masked Image Modeling: A Survey

V Hondru, FA Croitoru, S Minaee, RT Ionescu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we survey recent studies on masked image modeling (MIM), an approach that
emerged as a powerful self-supervised learning technique in computer vision. The MIM task …

Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension

Y **e, K Yang, N Yang, W Deng, X Dai, T Gu… - arxiv preprint arxiv …, 2024 - arxiv.org
Recent advances in Large Language Models (LLMs) have catalyzed the development of
Large Multimodal Models (LMMs). However, existing research primarily focuses on tuning …

Motion-Aware Mask Feature Reconstruction for Skeleton-Based Action Recognition

X Zhu, X Shu, J Tang - … on Circuits and Systems for Video …, 2024 - ieeexplore.ieee.org
Despite recent advancements in masked skeleton modeling and visual-language
pre-training, no method has yet been proposed to explore capturing and utilizing the rich …

An Unsupervised Vision-related Keywords Retrieval and Fusion Method for Visual Storytelling

B Li, C Ma, X Gao, G Jia - 2023 IEEE 35th International …, 2023 - ieeexplore.ieee.org
Visual storytelling is a multi-modal generation task aiming to generate a coherent story for a
sequence of images. Previous visual storytelling models utilize task-beneficial non-visual …

FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

M Ahmadian, F Guerin, A Gilbert - arXiv preprint arXiv:2406.03447, 2024 - arxiv.org
This paper demonstrates a self-supervised approach for learning semantic video
representations. Recent vision studies show that a masking strategy for vision and natural …

Simplifying CLIP: Unleashing the Power of Large-Scale Models on Consumer-level Computers

H Liu - arXiv preprint arXiv:2411.14789, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has attracted a surge of attention for its
superior zero-shot performance and excellent transferability to downstream tasks. However …