MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

PKA Vasu, H Pouransari, F Faghri… - Proceedings of the …, 2024 - openaccess.thecvf.com
Contrastive pre-training of image-text foundation models such as CLIP demonstrated
excellent zero-shot performance and improved robustness on a wide range of downstream …
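
For context on the contrastive objective referenced above: CLIP-style pre-training trains an image encoder and a text encoder so that matching image-text pairs score higher than all mismatched pairs in a batch, via a symmetric InfoNCE loss. The following minimal PyTorch sketch is illustrative only and is not taken from the cited paper; the function name, the temperature value, and the assumption that both encoders output (batch, dim) embeddings are placeholders.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
        # L2-normalize so the dot product equals cosine similarity.
        image_embeds = F.normalize(image_embeds, dim=-1)
        text_embeds = F.normalize(text_embeds, dim=-1)
        # Pairwise similarity logits between every image and every text, scaled by temperature.
        logits = image_embeds @ text_embeds.t() / temperature
        # The i-th image matches the i-th text, so the correct targets lie on the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
        loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
        return (loss_i2t + loss_t2i) / 2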

Masked Image Modeling: A Survey

V Hondru, FA Croitoru, S Minaee, RT Ionescu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we survey recent studies on masked image modeling (MIM), an approach that
emerged as a powerful self-supervised learning technique in computer vision. The MIM task …

Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension

Y **e, K Yang, N Yang, W Deng, X Dai, T Gu… - arxiv preprint arxiv …, 2024 - arxiv.org
Recent advances in Large Language Models (LLMs) have catalyzed the development of
Large Multimodal Models (LMMs). However, existing research primarily focuses on tuning …

Motion-Aware Mask Feature Reconstruction for Skeleton-Based Action Recognition

X Zhu, X Shu, J Tang - … on Circuits and Systems for Video …, 2024 - ieeexplore.ieee.org
Despite recent advancements in masked skeleton modeling and visual-language
pre-training, no method has yet been proposed to explore capturing and utilizing the rich …

An Unsupervised Vision-related Keywords Retrieval and Fusion Method for Visual Storytelling

B Li, C Ma, X Gao, G Jia - 2023 IEEE 35th International …, 2023 - ieeexplore.ieee.org
Visual storytelling is a multi-modal generation task aiming to generate a coherent story for a
sequence of images. Previous visual storytelling models utilize task-beneficial non-visual …

FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

M Ahmadian, F Guerin, A Gilbert - arXiv preprint arXiv:2406.03447, 2024 - arxiv.org
This paper demonstrates a self-supervised approach for learning semantic video
representations. Recent vision studies show that a masking strategy for vision and natural …

Simplifying CLIP: Unleashing the Power of Large-Scale Models on Consumer-level Computers

H Liu - arXiv preprint arXiv:2411.14789, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has attracted a surge of attention for its
superior zero-shot performance and excellent transferability to downstream tasks. However …