MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Contrastive pre-training of image-text foundation models such as CLIP demonstrated
excellent zero-shot performance and improved robustness on a wide range of downstream …
Masked Image Modeling: A Survey
In this work, we survey recent studies on masked image modeling (MIM), an approach that
emerged as a powerful self-supervised learning technique in computer vision. The MIM task …
Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension
Recent advances in Large Language Models (LLMs) have catalyzed the development of
Large Multimodal Models (LMMs). However, existing research primarily focuses on tuning …
Motion-Aware Mask Feature Reconstruction for Skeleton-Based Action Recognition
Despite recent advancements in masked skeleton modeling and visual-language pre-
training, no method has yet been proposed to explore capturing and utilizing the rich …
An Unsupervised Vision-related Keywords Retrieval and Fusion Method for Visual Storytelling
B Li, C Ma, X Gao, G Jia - 2023 IEEE 35th International …, 2023 - ieeexplore.ieee.org
Visual storytelling is a multi-modal generation task aiming to generate a coherent story for a
sequence of images. Previous visual storytelling models utilize task-beneficial non-visual …
FILS: Self-Supervised Video Feature Prediction In Semantic Language Space
This paper demonstrates a self-supervised approach for learning semantic video
representations. Recent vision studies show that a masking strategy for vision and natural …
Simplifying CLIP: Unleashing the Power of Large-Scale Models on Consumer-level Computers
H Liu - arXiv preprint arXiv:2411.14789, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has attracted a surge of attention for its
superior zero-shot performance and excellent transferability to downstream tasks. However …