CoCa: Contrastive captioners are image-text foundation models
Exploring large-scale pretrained foundation models is of significant interest in computer
vision because these models can be quickly transferred to many downstream tasks. This …
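The snippet breaks off before describing the method, so a rough sketch of what a "contrastive captioner" optimizes may help: an image-text contrastive loss trained jointly with an autoregressive captioning loss. The PyTorch sketch below is illustrative only; the temperature and loss weight are assumed values, not CoCa's published ones.

```python
# Illustrative sketch of a contrastive-captioner objective: a symmetric
# image-text contrastive loss plus an autoregressive captioning loss.
# Temperature and caption_weight are assumptions, not CoCa's actual values.
import torch
import torch.nn.functional as F

def contrastive_captioner_loss(img_emb, txt_emb, caption_logits, caption_targets,
                               temperature=0.07, caption_weight=1.0):
    """img_emb, txt_emb: (B, D) pooled embeddings; caption_logits: (B, T, V);
    caption_targets: (B, T) next-token ids."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=img_emb.device)  # positives on the diagonal
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2
    captioning = F.cross_entropy(caption_logits.flatten(0, 1), caption_targets.flatten())
    return contrastive + caption_weight * captioning
```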
Socratic Models: Composing zero-shot multimodal reasoning with language
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the
domain of data they are trained on. While these domains are generic, they may only barely …
VideoCLIP: Contrastive pre-training for zero-shot video-text understanding
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot
video and text understanding, without using any labels on downstream tasks. VideoCLIP …
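The core of this kind of contrastive video-text pre-training is a symmetric InfoNCE objective over pooled clip and caption embeddings. The sketch below shows only that generic core; VideoCLIP's actual recipe additionally uses temporally overlapped positive pairs and retrieval-augmented hard negatives, which are omitted here.

```python
# Generic symmetric InfoNCE over pooled video/text embeddings; VideoCLIP's
# overlapped positives and retrieved hard negatives are omitted for brevity.
# The temperature value is an assumption.
import torch
import torch.nn.functional as F

def video_text_infonce(frame_feats, token_feats, temperature=0.05):
    """frame_feats: (B, F, D) per-frame features; token_feats: (B, T, D) per-token features."""
    v = F.normalize(frame_feats.mean(dim=1), dim=-1)   # average-pool frames -> (B, D)
    t = F.normalize(token_feats.mean(dim=1), dim=-1)   # average-pool tokens -> (B, D)
    logits = v @ t.t() / temperature                   # (B, B) clip-caption similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```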
CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning
Video clip retrieval and captioning tasks play an essential role in multimodal research and
are fundamental research problems for multimodal understanding and generation. The …
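One parameter-free recipe of the sort such an empirical study compares is to encode sampled frames with a frozen image CLIP, mean-pool the frame embeddings into a video embedding, and rank by cosine similarity. The sketch below uses a Hugging Face CLIP checkpoint as an illustrative stand-in; the model name and frame sampling are assumptions, not the paper's exact setup.

```python
# Mean-pooled CLIP frame embeddings for text-to-video retrieval; the checkpoint
# and pooling choice illustrate one simple aggregation strategy.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def video_embedding(frames):
    """frames: list of PIL images sampled uniformly from the video clip."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)          # (F, D) per-frame features
    feats = feats / feats.norm(dim=-1, keepdim=True)
    pooled = feats.mean(dim=0)                          # mean pooling over frames
    return pooled / pooled.norm()

@torch.no_grad()
def rank_videos(query, video_embs):
    """video_embs: (N, D) stacked, normalized video embeddings."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    q = model.get_text_features(**inputs)[0]
    q = q / q.norm()
    return (video_embs @ q).argsort(descending=True)    # best-matching videos first
```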
X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval
Video-text retrieval has been a crucial and fundamental task in multi-modal research. The
development of video-text retrieval has been considerably promoted by large-scale multi …
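The "multi-grained" idea in the title can be sketched as computing retrieval scores at several granularities, video-sentence, frame-sentence, video-word, and frame-word, and fusing them. The max/mean fusion below is a deliberate simplification; the paper fuses the similarity matrices with a learned attention mechanism.

```python
# Simplified multi-grained video-text score: coarse video-sentence similarity
# combined with fine-grained frame-sentence, video-word, and frame-word terms.
# The max/mean fusion stands in for the paper's learned attention over
# similarity matrices.
import torch
import torch.nn.functional as F

def multi_grained_score(video_emb, frame_embs, sent_emb, word_embs):
    """video_emb: (D,); frame_embs: (F, D); sent_emb: (D,); word_embs: (W, D)."""
    video_emb = F.normalize(video_emb, dim=-1)
    frame_embs = F.normalize(frame_embs, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    word_embs = F.normalize(word_embs, dim=-1)
    vs = video_emb @ sent_emb                                   # video-sentence
    fs = (frame_embs @ sent_emb).max()                          # most relevant frame
    vw = (word_embs @ video_emb).max()                          # most relevant word
    fw = (frame_embs @ word_embs.t()).max(dim=1).values.mean()  # frame-word grid
    return (vs + fs + vw + fw) / 4
```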
Disentangled representation learning
Disentangled Representation Learning (DRL) aims to learn a model capable of identifying
and disentangling the underlying factors hidden in the observable data in representation …
Simple but effective: CLIP embeddings for embodied AI
Contrastive language-image pretraining (CLIP) encoders have been shown to be beneficial
for a range of visual tasks from classification and detection to captioning and image …
X-Pool: Cross-modal language-video attention for text-video retrieval
In text-video retrieval, the objective is to learn a cross-modal similarity function between a
text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However …
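The titular cross-modal language-video attention can be sketched as pooling frame embeddings conditioned on the text query, so the video representation emphasizes frames relevant to that query. The single-head module below is illustrative; names and dimensions are assumptions rather than the paper's exact architecture.

```python
# Single-head sketch of text-conditioned attention pooling: the text query
# attends over per-frame embeddings, yielding a text-specific video embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # project the text embedding to a query
        self.k_proj = nn.Linear(dim, dim)   # project frame embeddings to keys
        self.v_proj = nn.Linear(dim, dim)   # project frame embeddings to values
        self.scale = dim ** -0.5

    def forward(self, text_emb, frame_embs):
        """text_emb: (B, D); frame_embs: (B, F, D) -> (B, D)."""
        q = self.q_proj(text_emb).unsqueeze(1)                        # (B, 1, D)
        k, v = self.k_proj(frame_embs), self.v_proj(frame_embs)      # (B, F, D)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, 1, F)
        return F.normalize((attn @ v).squeeze(1), dim=-1)            # text-aware pooling
```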
Bridging video-text retrieval with multiple choice questions
Pre-training a model to learn transferable video-text representations for retrieval has attracted
a lot of attention in recent years. Previous dominant works mainly adopt two separate …
CLIP2Video: Mastering video-text retrieval via image CLIP
We present the CLIP2Video network to transfer the image-language pre-training model to video-
text retrieval in an end-to-end manner. Leading approaches in the domain of video-and …