Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
VLP: A survey on vision-language pre-training
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …
ImageBind: One embedding space to bind them all
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
Unmasked teacher: Towards training-efficient video foundation models
Abstract Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation Models …
Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
Socratic models: Composing zero-shot multimodal reasoning with language
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the
domain of data they are trained on. While these domains are generic, they may only barely …
X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval
Video-text retrieval has been a crucial and fundamental task in multi-modal research. The
development of video-text retrieval has been considerably promoted by large-scale multi …
Prompting visual-language models for efficient video understanding
Image-based visual-language (I-VL) pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, revealing remarkable ability for …
MotionCLIP: Exposing human motion generation to CLIP space
We introduce MotionCLIP, a 3D human motion auto-encoder featuring a latent embedding
that is disentangled, well behaved, and supports highly semantic textual descriptions …
CRIS: CLIP-driven referring image segmentation
Referring image segmentation aims to segment a referent via a natural linguistic expression.
Due to the distinct data properties between text and image, it is challenging for a network to …