Multimodal intelligence: Representation learning, information fusion, and applications
Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …
Image-text retrieval: A survey on recent research and development
In the past few years, cross-modal image-text retrieval (ITR) has experienced increased
interest in the research community due to its excellent research value and broad real-world …
Recognize anything: A strong image tagging model
Abstract We present the Recognize Anything Model (RAM): a strong foundation model for
image tagging. RAM makes a substantial step for foundation models in computer vision …
Medclip: Contrastive learning from unpaired medical images and text
Existing vision-text contrastive learning like CLIP aims to match the paired image and
caption embeddings while pushing others apart, which improves representation …
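The CLIP-style objective the MedCLIP snippet refers to — matching paired image and caption embeddings while pushing unpaired ones apart — can be sketched as a symmetric contrastive loss. This is a minimal NumPy illustration of the general technique, not MedCLIP's actual implementation; the function name and the temperature value are illustrative assumptions:

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss: matched image-text pairs (the diagonal
    of the similarity matrix) are pulled together; all other pairings in
    the batch are pushed apart."""
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N) similarity matrix
    labels = np.arange(len(logits))           # diagonal entries are the true pairs

    def xent(l):
        # cross-entropy of each row against its diagonal label
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
# correctly paired embeddings should score a lower loss than shuffled pairs
aligned = clip_style_loss(emb, emb)
shuffled = clip_style_loss(emb, emb[::-1])
```

MedCLIP's contribution in the snippet above is precisely that it relaxes the strict pairing assumption this loss encodes, so that unpaired medical images and texts can still supervise each other.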
Training-free structured diffusion guidance for compositional text-to-image synthesis
Large-scale diffusion models have achieved state-of-the-art results on text-to-image
synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we …
Crepe: Can vision-language foundation models reason compositionally?
A fundamental characteristic common to both human vision and natural language is their
compositional nature. Yet, despite the performance gains contributed by large vision and …
Ernie-vil: Knowledge enhanced vision-language representations through scene graphs
We propose a knowledge-enhanced approach, ERNIE-ViL, which incorporates structured
knowledge obtained from scene graphs to learn joint representations of vision-language …
Learning the best pooling strategy for visual semantic embedding
Abstract Visual Semantic Embedding (VSE) is a dominant approach for vision-language
retrieval, which aims at learning a deep embedding space such that visual data are …
Taco: Token-aware cascade contrastive learning for video-text alignment
Contrastive learning has been widely used to train transformer-based vision-language
models for video-text alignment and multi-modal representation learning. This paper …
Fine-grained video-text retrieval with hierarchical graph reasoning
Cross-modal retrieval between videos and texts has attracted growing attention due to the
rapid emergence of videos on the web. The current dominant approach is to learn a joint …