Multimodal intelligence: Representation learning, information fusion, and applications

C Zhang, Z Yang, X He, L Deng - IEEE Journal of Selected …, 2020 - ieeexplore.ieee.org
Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …

Image-text retrieval: A survey on recent research and development

M Cao, S Li, J Li, L Nie, M Zhang - arXiv preprint arXiv:2203.14713, 2022 - arxiv.org
In the past few years, cross-modal image-text retrieval (ITR) has experienced increased
interest in the research community due to its excellent research value and broad real-world …

Recognize anything: A strong image tagging model

Y Zhang, X Huang, J Ma, Z Li, Z Luo… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present the Recognize Anything Model (RAM): a strong foundation model for
image tagging. RAM makes a substantial step for foundation models in computer vision …

MedCLIP: Contrastive learning from unpaired medical images and text

Z Wang, Z Wu, D Agarwal, J Sun - arXiv preprint arXiv:2210.10163, 2022 - arxiv.org
Existing vision-text contrastive learning methods such as CLIP aim to match paired image and
caption embeddings while pushing others apart, which improves representation …
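
The snippet above already names the core mechanism MedCLIP builds on: a CLIP-style objective that pulls paired image and caption embeddings together and pushes in-batch mismatches apart. As a rough illustration only (the function and hyperparameter names below are our assumptions, not code from either paper), a symmetric batch-wise contrastive loss can be sketched in PyTorch:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch: the i-th image and i-th caption
    form the positive pair; every other in-batch pairing is a negative."""
    image_emb = F.normalize(image_emb, dim=-1)  # work in cosine-similarity space
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

As the title indicates, MedCLIP's contribution is to relax the strict one-to-one pairing this loss assumes, so that unpaired medical images and reports can also be used for training.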

Training-free structured diffusion guidance for compositional text-to-image synthesis

W Feng, X He, TJ Fu, V Jampani, A Akula… - arXiv preprint arXiv …, 2022 - arxiv.org
Large-scale diffusion models have achieved state-of-the-art results on text-to-image
synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we …

CREPE: Can vision-language foundation models reason compositionally?

Z Ma, J Hong, MO Gul, M Gandhi… - Proceedings of the …, 2023 - openaccess.thecvf.com
A fundamental characteristic common to both human vision and natural language is their
compositional nature. Yet, despite the performance gains contributed by large vision and …

ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs

F Yu, J Tang, W Yin, Y Sun, H Tian, H Wu… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
We propose a knowledge-enhanced approach, ERNIE-ViL, which incorporates structured
knowledge obtained from scene graphs to learn joint representations of vision-language …
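
For context, a scene graph represents a caption's objects, their attributes, and their pairwise relations as structured triplets, and ERNIE-ViL builds masked prediction tasks over these elements. The toy sketch below (the data layout and helper function are our illustrative assumptions, not the paper's code) shows how the semantically important tokens could be collected as masking targets:

```python
# Toy scene graph for the caption "a brown dog chases a white cat".
scene_graph = {
    "objects": ["dog", "cat"],
    "attributes": [("dog", "brown"), ("cat", "white")],
    "relations": [("dog", "chases", "cat")],
}

def masking_targets(graph):
    """Gather the tokens a scene-graph prediction task would mask:
    object nouns, attribute words, and relation predicates."""
    targets = set(graph["objects"])
    targets |= {attr for _, attr in graph["attributes"]}
    targets |= {rel for _, rel, _ in graph["relations"]}
    return targets

print(masking_targets(scene_graph))
# -> {'dog', 'cat', 'brown', 'white', 'chases'}
```

Masking these tokens rather than random ones pushes the model to recover detailed semantics from the paired image instead of from surface language statistics.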

Learning the best pooling strategy for visual semantic embedding

J Chen, H Hu, H Wu, Y Jiang… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Visual Semantic Embedding (VSE) is a dominant approach for vision-language
retrieval, which aims at learning a deep embedding space such that visual data are …
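
VSE methods aggregate a variable-length set of region or word features into a single vector before comparing modalities; the cited paper's point is that this pooling strategy is worth learning rather than hand-picking. A minimal sketch of two fixed baseline poolings (the names, shapes, and toy usage are our assumptions) might look like:

```python
import torch

def pool_features(features: torch.Tensor, strategy: str = "mean") -> torch.Tensor:
    """Aggregate features of shape (n_items, dim) into one embedding vector."""
    if strategy == "mean":
        return features.mean(dim=0)
    if strategy == "max":
        return features.max(dim=0).values
    raise ValueError(f"unknown pooling strategy: {strategy}")

# Toy usage: pool 36 image-region features and 12 word features, then compare.
img_vec = pool_features(torch.randn(36, 512), "max")
txt_vec = pool_features(torch.randn(12, 512), "mean")
similarity = torch.nn.functional.cosine_similarity(img_vec, txt_vec, dim=0)
```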

TACo: Token-aware cascade contrastive learning for video-text alignment

J Yang, Y Bisk, J Gao - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
Contrastive learning has been widely used to train transformer-based vision-language
models for video-text alignment and multi-modal representation learning. This paper …

Fine-grained video-text retrieval with hierarchical graph reasoning

S Chen, Y Zhao, Q Jin, Q Wu - Proceedings of the IEEE/CVF …, 2020 - openaccess.thecvf.com
Cross-modal retrieval between videos and texts has attracted growing attention due to the
rapid emergence of videos on the web. The current dominant approach is to learn a joint …