Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Image-text retrieval: A survey on recent research and development
In the past few years, cross-modal image-text retrieval (ITR) has experienced increased
interest in the research community due to its excellent research value and broad real-world …
GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition
In recent years, the growing number of medical imaging studies is placing an ever-
increasing burden on radiologists. Deep learning provides a promising solution for …
Negative-aware attention framework for image-text matching
Image-text matching, as a fundamental task, bridges the gap between vision and language.
The key of this task is to accurately measure similarity between these two modalities. Prior …
Multi-granularity cross-modal alignment for generalized medical visual representation learning
Learning medical visual representations directly from paired radiology reports has become
an emerging topic in representation learning. However, existing medical image-text joint …
Towards artificial general intelligence via a multimodal foundation model
The fundamental goal of artificial intelligence (AI) is to mimic the core cognitive activities of
human. Despite tremendous success in the AI research, most of existing methods have only …
Learning with noisy correspondence for cross-modal matching
Cross-modal matching, which aims to establish the correspondence between two different
modalities, is fundamental to a variety of tasks such as cross-modal retrieval and vision-and …
Fine-grained image-text matching by cross-modal hard aligning network
Z Pan, F Wu, B Zhang - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
Current state-of-the-art image-text matching methods implicitly align the visual-semantic
fragments, like regions in images and words in sentences, and adopt cross-attention …
Learning semantic relationship among instances for image-text matching
Image-text matching, a bridge connecting image and language, is an important task, which
generally learns a holistic cross-modal embedding to achieve a high-quality semantic …
ViSTA: Vision and scene text aggregation for cross-modal retrieval
Visual appearance is considered to be the most important cue to understand images for
cross-modal retrieval, while sometimes the scene text appearing in images can provide …