Vlp: A survey on vision-language pre-training
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …
such as computer vision (CV) and natural language processing (NLP) to a new era …
Image-text retrieval: A survey on recent research and development
In the past few years, cross-modal image-text retrieval (ITR) has experienced increased
interest in the research community due to its excellent research value and broad real-world …
interest in the research community due to its excellent research value and broad real-world …
Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition
In recent years, the growing number of medical imaging studies is placing an ever-
increasing burden on radiologists. Deep learning provides a promising solution for …
increasing burden on radiologists. Deep learning provides a promising solution for …
Negative-aware attention framework for image-text matching
Image-text matching, as a fundamental task, bridges the gap between vision and language.
The key of this task is to accurately measure similarity between these two modalities. Prior …
The key of this task is to accurately measure similarity between these two modalities. Prior …
Towards artificial general intelligence via a multimodal foundation model
The fundamental goal of artificial intelligence (AI) is to mimic the core cognitive activities of
human. Despite tremendous success in the AI research, most of existing methods have only …
human. Despite tremendous success in the AI research, most of existing methods have only …
Similarity reasoning and filtration for image-text matching
Image-text matching plays a critical role in bridging the vision and language, and great
progress has been made by exploiting the global alignment between image and sentence …
progress has been made by exploiting the global alignment between image and sentence …
Dual-level representation enhancement on characteristic and context for image-text retrieval
Image-text retrieval is a fundamental and vital task in multi-media retrieval and has received
growing attention since it connects heterogeneous data. Previous methods that perform well …
growing attention since it connects heterogeneous data. Previous methods that perform well …
Learning the best pooling strategy for visual semantic embedding
Abstract Visual Semantic Embedding (VSE) is a dominant approach for vision-language
retrieval, which aims at learning a deep embedding space such that visual data are …
retrieval, which aims at learning a deep embedding space such that visual data are …
Region-object relation-aware dense captioning via transformer
Dense captioning provides detailed captions of complex visual scenes. While a number of
successes have been achieved in recent years, there are still two broad limitations: 1) most …
successes have been achieved in recent years, there are still two broad limitations: 1) most …
Dynamic modality interaction modeling for image-text retrieval
Image-text retrieval is a fundamental and crucial branch in information retrieval. Although
much progress has been made in bridging vision and language, it remains challenging …
much progress has been made in bridging vision and language, it remains challenging …