Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

A review of recurrent neural networks: LSTM cells and network architectures

Y Yu, X Si, C Hu, J Zhang - Neural computation, 2019 - direct.mit.edu
Recurrent neural networks (RNNs) have been widely adopted in research areas concerned
with sequential data, such as text, audio, and video. However, RNNs consisting of sigma …

Deep hierarchical semantic segmentation

L Li, T Zhou, W Wang, J Li… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Humans are able to recognize structured relations in observation, allowing us to decompose
complex scenes into simpler parts and abstract the visual world in multiple levels. However …

Visual semantic reasoning for image-text matching

K Li, Y Zhang, K Li, Y Li, Y Fu - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Image-text matching has been a hot research topic bridging the vision and language areas.
It remains challenging because the current representation of image usually lacks global …

IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval

H Chen, G Ding, X Liu, Z Lin, J Liu… - Proceedings of the …, 2020 - openaccess.thecvf.com
Enabling bi-directional retrieval of images and texts is important for understanding the
correspondence between vision and language. Existing methods leverage the attention …

Hierarchical deep click feature prediction for fine-grained image recognition

J Yu, M Tan, H Zhang, Y Rui… - IEEE transactions on …, 2019 - ieeexplore.ieee.org
The click feature of an image, defined as the user click frequency vector of the image on a
predefined word vocabulary, is known to effectively reduce the semantic gap for fine-grained …

Stacked cross attention for image-text matching

KH Lee, X Chen, G Hua, H Hu… - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we study the problem of image-text matching. Inferring the latent semantic
alignment between objects or other salient stuff (e.g. snow, sky, lawn) and the corresponding …

Context-aware attention network for image-text retrieval

Q Zhang, Z Lei, Z Zhang, SZ Li - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
As a typical cross-modal problem, image-text bi-directional retrieval relies heavily on the
joint embedding learning and similarity measure for each image-text pair. It remains …

Semantically self-aligned network for text-to-image part-aware person re-identification

Z Ding, C Ding, Z Shao, D Tao - arXiv preprint arXiv:2107.12666, 2021 - arxiv.org
Text-to-image person re-identification (ReID) aims to search for images containing a person
of interest using textual descriptions. However, due to the significant modality gap and the …

CAMP: Cross-modal adaptive message passing for text-image retrieval

Z Wang, X Liu, H Li, L Sheng, J Yan… - Proceedings of the …, 2019 - openaccess.thecvf.com
Text-image cross-modal retrieval is a challenging task in the field of language and vision.
Most previous approaches independently embed images and sentences into a joint …