Cross-Modal Retrieval: A Review of Methodologies, Datasets, and Future Perspectives

Z Han, A Azman, MR Mustaffa, FB Khalid - IEEE Access, 2024 - ieeexplore.ieee.org
With the rapid development of science and technology, all types of mixed media contain
large amounts of data. Traditional single multimedia data can no longer satisfy daily …

FashionSAP: Symbols and attributes prompt for fine-grained fashion vision-language pre-training

Y Han, L Zhang, Q Chen, Z Chen, Z Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Fashion vision-language pre-training models have shown efficacy for a wide range of
downstream tasks. However, general vision-language pre-training models pay less attention …

Question-conditioned debiasing with focal visual context fusion for visual question answering

J Liu, GX Wang, CF Fan, F Zhou, HJ Xu - Knowledge-Based Systems, 2023 - Elsevier
Existing Visual Question Answering models suffer from the language prior, where
the answers provided by the models overly rely on the correlations between questions and …

All in one: Exploring unified vision-language tracking with multi-modal alignment

C Zhang, X Sun, Y Yang, L Liu, Q Liu, X Zhou… - Proceedings of the 31st …, 2023 - dl.acm.org
Current mainstream vision-language (VL) tracking framework consists of three parts, i.e., a
visual feature extractor, a language feature extractor, and a fusion model. To pursue better …

Deep supervised dual cycle adversarial network for cross-modal retrieval

L Liao, M Yang, B Zhang - … on Circuits and Systems for Video …, 2022 - ieeexplore.ieee.org
Cross-modal retrieval tasks, which are more natural and challenging than traditional
retrieval tasks, have attracted increasing interest from researchers in recent years. Although …

Contrastive label correlation enhanced unified hashing encoder for cross-modal retrieval

H Wu, L Zhang, Q Chen, Y Deng, J Siebert… - Proceedings of the 31st …, 2022 - dl.acm.org
Cross-modal hashing (CMH) has been widely used in multimedia retrieval applications for
its low storage cost and fast indexing speed. Thanks to the success of deep learning, cross …

Partial visual-semantic embedding: Fine-grained outfit image representation with massive volumes of tags via angular-based contrastive learning

R Shimizu, T Nakamura, M Goto - Knowledge-Based Systems, 2023 - Elsevier
A novel technology named fashion intelligence system, which quantifies ambiguous
expressions unique to fashion, such as “casual,” “adult-casual,” and “office-casual,” was …

MiC: Image-text Matching in Circles with cross-modal generative knowledge enhancement

X Pu, Y Chen, L Yuan, Y Zhang, H Li, L Jing… - Knowledge-Based …, 2024 - Elsevier
Image-text matching is a challenging task due to vast discrepancies between the visual and
textual modalities. Existing solutions tend to focus on a limited set of strongly aligned or …

Multimodal Distillation Pre-training Model for Ultrasound Dynamic Images Annotation

X Chen, J Ke, Y Zhang, J Gou, A Shen… - IEEE Journal of …, 2024 - ieeexplore.ieee.org
With the development of medical technology, ultrasonography has become an important
diagnostic method in doctors' clinical work. However, compared with the static medical …

Collaborative group: Composed image retrieval via consensus learning from noisy annotations

X Zhang, Z Zheng, L Zhu, Y Yang - Knowledge-Based Systems, 2024 - Elsevier
Composed image retrieval extends content-based image retrieval systems by enabling
users to search using reference images and captions that describe their intention. Despite …