Cross-modal retrieval: a systematic review of methods and future directions

T Wang, F Li, L Zhu, J Li, Z Zhang… - Proceedings of the …, 2025 - ieeexplore.ieee.org
With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …

Stacked hybrid-attention and group collaborative learning for unbiased scene graph generation

X Dong, T Gan, X Song, J Wu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Scene Graph Generation, which generally follows a regular encoder-decoder
pipeline, aims to first encode the visual contents within the given image and then parse them …
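
The snippet notes that scene graph generation typically follows an encode-then-parse pipeline: visual contents are first encoded, then parsed into objects and relations. As a rough illustration only, here is a minimal PyTorch skeleton of such a pipeline, assuming detector-provided object features; the dimensions, layer counts, and heads are hypothetical and do not reproduce the paper's stacked hybrid-attention or group collaborative learning.

import torch
import torch.nn as nn

class SceneGraphEncoderDecoder(nn.Module):
    # Minimal encode-then-parse sketch (hypothetical dimensions and heads).
    def __init__(self, feat_dim=256, num_obj_classes=150, num_rel_classes=50):
        super().__init__()
        # Encoder: contextualize per-object visual features.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Decoders: classify each object and the predicate of each object pair.
        self.obj_head = nn.Linear(feat_dim, num_obj_classes)
        self.rel_head = nn.Linear(2 * feat_dim, num_rel_classes)

    def forward(self, obj_feats):
        # obj_feats: (batch, num_objects, feat_dim) from an off-the-shelf detector.
        ctx = self.encoder(obj_feats)
        obj_logits = self.obj_head(ctx)
        # Score every ordered object pair as a candidate (subject, predicate, object) triplet.
        subj = ctx.unsqueeze(2).expand(-1, -1, ctx.size(1), -1)
        obj = ctx.unsqueeze(1).expand(-1, ctx.size(1), -1, -1)
        rel_logits = self.rel_head(torch.cat([subj, obj], dim=-1))
        return obj_logits, rel_logits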

Token shift transformer for video classification

H Zhang, Y Hao, CW Ngo - Proceedings of the 29th ACM International …, 2021 - dl.acm.org
Transformers achieve remarkable success in understanding 1- and 2-dimensional signals
(e.g., NLP and image content understanding). As a potential alternative to convolutional …
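
As the title suggests, the core idea is a token shift operation for modeling time. The sketch below, in PyTorch, shifts a fraction of token channels one step forward and backward along the temporal axis at zero parameter cost; the tensor layout, shift ratio, and placement inside the transformer block are assumptions for illustration, not the paper's exact TokShift module.

import torch

def temporal_token_shift(x, shift_ratio=0.25):
    # x: (batch, time, tokens, channels). Shift a fraction of channels
    # one step forward/backward in time; leave the rest untouched.
    b, t, n, c = x.shape
    k = int(c * shift_ratio) // 2                     # channels per direction
    out = torch.zeros_like(x)
    out[:, 1:, :, :k] = x[:, :-1, :, :k]              # shift forward in time
    out[:, :-1, :, k:2 * k] = x[:, 1:, :, k:2 * k]    # shift backward in time
    out[:, :, :, 2 * k:] = x[:, :, :, 2 * k:]         # unshifted channels
    return out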

Dual learning with dynamic knowledge distillation for partially relevant video retrieval

J Dong, M Zhang, Z Zhang, X Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Almost all previous text-to-video retrieval works assume that videos are pre-trimmed to
short durations. However, in practice, videos are generally untrimmed, containing much …

Personalized fashion compatibility modeling via metapath-guided heterogeneous graph learning

W Guan, F Jiao, X Song, H Wen, CH Yeh… - Proceedings of the 45th …, 2022 - dl.acm.org
Fashion Compatibility Modeling (FCM) is a new yet challenging task, which aims to
automatically assess the matching degree among a set of complementary items. Most of …

Reading-strategy inspired visual representation learning for text-to-video retrieval

J Dong, Y Wang, X Chen, X Qu, X Li… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
This paper addresses the task of text-to-video retrieval: given a query in the form of a
natural-language sentence, the goal is to retrieve videos that are semantically relevant to …
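
In joint-embedding approaches to text-to-video retrieval, the query sentence and each candidate video are mapped into a shared space and ranked by similarity. The snippet below is a generic cosine-similarity ranking sketch in PyTorch, assuming the embeddings have already been computed; it is not the paper's reading-strategy model.

import torch
import torch.nn.functional as F

def rank_videos(text_emb, video_embs):
    # text_emb: (dim,) query embedding; video_embs: (num_videos, dim).
    text_emb = F.normalize(text_emb, dim=-1)
    video_embs = F.normalize(video_embs, dim=-1)
    scores = video_embs @ text_emb                    # cosine similarities
    order = torch.argsort(scores, descending=True)    # best match first
    return order, scores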

Partially relevant video retrieval

J Dong, X Chen, M Zhang, X Yang, S Chen… - Proceedings of the 30th …, 2022 - dl.acm.org
Current methods for text-to-video retrieval (T2VR) are trained and tested on
video-captioning-oriented datasets such as MSVD, MSR-VTT, and VATEX. A key property of these datasets is …

Scene graph refinement network for visual question answering

T Qian, J Chen, S Chen, B Wu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Visual Question Answering aims to answer free-form natural-language questions based
on the visual clues in a given image. It is a difficult problem, as it requires understanding the …

Hierarchical local-global transformer for temporal sentence grounding

X Fang, D Liu, P Zhou, Z Xu, R Li - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
This article studies the multimedia problem of temporal sentence grounding (TSG), which
aims to accurately determine the specific video segment in an untrimmed video according to …

More: Multi-order relation mining for dense captioning in 3d scenes

Y Jiao, S Chen, Z Jie, J Chen, L Ma… - European Conference on …, 2022 - Springer
3D dense captioning is a recently proposed task, where point clouds contain
more geometric information than their 2D counterparts. However, it is also more challenging …