Cross-modal retrieval: a systematic review of methods and future directions
With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …
Stacked hybrid-attention and group collaborative learning for unbiased scene graph generation
Scene Graph Generation, which generally follows a regular encoder-decoder
pipeline, aims to first encode the visual contents within the given image and then parse them …
Token shift transformer for video classification
Transformer achieves remarkable successes in understanding 1 and 2-dimensional signals
(e.g., NLP and Image Content Understanding). As a potential alternative to convolutional …
Dual learning with dynamic knowledge distillation for partially relevant video retrieval
J Dong, M Zhang, Z Zhang, X Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Almost all previous text-to-video retrieval works assume that videos are pre-trimmed with
short durations. However, in practice, videos are generally untrimmed containing much …
Personalized fashion compatibility modeling via metapath-guided heterogeneous graph learning
Fashion Compatibility Modeling (FCM) is a new yet challenging task, which aims to
automatically assess the matching degree among a set of complementary items. Most of …
Reading-strategy inspired visual representation learning for text-to-video retrieval
This paper aims for the task of text-to-video retrieval, where given a query in the form of a
natural-language sentence, it is asked to retrieve videos which are semantically relevant to …
Partially relevant video retrieval
Current methods for text-to-video retrieval (T2VR) are trained and tested on video-captioning
oriented datasets such as MSVD, MSR-VTT and VATEX. A key property of these datasets is …
Scene graph refinement network for visual question answering
Visual Question Answering aims to answer the free-form natural language question based
on the visual clues in a given image. It is a difficult problem as it requires understanding the …
Hierarchical local-global transformer for temporal sentence grounding
This article studies the multimedia problem of temporal sentence grounding (TSG), which
aims to accurately determine the specific video segment in an untrimmed video according to …
More: Multi-order relation mining for dense captioning in 3d scenes
3D dense captioning is a recently-proposed novel task, where point clouds contain
more geometric information than the 2D counterpart. However, it is also more challenging …