Cross-modal retrieval: a systematic review of methods and future directions
With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …
Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …
Cross modal retrieval with querybank normalisation
Profiting from large-scale training datasets, advances in neural architecture design and
efficient inference, joint embeddings have become the dominant approach for tackling cross …
Text-video retrieval with disentangled conceptualization and set-to-set alignment
Text-video retrieval is a challenging cross-modal task, which aims to align visual entities with
natural language descriptions. Current methods either fail to leverage the local details or are …
Dual alignment unsupervised domain adaptation for video-text retrieval
Video-text retrieval is an emerging stream in both computer vision and natural language
processing communities, which aims to find relevant videos given text queries. In this paper …
OMGH: Online manifold-guided hashing for flexible cross-modal retrieval
Cross-modal hashing has recently gained an increasing attention for its efficiency and fast
retrieval speed in indexing the multimedia data across different modalities. Nevertheless, the …
Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer
While Transformers have rapidly gained popularity in various computer vision applications,
post-hoc explanations of their internal mechanisms remain largely unexplored. Vision …
Prototype-guided knowledge transfer for federated unsupervised cross-modal hashing
Although deep cross-modal hashing methods have shown superiorities for cross-modal
retrieval recently, there is a concern about potential data privacy leakage when training the …
Noise is also useful: Negative correlation-steered latent contrastive learning
How to effectively handle label noise has been one of the most practical but challenging
tasks in Deep Neural Networks (DNNs). Recent popular methods for training DNNs with …
Semantic collaborative learning for cross-modal moment localization
Localizing a desired moment within an untrimmed video via a given natural language query,
i.e., cross-modal moment localization, has attracted widespread research attention recently …