- Academic Search

VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer

In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Cited by 220 · All 10 versions

Cross-modal retrieval: a systematic review of methods and future directions

T Wang, F Li, L Zhu, J Li, Z Zhang… - Proceedings of the …, 2025 - ieeexplore.ieee.org

With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …

Cited by 24 · All 3 versions

Socratic models: Composing zero-shot multimodal reasoning with language

A Zeng, M Attarian, B Ichter, K Choromanski… - arXiv preprint arXiv …, 2022 - arxiv.org

Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the
domain of data they are trained on. While these domains are generic, they may only barely …

Cited by 526 · All 6 versions

Frozen in time: A joint video and image encoder for end-to-end retrieval

M Bain, A Nagrani, G Varol… - Proceedings of the …, 2021 - openaccess.thecvf.com

Our objective in this work is video-text retrieval, in particular a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …

Cited by 1180 · All 12 versions

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

H Luo, L Ji, M Zhong, Y Chen, W Lei, N Duan, T Li - Neurocomputing, 2022 - Elsevier

Video clip retrieval and captioning tasks play an essential role in multimodal research and
are the fundamental research problem for multimodal understanding and generation. The …

Cited by 570 · All 5 versions

X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval

Y Ma, G Xu, X Sun, M Yan, J Zhang, R Ji - Proceedings of the 30th ACM …, 2022 - dl.acm.org

Video-text retrieval has been a crucial and fundamental task in multi-modal research. The
development of video-text retrieval has been considerably promoted by large-scale multi …

Cited by 269 · All 3 versions

CLIP2Video: Mastering video-text retrieval via image CLIP

H Fang, P Xiong, L Xu, Y Chen - arXiv preprint arXiv:2106.11097, 2021 - arxiv.org

We present CLIP2Video network to transfer the image-language pre-training model to
video-text retrieval in an end-to-end manner. Leading approaches in the domain of video-and …

Cited by 316 · All 2 versions

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval

H Luo, L Ji, M Zhong, Y Chen, W Lei, N Duan… - arXiv preprint arXiv …, 2021 - arxiv.org

Video-text retrieval plays an essential role in multi-modal research and has been widely
used in many real-world web applications. The CLIP (Contrastive Language-Image Pre …

Cited by 333 · All 2 versions

Multi-modal transformer for video retrieval

V Gabeur, C Sun, K Alahari, C Schmid - … 28, 2020, Proceedings, Part IV 16, 2020 - Springer

The task of retrieving video content relevant to natural language queries plays a critical role
in effectively handling internet-scale datasets. Most of the existing methods for this caption-to …

Cited by 746 · All 13 versions

Self-supervised multimodal versatile networks

JB Alayrac, A Recasens, R Schneider… - Advances in neural …, 2020 - proceedings.neurips.cc

Videos are a rich source of multi-modal supervision. In this work, we learn representations
using self-supervision by leveraging three modalities naturally present in videos: visual …

Cited by 442 · All 5 versions

Learning joint embedding with multimodal cues for cross-modal video-text retrieval

Vlp: A survey on vision-language pre-training