Youku-mPLUG: A 10 Million Large-Scale Chinese Video-Language Dataset for Pre-training and Benchmarks
To promote the development of Vision-Language Pre-training (VLP) and multimodal Large
Language Model (LLM) in the Chinese community, we first release the largest public …
RTQ: Rethinking Video-language Understanding Based on Image-text Model
Recent advancements in video-language understanding have been established on the
foundation of image-text models, resulting in promising outcomes due to the shared …
ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval
E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain
manner as images, short videos, or live stream promotions. A unified and vectorized cross …
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
We present a Recipe for Effective and Efficient zero-shot video-text Retrieval, dubbed M2-
RAAP. Upon popular image-text models like CLIP, most current adaptation-based video-text …
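To make "adaptation-based" video-text retrieval on top of an image-text model concrete, here is a minimal, generic sketch of the common CLIP baseline such recipes start from (not the M2-RAAP method itself): uniformly sample frames, encode them with CLIP's image encoder, mean-pool the frame embeddings into a video embedding, and score queries by cosine similarity. The checkpoint name, frame count, and frame-sampling helper are illustrative assumptions.

```python
import cv2
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample frames from a video file and return them as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

# Illustrative checkpoint; any CLIP-style image-text model could be swapped in.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def video_text_similarity(video_path: str, texts: list) -> torch.Tensor:
    """Zero-shot retrieval scores: cosine similarity between a pooled video embedding and each text."""
    frames = sample_frames(video_path)
    inputs = processor(text=texts, images=frames, return_tensors="pt", padding=True)
    frame_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    video_emb = frame_emb.mean(dim=0, keepdim=True)                 # mean-pool frames -> video embedding
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return video_emb @ text_emb.T                                   # shape (1, num_texts)
```

Adaptation-based recipes in this line of work typically improve on this baseline by changing how frames are selected, how frame features are aggregated, and how the text side is enriched, while keeping the frozen or lightly tuned image-text backbone.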
Temporal Sentence Grounding in Streaming Videos
This paper aims to tackle a novel task, Temporal Sentence Grounding in Streaming Videos
(TSGSV). The goal of TSGSV is to evaluate the relevance between a video stream and a …
A Chinese Multimodal Social Video Dataset for Controversy Detection
Social video platforms have emerged as significant channels for information dissemination,
facilitating lively public discussions that often give rise to controversies. However, existing …
CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios
X Qiao, X Li, X Qu, J Zhang, Y Liu, Y Luo, C … - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models pre-trained on large-scale image-text datasets have shown
superior performance in downstream tasks such as image retrieval. Most of the images for …
The Devil is in the Word: Video-Conditioned Text Representation Refinement for Text-to-Video Retrieval
J Guo, F Wei, J Ma, C Xu - openreview.net
Pre-trained vision-language models (VLMs), such as CLIP, have shown remarkable success
in the text-video retrieval task due to their strong vision-language representations learned …