Youku-mPLUG: A 10 Million Large-Scale Chinese Video-Language Dataset for Pre-training and Benchmarks

H Xu, Q Ye, X Wu, M Yan, Y Miao, J Ye, G Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
To promote the development of Vision-Language Pre-training (VLP) and multimodal Large
Language Models (LLMs) in the Chinese community, we first release the largest public …

RTQ: Rethinking Video-language Understanding Based on Image-text Model

X Wang, Y Li, T Gan, Z Zhang, J Lv, L Nie - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Recent advancements in video-language understanding have been established on the
foundation of image-text models, resulting in promising outcomes due to the shared …

ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval

R Zhao, J Jia, Y Li, X Bai, Q Chen, H Li, P Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain
manner as images, short videos, or live stream promotions. A unified and vectorized cross …

M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval

X Dong, Z Feng, C Zhou, X Yu, M Yang… - Proceedings of the 47th …, 2024 - dl.acm.org
We present a Recipe for Effective and Efficient zero-shot video-text Retrieval, dubbed
M2-RAAP. Building upon popular image-text models like CLIP, most current adaptation-based video-text …

Temporal Sentence Grounding in Streaming Videos

T Gan, X Wang, Y Sun, J Wu, Q Guo, L Nie - Proceedings of the 31st …, 2023 - dl.acm.org
This paper aims to tackle a novel task: Temporal Sentence Grounding in Streaming Videos
(TSGSV). The goal of TSGSV is to evaluate the relevance between a video stream and a …

A Chinese Multimodal Social Video Dataset for Controversy Detection

T Xu, A Chen, Y Zhao, J Gao, T Gan - Proceedings of the 32nd ACM …, 2024 - dl.acm.org
Social video platforms have emerged as significant channels for information dissemination,
facilitating lively public discussions that often give rise to controversies. However, existing …

CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios

X Qiao, X Li, X Qu, J Zhang, Y Liu, Y Luo, C… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models pre-trained on large-scale image-text datasets have shown
superior performance in downstream tasks such as image retrieval. Most of the images for …

The Devil is in the Word: Video-Conditioned Text Representation Refinement for Text-to-Video Retrieval

J Guo, F Wei, J Ma, C Xu - openreview.net
Pre-trained vision-language models (VLMs), such as CLIP, have shown remarkable success
in the text-video retrieval task due to their strong vision-language representations learned …