Youku-mPLUG: A 10 Million Large-Scale Chinese Video-Language Dataset for Pre-training and Benchmarks
To promote the development of Vision-Language Pre-training (VLP) and multimodal Large
Language Model (LLM) in the Chinese community, we first release the largest public …
RTQ: Rethinking Video-language Understanding Based on Image-text Model
Recent advancements in video-language understanding have been established on the
foundation of image-text models, resulting in promising outcomes due to the shared …
ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval
E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain
manner as images, short videos, or live stream promotions. A unified and vectorized cross …
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
We present a Recipe for Effective and Efficient zero-shot video-text Retrieval, dubbed M2-
RAAP. Upon popular image-text models like CLIP, most current adaptation-based video-text …
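To make "adaptation-based" video-text retrieval on top of an image-text model concrete, here is a minimal, generic sketch of the common CLIP baseline such recipes start from (not the M2-RAAP method itself): uniformly sample frames, encode them with CLIP's image encoder, mean-pool the frame embeddings into a video embedding, and score queries by cosine similarity. The checkpoint name, frame count, and frame-sampling helper are illustrative assumptions.

```python
import cv2
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample frames from a video file and return them as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

# Illustrative checkpoint; any CLIP-style image-text model could be swapped in.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def video_text_similarity(video_path: str, texts: list) -> torch.Tensor:
    """Zero-shot retrieval scores: cosine similarity between a pooled video embedding and each text."""
    frames = sample_frames(video_path)
    inputs = processor(text=texts, images=frames, return_tensors="pt", padding=True)
    frame_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    video_emb = frame_emb.mean(dim=0, keepdim=True)                 # mean-pool frames -> video embedding
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return video_emb @ text_emb.T                                   # shape (1, num_texts)
```

Adaptation-based recipes in this line of work typically improve on this baseline by changing how frames are selected, how frame features are aggregated, and how the text side is enriched, while keeping the frozen or lightly tuned image-text backbone.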
Temporal Sentence Grounding in Streaming Videos
This paper aims to tackle a novel task, Temporal Sentence Grounding in Streaming Videos
(TSGSV). The goal of TSGSV is to evaluate the relevance between a video stream and a …
A Chinese Multimodal Social Video Dataset for Controversy Detection
Social video platforms have emerged as significant channels for information dissemination,
facilitating lively public discussions that often give rise to controversies. However, existing …
CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios
X Qiao, X Li, X Qu, J Zhang, Y Liu, Y Luo, C … - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models pre-trained on large-scale image-text datasets have shown
superior performance in downstream tasks such as image retrieval. Most of the images for …
The Devil is in the Word: Video-Conditioned Text Representation Refinement for Text-to-Video Retrieval
J Guo, F Wei, J Ma, C Xu - openreview.net
Pre-trained vision-language models (VLMs), such as CLIP, have shown remarkable success
in the text-video retrieval task due to their strong vision-language representations learned …