Retrieval-augmented generation for AI-generated content: A survey

P Zhao, H Zhang, Q Yu, Z Wang, Y Geng, F Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by
advancements in model algorithms, scalable foundation model architectures, and the …

LocVTP: Video-text pre-training for temporal localization

M Cao, T Yang, J Weng, C Zhang, J Wang… - European Conference on …, 2022 - Springer
Video-Text Pre-training (VTP) aims to learn transferable representations for various
downstream tasks from large-scale web videos. To date, almost all existing VTP methods …

RAP: Efficient text-video retrieval with sparse-and-correlated adapter

M Cao, H Tang, J Huang, P Jin, C Zhang, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-Video Retrieval (TVR) aims to align relevant video content with natural language
queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning …

Style-aware two-stage learning framework for video captioning

Y Ma, Z Zhu, Y Qi, A Beheshti, Y Li, L Qing… - Knowledge-Based Systems, 2024 - Elsevier
Significant progress has been made in video captioning in recent years. However, most
existing methods directly learn from all given captions without distinguishing the styles of …

MUSE: Mamba is efficient multi-scale learner for text-video retrieval

H Tang, M Cao, J Huang, R Liu, P Jin, G Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-Video Retrieval (TVR) aims to align and associate relevant video content with
corresponding natural language queries. Most existing TVR methods are based on large …

Exploiting auxiliary caption for video grounding

H Li, M Cao, X Cheng, Y Li, Z Zhu, Y Zou - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Video grounding aims to locate a moment of interest matching the given query sentence
from an untrimmed video. Previous works ignore the sparsity dilemma in video …

FinTextQA: A dataset for long-form financial question answering

J Chen, P Zhou, Y Hua, Y Loh, K Chen, Z Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Accurate evaluation of financial question answering (QA) systems necessitates a
comprehensive dataset encompassing diverse question types and contexts. However …

Embracing language inclusivity and diversity in CLIP through continual language learning

B Yang, Y Dai, X Cheng, Y Li, A Raza… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
While vision-language pre-trained models (VL-PTMs) have advanced multimodal research
in recent years, their mastery of a few languages like English restricts their applicability in …

PhysGame: Uncovering physical commonsense violations in gameplay videos

M Cao, H Tang, H Zhao, H Guo, J Liu, G Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in video-based large language models (Video LLMs) have witnessed
the emergence of diverse capabilities to reason and interpret dynamic visual content …

Zero-shot temporal action detection by learning multimodal prompts and text-enhanced actionness

A Raza, B Yang, Y Zou - … on Circuits and Systems for Video …, 2024 - ieeexplore.ieee.org
Zero-shot temporal action detection (ZS-TAD), aiming to recognize and detect new and
unseen video actions, is an emerging and challenging task with limited solutions. Recent …