Retrieval-augmented generation for AI-generated content: A survey

P Zhao, H Zhang, Q Yu, Z Wang, Y Geng, F Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by
advancements in model algorithms, scalable foundation model architectures, and the …

LocVTP: Video-text pre-training for temporal localization

M Cao, T Yang, J Weng, C Zhang, J Wang… - European Conference on …, 2022 - Springer
Video-Text Pre-training (VTP) aims to learn transferable representations for various
downstream tasks from large-scale web videos. To date, almost all existing VTP methods …

RAP: Efficient text-video retrieval with sparse-and-correlated adapter

M Cao, H Tang, J Huang, P Jin, C Zhang, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-Video Retrieval (TVR) aims to align relevant video content with natural language
queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning …

Style-aware two-stage learning framework for video captioning

Y Ma, Z Zhu, Y Qi, A Beheshti, Y Li, L Qing… - Knowledge-Based Systems, 2024 - Elsevier
Significant progress has been made in video captioning in recent years. However, most
existing methods directly learn from all given captions without distinguishing the styles of …

MUSE: Mamba is efficient multi-scale learner for text-video retrieval

H Tang, M Cao, J Huang, R Liu, P Jin, G Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-Video Retrieval (TVR) aims to align and associate relevant video content with
corresponding natural language queries. Most existing TVR methods are based on large …

Exploiting auxiliary caption for video grounding

H Li, M Cao, X Cheng, Y Li, Z Zhu, Y Zou - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Video grounding aims to locate a moment of interest matching the given query sentence
from an untrimmed video. Previous works ignore the sparsity dilemma in video …

FinTextQA: A dataset for long-form financial question answering

J Chen, P Zhou, Y Hua, Y Loh, K Chen, Z Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Accurate evaluation of financial question answering (QA) systems necessitates a
comprehensive dataset encompassing diverse question types and contexts. However …

Embracing language inclusivity and diversity in CLIP through continual language learning

B Yang, Y Dai, X Cheng, Y Li, A Raza… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
While vision-language pre-trained models (VL-PTMs) have advanced multimodal research
in recent years, their mastery of a few languages like English restricts their applicability in …

PhysGame: Uncovering physical commonsense violations in gameplay videos

M Cao, H Tang, H Zhao, H Guo, J Liu, G Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in video-based large language models (Video LLMs) have witnessed
the emergence of diverse capabilities to reason and interpret dynamic visual content …

Zero-shot temporal action detection by learning multimodal prompts and text-enhanced actionness

A Raza, B Yang, Y Zou - … on Circuits and Systems for Video …, 2024 - ieeexplore.ieee.org
Zero-shot temporal action detection (ZS-TAD), aiming to recognize and detect new and
unseen video actions, is an emerging and challenging task with limited solutions. Recent …