Google Scholar

A survey on video diffusion models

Z Xing, Q Feng, H Chen, Q Dai, H Hu, H Xu… - ACM Computing …, 2024 - dl.acm.org

The recent wave of AI-generated content (AIGC) has witnessed substantial success in
computer vision, with the diffusion model playing a crucial role in this achievement. Due to …

Cited by 98 · All 4 versions

Transformer models used for text-based question answering systems

K Nassiri, M Akhloufi - Applied Intelligence, 2023 - Springer

The question answering system is frequently applied in the area of natural language
processing (NLP) because of the wide variety of applications. It consists of answering …

Cited by 100 · All 3 versions

[PDF] arxiv.org

Videomamba: State space model for efficient video understanding

K Li, X Li, Y Wang, Y He, Y Wang, L Wang… - European Conference on …, 2024 - Springer

Addressing the dual challenges of local redundancy and global dependencies in video
understanding, this work innovatively adapts the Mamba to the video domain. The proposed …

Cited by 160 · All 7 versions

[PDF] arxiv.org

Internvid: A large-scale video-text dataset for multimodal understanding and generation

Y Wang, Y He, Y Li, K Li, J Yu, X Ma, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org

This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables
learning powerful and transferable video-text representations for multimodal understanding …

Cited by 234 · All 5 versions

[PDF] thecvf.com

Unmasked teacher: Towards training-efficient video foundation models

K Li, Y Wang, Y Li, Y Wang, Y He… - Proceedings of the …, 2023 - openaccess.thecvf.com

Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation Models …

Cited by 156 · All 6 versions

[PDF] thecvf.com

Learning video representations from large language models

Y Zhao, I Misra, P Krähenbühl… - Proceedings of the …, 2023 - openaccess.thecvf.com

We introduce LAVILA, a new approach to learning video-language representations by
leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …

Cited by 179 · All 8 versions

[PDF] neurips.cc

Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset

S Chen, H Li, Q Wang, Z Zhao… - Advances in Neural …, 2023 - proceedings.neurips.cc

Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …

Cited by 110 · All 6 versions

[PDF] mlr.press

Compositional exemplars for in-context learning

J Ye, Z Wu, J Feng, T Yu… - … Conference on Machine …, 2023 - proceedings.mlr.press

Large pretrained language models (LMs) have shown impressive In-Context Learning (ICL)
ability, where the model learns to do an unseen task simply by conditioning on a prompt …

Cited by 112 · All 8 versions

[PDF] neurips.cc

Zero-shot video question answering via frozen bidirectional language models

A Yang, A Miech, J Sivic, I Laptev… - Advances in Neural …, 2022 - proceedings.neurips.cc

Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of question and answers for videos, however, is tedious …

Cited by 233 · All 11 versions

[PDF] arxiv.org

Prompting visual-language models for efficient video understanding

C Ju, T Han, K Zheng, Y Zhang, W Xie - European Conference on …, 2022 - Springer

Image-based visual-language (I-VL) pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, revealing remarkable ability for …

Cited by 432 · All 7 versions

A survey on video diffusion models