A review of deep learning for video captioning
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that comprises
contributions from domains such as computer vision, natural language processing …
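For orientation, the pipeline such reviews survey is typically an encoder-decoder model that maps frame features to a token sequence. The sketch below is a minimal PyTorch illustration under assumed dimensions (feat_dim, d_model, vocab_size are placeholders), not code from the review itself:

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Minimal encoder-decoder video captioner: frame features -> caption tokens."""
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)      # project per-frame CNN features
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, caption_tokens):
        memory = self.encoder(self.proj(frame_feats))   # (B, T, d_model) video context
        tgt = self.embed(caption_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                        # next-token logits

model = VideoCaptioner()
logits = model(torch.randn(2, 16, 2048), torch.randint(0, 10000, (2, 12)))
```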
Survey: Transformer based video-language pre-training
L Ruan, Q Jin - AI Open, 2022 - Elsevier
Inspired by the success of transformer-based pre-training methods on natural language
tasks and, more recently, on computer vision tasks, researchers have started to apply transformers to …
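As a hedged sketch of the model family this survey covers, the snippet below projects video features and text embeddings into one sequence and applies joint self-attention with a masked-token head, a common video-language pre-training setup; every dimension and name here is an illustrative assumption, not the survey's own code:

```python
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    """Illustrative video-language encoder: joint self-attention over both modalities."""
    def __init__(self, feat_dim=768, d_model=512, vocab_size=30000):
        super().__init__()
        self.video_proj = nn.Linear(feat_dim, d_model)   # video features -> shared space
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mlm_head = nn.Linear(d_model, vocab_size)   # predicts masked text tokens

    def forward(self, video_feats, text_ids):
        v = self.video_proj(video_feats)                 # (B, Tv, d)
        t = self.text_embed(text_ids)                    # (B, Tt, d)
        h = self.encoder(torch.cat([v, t], dim=1))       # joint cross-modal attention
        return self.mlm_head(h[:, v.size(1):])           # logits for text positions only
```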
InternVideo: General video foundation models via generative and discriminative learning
Foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …
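The title's pairing of generative and discriminative learning can be illustrated by combining a masked-reconstruction loss with a video-text contrastive loss; the sketch below is an assumption-laden simplification, not InternVideo's actual training code:

```python
import torch
import torch.nn.functional as F

def combined_pretraining_loss(recon, target, video_emb, text_emb, temperature=0.07):
    """Illustrative mix of a generative (masked reconstruction) objective and a
    discriminative (video-text contrastive) objective."""
    # Generative term: reconstruct masked video patches (MAE-style).
    gen_loss = F.mse_loss(recon, target)
    # Discriminative term: symmetric InfoNCE between paired video/text embeddings.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)   # diagonal = positive pairs
    disc_loss = (F.cross_entropy(logits, labels) +
                 F.cross_entropy(logits.T, labels)) / 2
    return gen_loss + disc_loss
```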
VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset
Vision and text have been fully explored in contemporary video-text foundation models,
while other modalities such as audio and subtitles in videos have not received sufficient …
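A minimal sketch of what an omni-modality encoder could look like, assuming per-modality projections into a shared space followed by joint self-attention; the modality names and dimensions are placeholders, not VAST's implementation:

```python
import torch
import torch.nn as nn

class OmniModalityEncoder(nn.Module):
    """Illustrative omni-modality fusion over vision, audio, subtitle, and text."""
    def __init__(self, dims=None, d_model=512):
        super().__init__()
        dims = dims or {"vision": 1024, "audio": 768, "subtitle": 512, "text": 512}
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, inputs):
        # inputs: dict mapping modality name -> (B, T_m, dim_m) feature tensor
        tokens = torch.cat([self.proj[m](x) for m, x in inputs.items()], dim=1)
        return self.encoder(tokens)                     # fused omni-modal representation
```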
Zero-shot video question answering via frozen bidirectional language models
Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of questions and answers for videos, however, is tedious …
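The core idea of answering zero-shot by querying a frozen masked language model can be sketched as below (text-only and restricted to single-token candidates for brevity; the paper additionally conditions the frozen model on video features through lightweight trained modules):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Frozen masked LM used as a zero-shot answer scorer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()
for p in model.parameters():
    p.requires_grad_(False)                 # keep the language model frozen

def score_answers(question, candidates):
    """Rank candidate answers by the MLM's logit at the [MASK] slot."""
    prompt = f"Question: {question} Answer: {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    ids = [tokenizer.convert_tokens_to_ids(c) for c in candidates]
    return dict(zip(candidates, logits[ids].tolist()))

print(score_answers("what animal is jumping", ["dog", "cat", "horse"]))
```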
Less is more: ClipBERT for video-and-language learning via sparse sampling
The canonical approach to video-and-language learning (e.g., video question answering)
dictates that a neural model learn from offline-extracted dense video features from vision …
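The sparse-sampling idea, drawing only a few short clips per video at each training step instead of consuming dense offline features, might look like this in outline; `model` here is any clip-level predictor and is a placeholder:

```python
import torch

def sparse_clip_predictions(video_frames, model, num_clips=2, frames_per_clip=4):
    """Illustrative sparse sampling: draw a few short clips from the raw video
    and average their predictions instead of encoding every frame."""
    total = video_frames.size(0)                        # (T, C, H, W) raw frames
    starts = torch.randint(0, total - frames_per_clip + 1, (num_clips,))
    preds = [model(video_frames[s:s + frames_per_clip].unsqueeze(0)) for s in starts]
    return torch.stack(preds).mean(dim=0)               # consensus over sampled clips
```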
VideoGPT: Video generation using VQ-VAE and transformers
We present VideoGPT: a conceptually simple architecture for scaling likelihood-based
generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled …
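The VQ-VAE stage can be sketched as nearest-codebook quantization with a straight-through gradient, producing the discrete latent tokens an autoregressive transformer is then trained on; codebook size and dimensions below are illustrative:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Illustrative VQ step: snap each encoder output to its nearest codebook entry."""
    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                               # z: (B, N, code_dim) latents
        # Squared distance from every latent to every codebook vector: (B, N, K).
        dists = (z.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
        indices = dists.argmin(dim=-1)                  # (B, N) discrete token ids
        quantized = self.codebook(indices)
        # Straight-through estimator: copy gradients from quantized back to z.
        quantized = z + (quantized - z).detach()
        return quantized, indices
```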
NExT-QA: Next phase of question-answering to explaining temporal actions
We introduce NExT-QA, a rigorously designed video question answering (VideoQA)
benchmark to advance video understanding from describing to explaining the temporal …
HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips
Learning text-video embeddings usually requires a dataset of video clips with manually
provided captions. However, such datasets are expensive and time-consuming to create and …
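A simplified version of the joint text-video embedding objective: a max-margin ranking loss that pulls each clip toward its own narration and away from the others in the batch (a sketch under assumed inputs, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def clip_narration_loss(video_emb, text_emb, margin=0.2):
    """Illustrative max-margin ranking loss over a batch of clip-narration pairs."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.T                                       # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                       # matched clip-narration pairs
    # Penalize any mismatched pair that comes within `margin` of its positive.
    cost = (margin + sim - pos).clamp(min=0)
    cost.fill_diagonal_(0)                              # ignore the positives themselves
    return cost.mean()
```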
From recognition to cognition: Visual commonsense reasoning
Visual understanding goes well beyond object recognition. With one glance at an image, we
can effortlessly imagine the world beyond the pixels: for instance, we can infer people's …