Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Video description: A survey of methods, datasets, and evaluation metrics
Video description is the automatic generation of natural language sentences that describe
the contents of a given video. It has applications in human-robot interaction, helping the …
InternVideo2: Scaling foundation models for multimodal video understanding
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieves
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …
MovieChat: From dense token to sparse memory for long video understanding
Recently, integrating video foundation models and large language models to build video
understanding systems has made it possible to overcome the limitations of specific pre-defined vision tasks. Yet …
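The dense-to-sparse memory idea can be illustrated with a toy consolidation routine: greedily average the most similar pair of adjacent frame tokens until a fixed memory budget is met. This is a minimal sketch of the general technique, not the paper's exact algorithm; the mean-merging rule and the 64-token budget are assumptions.

```python
import torch
import torch.nn.functional as F

def consolidate_memory(tokens: torch.Tensor, budget: int) -> torch.Tensor:
    """Greedily merge the most similar pair of adjacent frame tokens
    until only `budget` tokens remain. tokens: (num_frames, dim)."""
    mem = tokens.clone()
    while mem.size(0) > budget:
        # Cosine similarity between each token and its successor.
        sims = F.cosine_similarity(mem[:-1], mem[1:], dim=-1)
        i = int(sims.argmax())               # most redundant adjacent pair
        merged = (mem[i] + mem[i + 1]) / 2   # collapse the pair to its mean
        mem = torch.cat([mem[:i], merged.unsqueeze(0), mem[i + 2:]], dim=0)
    return mem

# Toy usage: 1,000 dense frame tokens compressed to a 64-token memory.
frames = torch.randn(1000, 768)
print(consolidate_memory(frames, budget=64).shape)  # torch.Size([64, 768])
```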
InternVideo: General video foundation models via generative and discriminative learning
Foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …
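The title's pairing of generative and discriminative learning can be sketched as a weighted sum of a masked-feature reconstruction term and a video-text contrastive term. This is a minimal sketch under stated assumptions: the MSE reconstruction target, the 0.07 temperature, and the equal weighting are illustrative choices, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, mask, video_emb, text_emb, alpha=0.5):
    """Generative term (masked-feature reconstruction) plus a
    discriminative term (video-text contrastive)."""
    # Generative: MSE on masked positions only. pred/target: (B, N, D).
    gen = F.mse_loss(pred, target, reduction="none").mean(-1)[mask].mean()
    # Discriminative: symmetric InfoNCE over the batch.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / 0.07
    labels = torch.arange(v.size(0))
    dis = (F.cross_entropy(logits, labels) +
           F.cross_entropy(logits.T, labels)) / 2
    return alpha * gen + (1 - alpha) * dis

# Toy usage with random stand-ins for encoder outputs.
B, N, D = 4, 16, 256
loss = combined_loss(torch.randn(B, N, D), torch.randn(B, N, D),
                     torch.rand(B, N) > 0.5,
                     torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```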
mPLUG-2: A modularized multi-modal foundation model across text, image and video
Recent years have witnessed a big convergence of language, vision, and multi-modal
pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized …
Zero-shot video question answering via frozen bidirectional language models
Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of questions and answers for videos, however, is tedious …
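The core trick, scoring answers with a frozen masked language model, can be sketched in a text-only form: format the question with a mask token and rank candidate answers by the masked-token likelihood. The visual conditioning used in the actual method is omitted here, and the candidate answer vocabulary is a hypothetical placeholder.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Frozen bidirectional LM; in the actual method, visual features are
# injected as learned prompts -- omitted here for a text-only sketch.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

question = "Question: what animal appears in the video? Answer: [MASK]."
candidates = ["dog", "cat", "horse"]  # hypothetical answer vocabulary

inputs = tok(question, return_tensors="pt")
mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = mlm(**inputs).logits[0, mask_pos]

# Rank candidate answers by masked-token likelihood; no training needed.
cand_ids = [tok.convert_tokens_to_ids(c) for c in candidates]
print(candidates[int(logits[cand_ids].argmax())])
```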
InternVid: A large-scale video-text dataset for multimodal understanding and generation
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables
learning powerful and transferable video-text representations for multimodal understanding …
Frozen in time: A joint video and image encoder for end-to-end retrieval
Our objective in this work is video-text retrieval; in particular, a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …
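One payoff of a joint embedding is that retrieval reduces to nearest-neighbor search in the shared space: video embeddings are computed once offline, and each text query then costs a single matrix-vector product. A minimal sketch, assuming pre-computed embeddings from the two encoders; the 256-dim features and top-5 cutoff are illustrative.

```python
import torch
import torch.nn.functional as F

def text_to_video_retrieval(text_emb: torch.Tensor,
                            video_embs: torch.Tensor, k: int = 5):
    """Rank videos for one text query by cosine similarity in the
    shared space. text_emb: (D,), video_embs: (num_videos, D)."""
    q = F.normalize(text_emb, dim=-1)
    g = F.normalize(video_embs, dim=-1)
    scores = g @ q            # one dot product per video
    return scores.topk(k)     # top-k scores and the matching video indices

# Toy usage: 10,000 pre-computed video embeddings, one text query.
videos = torch.randn(10_000, 256)
query = torch.randn(256)
scores, indices = text_to_video_retrieval(query, videos)
print(indices.tolist())
```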
Less is more: ClipBERT for video-and-language learning via sparse sampling
The canonical approach to video-and-language learning (e.g., video question answering)
dictates that a neural model learn from offline-extracted dense video features from vision …
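The sparse-sampling alternative is simple to state: rather than encoding every frame offline, sample a few short clips per video at each training step and let gradients flow end to end. A minimal sketch of such a sampler; the clip count, clip length, and tensor shapes are illustrative assumptions.

```python
import torch

def sample_sparse_clips(video: torch.Tensor, num_clips: int = 2,
                        clip_len: int = 4) -> torch.Tensor:
    """Randomly pick a few short clips from a (T, C, H, W) video each
    training step, instead of encoding all T frames offline.
    Returns (num_clips, clip_len, C, H, W)."""
    T = video.size(0)
    starts = torch.randint(0, T - clip_len + 1, (num_clips,))
    return torch.stack([video[s:s + clip_len] for s in starts.tolist()])

# Toy usage: a 300-frame video reduced to 2 clips of 4 frames each.
video = torch.randn(300, 3, 112, 112)
print(sample_sparse_clips(video).shape)  # torch.Size([2, 4, 3, 112, 112])
```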