A review of deep learning for video captioning
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that comprises
contributions from domains such as computer vision, natural language processing …
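For orientation, the pipeline such reviews survey is typically an encoder-decoder model that maps frame features to a token sequence. The sketch below is a minimal PyTorch illustration under assumed dimensions (feat_dim, d_model, vocab_size are placeholders), not code from the review itself:

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Minimal encoder-decoder video captioner: frame features -> caption tokens."""
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)      # project per-frame CNN features
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, caption_tokens):
        memory = self.encoder(self.proj(frame_feats))   # (B, T, d_model) video context
        tgt = self.embed(caption_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                        # next-token logits

model = VideoCaptioner()
logits = model(torch.randn(2, 16, 2048), torch.randint(0, 10000, (2, 12)))
```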
Survey: Transformer based video-language pre-training
L Ruan, Q Jin - AI Open, 2022 - Elsevier
Inspired by the success of transformer-based pre-training methods on natural language
tasks and, more recently, on computer vision tasks, researchers have started to apply transformers to …
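As a hedged sketch of the model family this survey covers, the snippet below projects video features and text embeddings into one sequence and applies joint self-attention with a masked-token head, a common video-language pre-training setup; every dimension and name here is an illustrative assumption, not the survey's own code:

```python
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    """Illustrative video-language encoder: joint self-attention over both modalities."""
    def __init__(self, feat_dim=768, d_model=512, vocab_size=30000):
        super().__init__()
        self.video_proj = nn.Linear(feat_dim, d_model)   # video features -> shared space
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mlm_head = nn.Linear(d_model, vocab_size)   # predicts masked text tokens

    def forward(self, video_feats, text_ids):
        v = self.video_proj(video_feats)                 # (B, Tv, d)
        t = self.text_embed(text_ids)                    # (B, Tt, d)
        h = self.encoder(torch.cat([v, t], dim=1))       # joint cross-modal attention
        return self.mlm_head(h[:, v.size(1):])           # logits for text positions only
```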
InternVideo: General video foundation models via generative and discriminative learning
Foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …
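The title's pairing of generative and discriminative learning can be illustrated by combining a masked-reconstruction loss with a video-text contrastive loss; the sketch below is an assumption-laden simplification, not InternVideo's actual training code:

```python
import torch
import torch.nn.functional as F

def combined_pretraining_loss(recon, target, video_emb, text_emb, temperature=0.07):
    """Illustrative mix of a generative (masked reconstruction) objective and a
    discriminative (video-text contrastive) objective."""
    # Generative term: reconstruct masked video patches (MAE-style).
    gen_loss = F.mse_loss(recon, target)
    # Discriminative term: symmetric InfoNCE between paired video/text embeddings.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)   # diagonal = positive pairs
    disc_loss = (F.cross_entropy(logits, labels) +
                 F.cross_entropy(logits.T, labels)) / 2
    return gen_loss + disc_loss
```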
VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset
Vision and text have been fully explored in contemporary video-text foundation models,
while other modalities such as audio and subtitles in videos have not received sufficient …
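A minimal sketch of what an omni-modality encoder could look like, assuming per-modality projections into a shared space followed by joint self-attention; the modality names and dimensions are placeholders, not VAST's implementation:

```python
import torch
import torch.nn as nn

class OmniModalityEncoder(nn.Module):
    """Illustrative omni-modality fusion over vision, audio, subtitle, and text."""
    def __init__(self, dims=None, d_model=512):
        super().__init__()
        dims = dims or {"vision": 1024, "audio": 768, "subtitle": 512, "text": 512}
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, inputs):
        # inputs: dict mapping modality name -> (B, T_m, dim_m) feature tensor
        tokens = torch.cat([self.proj[m](x) for m, x in inputs.items()], dim=1)
        return self.encoder(tokens)                     # fused omni-modal representation
```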
Zero-shot video question answering via frozen bidirectional language models
Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of questions and answers for videos, however, is tedious …
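The core idea of answering zero-shot by querying a frozen masked language model can be sketched as below (text-only and restricted to single-token candidates for brevity; the paper additionally conditions the frozen model on video features through lightweight trained modules):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Frozen masked LM used as a zero-shot answer scorer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()
for p in model.parameters():
    p.requires_grad_(False)                 # keep the language model frozen

def score_answers(question, candidates):
    """Rank candidate answers by the MLM's logit at the [MASK] slot."""
    prompt = f"Question: {question} Answer: {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    ids = [tokenizer.convert_tokens_to_ids(c) for c in candidates]
    return dict(zip(candidates, logits[ids].tolist()))

print(score_answers("what animal is jumping", ["dog", "cat", "horse"]))
```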
Less is more: ClipBERT for video-and-language learning via sparse sampling
The canonical approach to video-and-language learning (e.g., video question answering)
dictates that a neural model learn from offline-extracted dense video features from vision …
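The sparse-sampling idea, drawing only a few short clips per video at each training step instead of consuming dense offline features, might look like this in outline; `model` here is any clip-level predictor and is a placeholder:

```python
import torch

def sparse_clip_predictions(video_frames, model, num_clips=2, frames_per_clip=4):
    """Illustrative sparse sampling: draw a few short clips from the raw video
    and average their predictions instead of encoding every frame."""
    total = video_frames.size(0)                        # (T, C, H, W) raw frames
    starts = torch.randint(0, total - frames_per_clip + 1, (num_clips,))
    preds = [model(video_frames[s:s + frames_per_clip].unsqueeze(0)) for s in starts]
    return torch.stack(preds).mean(dim=0)               # consensus over sampled clips
```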
VideoGPT: Video generation using VQ-VAE and transformers
We present VideoGPT: a conceptually simple architecture for scaling likelihood-based
generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled …
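The VQ-VAE stage can be sketched as nearest-codebook quantization with a straight-through gradient, producing the discrete latent tokens an autoregressive transformer is then trained on; codebook size and dimensions below are illustrative:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Illustrative VQ step: snap each encoder output to its nearest codebook entry."""
    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                               # z: (B, N, code_dim) latents
        # Squared distance from every latent to every codebook vector: (B, N, K).
        dists = (z.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
        indices = dists.argmin(dim=-1)                  # (B, N) discrete token ids
        quantized = self.codebook(indices)
        # Straight-through estimator: copy gradients from quantized back to z.
        quantized = z + (quantized - z).detach()
        return quantized, indices
```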
NExT-QA: Next phase of question-answering to explaining temporal actions
We introduce NExT-QA, a rigorously designed video question answering (VideoQA)
benchmark to advance video understanding from describing to explaining the temporal …
HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips
Learning text-video embeddings usually requires a dataset of video clips with manually
provided captions. However, such datasets are expensive and time-consuming to create and …
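A simplified version of the joint text-video embedding objective: a max-margin ranking loss that pulls each clip toward its own narration and away from the others in the batch (a sketch under assumed inputs, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def clip_narration_loss(video_emb, text_emb, margin=0.2):
    """Illustrative max-margin ranking loss over a batch of clip-narration pairs."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.T                                       # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                       # matched clip-narration pairs
    # Penalize any mismatched pair that comes within `margin` of its positive.
    cost = (margin + sim - pos).clamp(min=0)
    cost.fill_diagonal_(0)                              # ignore the positives themselves
    return cost.mean()
```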
From recognition to cognition: Visual commonsense reasoning
Visual understanding goes well beyond object recognition. With one glance at an image, we
can effortlessly imagine the world beyond the pixels: for instance, we can infer people's …