Revisiting feature prediction for learning visual representations from video

A Bardes, Q Garrido, J Ponce, X Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper explores feature prediction as a stand-alone objective for unsupervised learning
from video and introduces V-JEPA, a collection of vision models trained solely using a …

VideoPrism: A foundational visual encoder for video understanding

L Zhao, NB Gundavarapu, L Yuan, H Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video
understanding tasks with a single frozen model. We pretrain VideoPrism on a …

Foundation models for video understanding: A survey

N Madan, A Møgelmose, R Modi, YS Rawat… - Authorea …, 2024 - techrxiv.org
Video Foundation Models (ViFMs) aim to develop general-purpose representations for
various video understanding tasks by leveraging large-scale datasets and powerful models …

Multi-perspective traffic video description model with fine-grained refinement approach

TA To, MN Tran, TB Ho, TL Ha… - Proceedings of the …, 2024 - openaccess.thecvf.com
The analysis of traffic patterns is crucial for enhancing safety and optimizing flow within
urban cities. While urban cities possess extensive camera networks for monitoring the raw …

Video-language understanding: A survey from model architecture, model training, and data perspectives

T Nguyen, Y Bin, J Xiao, L Qu, Y Li, JZ Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Humans use multiple senses to comprehend the environment. Vision and language are two
of the most vital senses since they allow us to easily communicate our thoughts and …

V-JEPA: Latent video prediction for visual representation learning

A Bardes, Q Garrido, J Ponce, X Chen, M Rabbat… - 2023 - openreview.net
This paper shows that the masked-modelling principle driving the success of large
foundational language models can be effectively applied to video by making predictions in …

VideoEval: Comprehensive benchmark suite for low-cost evaluation of video foundation model

X Li, Z Huang, J Wang, K Li, L Wang - arXiv preprint arXiv:2407.06491, 2024 - arxiv.org
With the growth of high-quality data and advancement in visual pre-training paradigms,
Video Foundation Models (VFMs) have made significant progress recently, demonstrating …

LVS: A Learned Video Storage for Fast and Efficient Video Understanding

Y Lee, J Park - Proceedings of the IEEE/CVF Conference …, 2024 - openaccess.thecvf.com
As video understanding (VU) promises unprecedented capabilities in the era of video data
explosion, its efficient computation plays a critical role in practicalizing the algorithmic …

Video Foundation Models for Animal Behavior Analysis

JJ Sun, H Zhou, L Zhao, L Yuan, B Seybold, D Hendon… - bioRxiv, 2024 - biorxiv.org
Computational approaches leveraging computer vision and machine learning have
transformed the quantification of animal behavior from video. However, existing methods …

Video Creation by Demonstration

Y Sun, H Zhou, L Yuan, JJ Sun, Y Li, X Jia… - arXiv preprint arXiv …, 2024 - arxiv.org
We explore a novel video creation experience, namely Video Creation by Demonstration.
Given a demonstration video and a context image from a different scene, we generate a …