VideoPrism: A foundational visual encoder for video understanding

L Zhao, NB Gundavarapu, L Yuan, H Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video
understanding tasks with a single frozen model. We pretrain VideoPrism on a …

MotionLLM: Understanding human behaviors from human motions and videos

LH Chen, S Lu, A Zeng, H Zhang, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
This study delves into the realm of multi-modality (i.e., video and motion modalities) human
behavior understanding by leveraging the powerful capabilities of Large Language Models …

TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment

W Li, H Fan, Y Wong… - Advances in Neural …, 2025 - proceedings.neurips.cc
Recent advancements in image understanding have benefited from the extensive use of
web image-text pairs. However, video understanding remains a challenge despite the …

Real3D: Scaling up large reconstruction models with real-world images

H Jiang, Q Huang, G Pavlakos - arXiv preprint arXiv:2406.08479, 2024 - arxiv.org
The default strategy for training single-view Large Reconstruction Models (LRMs) follows the
fully supervised route using large-scale datasets of synthetic 3D assets or multi-view …

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

Foundation models for video understanding: A survey

N Madan, A Møgelmose, R Modi, YS Rawat… - Authorea …, 2024 - techrxiv.org
Video Foundation Models (ViFMs) aim to develop general-purpose representations for
various video understanding tasks by leveraging large-scale datasets and powerful models …

MoS2: Mixture of Scale and Shift Experts for Text-Only Video Captioning

H Jia, Y Xu, L Zhu, G Chen, Y Wang… - Proceedings of the 32nd …, 2024 - dl.acm.org
Video captioning is a challenging task and typically requires paired video-text data for
training. However, manually annotating coherent textual descriptions for videos is laborious …

Video Question Answering: A survey of the state-of-the-art

PJ Jeshmol, BC Kovoor - Journal of Visual Communication and Image …, 2024 - Elsevier
Video Question Answering (VideoQA) emerges as a prominent trend in the domain
of Artificial Intelligence, Computer Vision, and Natural Language Processing. It involves …

MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline

D Han, E Park, G Lee, A Lee, N Kwak - arXiv preprint arXiv:2407.12508, 2024 - arxiv.org
The rapid expansion of multimedia content has made accurately retrieving relevant videos
from large collections increasingly challenging. Recent advancements in text-video retrieval …

Video Foundation Models for Animal Behavior Analysis

JJ Sun, H Zhou, L Zhao, L Yuan, B Seybold, D Hendon… - bioRxiv, 2024 - biorxiv.org
Computational approaches leveraging computer vision and machine learning have
transformed the quantification of animal behavior from video. However, existing methods …