InternVideo2: Scaling foundation models for multimodal video understanding
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …
TimeChat: A time-sensitive multimodal large language model for long video understanding
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …
Bridging the gap: A unified video comprehension framework for moment retrieval and highlight detection
Abstract Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted
significant attention due to the growing demand for video analysis. Recent approaches treat …
Rethinking weakly-supervised video temporal grounding from a game perspective
This paper addresses the challenging task of weakly-supervised video temporal grounding.
Existing approaches are generally based on the moment proposal selection framework that …
Jack of All Tasks, Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model
The ability of large language models (LLMs) to process visual inputs has given rise to
general-purpose vision systems unifying various vision-language (VL) tasks by instruction …
OmniVid: A generative framework for universal video understanding
The core of video understanding tasks such as recognition, captioning, and tracking is to
automatically detect objects or actions in a video and analyze their temporal evolution …
Video Mamba Suite: State space model as a versatile alternative for video understanding
Understanding videos is one of the fundamental directions in computer vision research, with
extensive efforts dedicated to exploring various architectures such as RNN, 3D CNN, and …
R²-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to
ground relevant clips in untrimmed videos given natural language queries. Most existing …
Correlation-guided query-dependency calibration in video representation learning for temporal grounding
Temporal grounding aims to identify specific moments or highlights from a video corresponding
to textual descriptions. Typical approaches in temporal grounding treat all video clips …
SpikeMba: Multi-modal spiking saliency mamba for temporal video grounding
Temporal video grounding (TVG) is a critical task in video content understanding, requiring
precise alignment between video content and natural language instructions. Despite …