InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

TimeChat: A time-sensitive multimodal large language model for long video understanding

S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …

Bridging the gap: A unified video comprehension framework for moment retrieval and highlight detection

Y Xiao, Z Luo, Y Liu, Y Ma, H Bian… - Proceedings of the …, 2024 - openaccess.thecvf.com
Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted
significant attention due to the growing demand for video analysis. Recent approaches treat …

Rethinking weakly-supervised video temporal grounding from a game perspective

X Fang, Z Xiong, W Fang, X Qu, C Chen, J Dong… - … on Computer Vision, 2024 - Springer
This paper addresses the challenging task of weakly-supervised video temporal grounding.
Existing approaches are generally based on the moment proposal selection framework that …

Jack of All Tasks, Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model

S Pramanick, G Han, R Hou, S Nag… - Proceedings of the …, 2024 - openaccess.thecvf.com
The ability of large language models (LLMs) to process visual inputs has given rise to
general-purpose vision systems unifying various vision-language (VL) tasks by instruction …

OmniVid: A generative framework for universal video understanding

J Wang, D Chen, C Luo, B He, L Yuan… - Proceedings of the …, 2024 - openaccess.thecvf.com
The core of video understanding tasks such as recognition, captioning, and tracking is to
automatically detect objects or actions in a video and analyze their temporal evolution …

Video mamba suite: State space model as a versatile alternative for video understanding

G Chen, Y Huang, J Xu, B Pei, Z Chen, Z Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Understanding videos is one of the fundamental directions in computer vision research, with
extensive efforts dedicated to exploring various architectures such as RNN, 3D CNN, and …

R²-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

Y Liu, J He, W Li, J Kim, D Wei, H Pfister… - European Conference on …, 2024 - Springer
Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to
ground relevant clips in untrimmed videos given natural language queries. Most existing …

Correlation-guided query-dependency calibration in video representation learning for temporal grounding

WJ Moon, S Hyun, SB Lee, JP Heo - CoRR, 2023 - openreview.net
Temporal grounding aims to identify specific moments or highlights from a video that
correspond to textual descriptions. Typical approaches in temporal grounding treat all video clips …

SpikeMba: Multi-modal spiking saliency mamba for temporal video grounding

W Li, X Hong, R Xiong, X Fan - arXiv preprint arXiv:2404.01174, 2024 - arxiv.org
Temporal video grounding (TVG) is a critical task in video content understanding, requiring
precise alignment between video content and natural language instructions. Despite …