VideoPrism: A foundational visual encoder for video understanding

L Zhao, NB Gundavarapu, L Yuan, H Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video
understanding tasks with a single frozen model. We pretrain VideoPrism on a …

MotionLLM: Understanding human behaviors from human motions and videos

LH Chen, S Lu, A Zeng, H Zhang, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
This study delves into the realm of multi-modality (i.e., video and motion modalities) human
behavior understanding by leveraging the powerful capabilities of Large Language Models …

TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment

W Li, H Fan, Y Wong… - Advances in Neural …, 2025 - proceedings.neurips.cc
Recent advancements in image understanding have benefited from the extensive use of
web image-text pairs. However, video understanding remains a challenge despite the …

Real3D: Scaling up large reconstruction models with real-world images

H Jiang, Q Huang, G Pavlakos - arXiv preprint arXiv:2406.08479, 2024 - arxiv.org
The default strategy for training single-view Large Reconstruction Models (LRMs) follows the
fully supervised route using large-scale datasets of synthetic 3D assets or multi-view …

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

Foundation models for video understanding: A survey

N Madan, A Møgelmose, R Modi, YS Rawat… - Authorea …, 2024 - techrxiv.org
Video Foundation Models (ViFMs) aim to develop general-purpose representations for
various video understanding tasks by leveraging large-scale datasets and powerful models …

MoS2: Mixture of Scale and Shift Experts for Text-Only Video Captioning

H Jia, Y Xu, L Zhu, G Chen, Y Wang… - Proceedings of the 32nd …, 2024 - dl.acm.org
Video captioning is a challenging task and typically requires paired video-text data for
training. However, manually annotating coherent textual descriptions for videos is laborious …

Video Question Answering: A survey of the state-of-the-art

PJ Jeshmol, BC Kovoor - Journal of Visual Communication and Image …, 2024 - Elsevier
Video Question Answering (VideoQA) emerges as a prominent trend in the domain
of Artificial Intelligence, Computer Vision, and Natural Language Processing. It involves …

MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline

D Han, E Park, G Lee, A Lee, N Kwak - arXiv preprint arXiv:2407.12508, 2024 - arxiv.org
The rapid expansion of multimedia content has made accurately retrieving relevant videos
from large collections increasingly challenging. Recent advancements in text-video retrieval …

Video Foundation Models for Animal Behavior Analysis

JJ Sun, H Zhou, L Zhao, L Yuan, B Seybold, D Hendon… - bioRxiv, 2024 - biorxiv.org
Computational approaches leveraging computer vision and machine learning have
transformed the quantification of animal behavior from video. However, existing methods …