Google Scholar

B He, H Li, YK Jang, M Jia, X Cao… - Proceedings of the …, 2024 - openaccess.thecvf.com

With the success of large language models (LLMs), integrating the vision model into LLMs to
build vision-language foundation models has gained much more interest recently. However …

Cited by 67

OmniTokenizer: A joint image-video tokenizer for visual generation

J Wang, Y Jiang, Z Yuan, B Peng… - Advances in Neural …, 2025 - proceedings.neurips.cc

Tokenizer, serving as a translator to map the intricate visual data into a compact latent
space, lies at the core of visual generative models. Based on the finding that existing …

Cited by 22

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Cited by 63

Exploring pre-trained text-to-video diffusion models for referring video object segmentation

Z Zhu, X Feng, D Chen, J Yuan, C Qiao… - European Conference on …, 2024 - Springer

In this paper, we explore the visual representations produced from a pre-trained text-to-
video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent …

Cited by 5

TrafficVLM: A controllable visual language model for traffic video captioning

QM Dinh, MK Ho, AQ Dang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

Traffic video description and analysis have received much attention recently due to the
growing demand for efficient and reliable urban surveillance systems. Most existing methods …

Cited by 7

AID: Adapting Image2Video diffusion models for instruction-guided video prediction

Z Xing, Q Dai, Z Weng, Z Wu, YG Jiang - arXiv preprint arXiv:2406.06465, 2024 - arxiv.org

Text-guided video prediction (TVP) involves predicting the motion of future frames from the
initial frame according to an instruction, which has wide applications in virtual reality …

Cited by 11

OmniTracker: Unifying Visual Object Tracking by Tracking-with-Detection

J Wang, Z Wu, D Chen, C Luo, X Dai… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org

Visual Object Tracking (VOT) aims to estimate the positions of target objects in a video
sequence, which is an important vision task with various real-world applications. Depending …

Cited by 1

Do language models understand time?

X Ding, L Wang - arXiv preprint arXiv:2412.13845, 2024 - arxiv.org

Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video summarization …

Cited by 3

EIKA: Explicit & Implicit Knowledge-Augmented Network for entity-aware sports video captioning

Z Xi, G Shi, H Sun, B Zhang, S Li, L Wu - Expert Systems with Applications, 2025 - Elsevier

Sports video captioning in real application scenarios requires both entities and specific
scenes. However, it is difficult to extract this fine-grained information solely from the video …

A simple yet effective knowledge guided method for entity-aware video captioning on a basketball benchmark

Z Xi, G Shi, X Li, J Yan, Z Li, L Wu, Z Liu, L Wang - Neurocomputing, 2025 - Elsevier

Despite the recent emergence of video captioning models, how to generate the text
description with specific entity names and fine-grained actions is far from being solved …

Cited by 1

Saved to my library

OmniVid: A generative framework for universal video understanding

MA-LMM: Memory-augmented large multimodal model for long-term video understanding

OmniTokenizer: A joint image-video tokenizer for visual generation

Video understanding with large language models: A survey

Exploring pre-trained text-to-video diffusion models for referring video object segmentation

TrafficVLM: A controllable visual language model for traffic video captioning

AID: Adapting Image2Video diffusion models for instruction-guided video prediction

OmniTracker: Unifying Visual Object Tracking by Tracking-with-Detection

Do language models understand time?

EIKA: Explicit & Implicit Knowledge-Augmented Network for entity-aware sports video captioning

A simple yet effective knowledge guided method for entity-aware video captioning on a basketball benchmark