Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

VideoLLM-MoD: Efficient video-language streaming with mixture-of-depths vision computation

S Wu, J Chen, KQ Lin, Q Wang, Y Gao… - Advances in …, 2025 - proceedings.neurips.cc
A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while
increasing the number of vision tokens generally enhances visual understanding, it also …

Diff-Tracker: Text-to-image diffusion models are unsupervised trackers

Z Zhang, L Xu, D Peng, H Rahmani, J Liu - European Conference on …, 2024 - Springer
Abstract We introduce Diff-Tracker, a novel approach for the challenging unsupervised
visual tracking task leveraging the pre-trained text-to-image diffusion model. Our main idea …

Learning video context as interleaved multimodal sequences

KQ Lin, P Zhang, D Gao, X Xia, J Chen, Z Gao… - … on Computer Vision, 2024 - Springer
Narrative videos, such as movies, pose significant challenges in video understanding due to
their rich contexts (characters, dialogues, storylines) and diverse demands (identify who …

VideoLLaMB: Long-context video understanding with recurrent memory bridges

Y Wang, C **e, Y Liu, Z Zheng - arxiv preprint arxiv:2409.01071, 2024 - arxiv.org
Recent advancements in large-scale video-language models have shown significant
potential for real-time planning and detailed interactions. However, their high computational …

Do language models understand time?

X Ding, L Wang - arXiv preprint arXiv:2412.13845, 2024 - arxiv.org
Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video summarization …

Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model

Y Huang, J Xu, B Pei, Y He, G Chen, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-
language model. Designed for deployment on portable devices such as smartphones and …

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

S Zhang, Q Fang, Z Yang, Y Feng - arXiv preprint arXiv:2501.03895, 2025 - arxiv.org
The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked
considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into …

StreamChat: Chatting with Streaming Video

J Liu, Z Yu, S Lan, S Wang, R Fang, J Kautz… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents StreamChat, a novel approach that enhances the interaction
capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming …

[Book][B] Computer Vision-ECCV 2024: 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XXIV.

A Leonardis, E Ricci, S Roth, O Russakovsky, T Sattler… - 2024 - books.google.com
The multi-volume set of LNCS books with volume numbers 15059 up to 15147 constitutes
the refereed proceedings of the 18th European Conference on Computer Vision, ECCV …