LLaMA-VID: An image is worth 2 tokens in large language models

Y Li, C Wang, J Jia - European Conference on Computer Vision, 2024 - Springer
In this work, we present LLaMA-VID, a novel method to tackle the token generation challenge
in Vision Language Models (VLMs) for video and image understanding. Current …
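
As a rough illustration of the title's claim, the sketch below compresses one frame into just two tokens: a context token obtained by attending over patch features with the user's query, and a content token obtained by pooling the frame itself. The function name, shapes, and mean pooling are assumptions for illustration, not LLaMA-VID's actual implementation.

import torch
import torch.nn.functional as F

def two_token_frame(frame_feats, text_query):
    """frame_feats: (num_patches, dim) visual features for one frame.
    text_query: (dim,) embedding of the user instruction.
    Hypothetical sketch, not LLaMA-VID's code."""
    # Context token: attend over patch features with the text query.
    attn = F.softmax(frame_feats @ text_query / frame_feats.shape[-1] ** 0.5, dim=0)
    context_token = attn @ frame_feats           # (dim,)
    # Content token: summarize the frame itself, here by mean pooling.
    content_token = frame_feats.mean(dim=0)      # (dim,)
    return torch.stack([context_token, content_token])  # (2, dim)

feats = torch.randn(256, 768)   # e.g. 16x16 ViT patches for one frame
query = torch.randn(768)
print(two_token_frame(feats, query).shape)  # torch.Size([2, 768])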

TimeChat: A time-sensitive multimodal large language model for long video understanding

S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …
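
To make the "time-sensitive" idea concrete, here is a minimal sketch of timestamp-aware frame encoding, one of the ingredients such models rely on: each frame's visual tokens are tagged with an embedding of their timestamp so downstream attention can ground answers in time. TimestampTagger and its shapes are hypothetical, not TimeChat's code.

import torch
import torch.nn as nn

class TimestampTagger(nn.Module):
    """Hypothetical sketch: prepend a learned time token to frame features."""
    def __init__(self, dim=768):
        super().__init__()
        # Learned projection from a scalar timestamp (seconds) to token space.
        self.time_proj = nn.Linear(1, dim)

    def forward(self, frame_tokens, seconds):
        """frame_tokens: (num_tokens, dim) features of one sampled frame."""
        t = torch.tensor([[seconds]], dtype=frame_tokens.dtype)
        time_token = self.time_proj(t)               # (1, dim)
        # Prepend the time token so attention can see "when", not just "what".
        return torch.cat([time_token, frame_tokens], dim=0)

tagger = TimestampTagger()
tokens = tagger(torch.randn(32, 768), seconds=12.5)
print(tokens.shape)  # torch.Size([33, 768])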

MA-LMM: Memory-augmented large multimodal model for long-term video understanding

B He, H Li, YK Jang, M Jia, X Cao… - Proceedings of the …, 2024 - openaccess.thecvf.com
With the success of large language models (LLMs), integrating vision models into LLMs to
build vision-language foundation models has recently gained increasing interest. However …
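
A minimal sketch of the memory-augmentation idea, assuming a fixed-capacity bank that absorbs frames online and, when full, merges the most similar adjacent pair of features so memory stays bounded; the function and capacity below are illustrative, not MA-LMM's exact compression scheme.

import torch
import torch.nn.functional as F

def append_with_compression(memory, new_feat, capacity):
    """memory: (n, dim) past frame features; new_feat: (dim,)."""
    memory = torch.cat([memory, new_feat.unsqueeze(0)], dim=0)
    if memory.shape[0] > capacity:
        # Cosine similarity between each adjacent pair of memory slots.
        sims = F.cosine_similarity(memory[:-1], memory[1:], dim=-1)
        i = int(sims.argmax())
        # Average the most redundant adjacent pair into a single slot.
        merged = (memory[i] + memory[i + 1]) / 2
        memory = torch.cat([memory[:i], merged.unsqueeze(0), memory[i + 2:]], dim=0)
    return memory

mem = torch.empty(0, 768)
for _ in range(100):                       # stream 100 frames
    mem = append_with_compression(mem, torch.randn(768), capacity=20)
print(mem.shape)  # torch.Size([20, 768]) no matter the video length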

Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis

C Fu, Y Dai, Y Luo, L Li, S Ren, R Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs)
have emerged as a focal point in recent advancements. However, the predominant focus …

VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in Video-LLMs

Z Cheng, S Leng, H Zhang, Y Xin, X Li, G Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs)
designed to enhance spatial-temporal modeling and audio understanding in video …
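
One common way to realize spatial-temporal modeling is a convolutional connector that downsamples the (time, height, width) feature grid before projecting into the LLM's embedding space. The sketch below illustrates that pattern; the kernel size, dimensions, and class name are assumptions, not VideoLLaMA 2's exact connector.

import torch
import torch.nn as nn

class STConnector(nn.Module):
    """Hypothetical sketch of a spatial-temporal downsampling connector."""
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        # One strided 3D conv halves the temporal and spatial resolution.
        self.conv = nn.Conv3d(vis_dim, vis_dim, kernel_size=3, stride=2, padding=1)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, x):
        """x: (batch, time, height, width, vis_dim) patch features."""
        x = x.permute(0, 4, 1, 2, 3)                 # (B, C, T, H, W) for Conv3d
        x = self.conv(x)
        x = x.permute(0, 2, 3, 4, 1).flatten(1, 3)   # (B, T'*H'*W', C)
        return self.proj(x)                          # tokens for the LLM

connector = STConnector()
tokens = connector(torch.randn(1, 8, 16, 16, 1024))
print(tokens.shape)  # torch.Size([1, 256, 4096]): 8x16x16 -> 4x8x8 tokens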

Large models for time series and spatio-temporal data: A survey and outlook

M Jin, Q Wen, Y Liang, C Zhang, S Xue, X Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Temporal data, notably time series and spatio-temporal data, are prevalent in real-world
applications. They capture dynamic system measurements and are produced in vast …

Streaming dense video captioning

X Zhou, A Arnab, S Buch, S Yan… - Proceedings of the …, 2024 - openaccess.thecvf.com
An ideal model for dense video captioning--predicting captions localized temporally in a
video--should be able to handle long input videos, predict rich, detailed textual descriptions …
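
A hedged sketch of one way to keep memory fixed while streaming: incoming frame tokens are folded into K running centroids, so the summary stays the same size however long the video runs. update_memory and its parameters are illustrative of the fixed-size-memory idea, not the paper's exact module.

import torch

def update_memory(centroids, counts, new_tokens):
    """centroids: (K, dim); counts: (K,); new_tokens: (n, dim)."""
    # Assign each new token to its nearest centroid.
    dists = torch.cdist(new_tokens, centroids)        # (n, K)
    assign = dists.argmin(dim=1)                      # (n,)
    for k in assign.unique():
        members = new_tokens[assign == k]
        # Running mean keeps each centroid a summary of all tokens it absorbed.
        total = counts[k] + members.shape[0]
        centroids[k] = (centroids[k] * counts[k] + members.sum(dim=0)) / total
        counts[k] = total
    return centroids, counts

K, dim = 16, 768
memory, counts = torch.randn(K, dim), torch.zeros(K)
for _ in range(50):                                   # 50 incoming frames
    memory, counts = update_memory(memory, counts, torch.randn(32, dim))
print(memory.shape)  # torch.Size([16, 768]) regardless of video length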

InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output

P Zhang, X Dong, Y Zang, Y Cao, R Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …

PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning

L Xu, Y Zhao, D Zhou, Z Lin, SK Ng, J Feng - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language pre-training has significantly elevated performance across a wide range of
image-language applications. Yet, the pre-training process for video-related tasks demands …
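
The "parameter-free" extension can be sketched as plain pooling: per-frame features from an image VLM are adaptively average-pooled over the spatial-temporal grid down to a fixed token budget, with no new trainable weights. The target sizes below are assumptions, not PLLaVA's actual configuration.

import torch
import torch.nn.functional as F

def pool_video_tokens(frame_feats, out_t=16, out_h=12, out_w=12):
    """frame_feats: (T, H, W, dim) patch features from an image VLM."""
    x = frame_feats.permute(3, 0, 1, 2).unsqueeze(0)     # (1, dim, T, H, W)
    # Parameter-free: adaptive average pooling, no weights to train.
    x = F.adaptive_avg_pool3d(x, (out_t, out_h, out_w))
    return x.squeeze(0).permute(1, 2, 3, 0).flatten(0, 2)  # (T'*H'*W', dim)

feats = torch.randn(32, 24, 24, 1024)   # 32 frames of 24x24 patches
tokens = pool_video_tokens(feats)
print(tokens.shape)  # torch.Size([2304, 1024]): 16*12*12 tokens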

Streaming long video understanding with large language models

R Qian, X Dong, P Zhang, Y Zang… - Advances in …, 2025 - proceedings.neurips.cc
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding that capably understands arbitrary-length video with a constant …
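
A minimal sketch of a constant token budget for arbitrary-length video, assuming each clip is absorbed into a fixed-size summary by cross-attention conditioned on the summary so far; StreamingSummarizer, its dimensions, and the attention layout are illustrative, not VideoStreaming's architecture.

import torch
import torch.nn as nn

class StreamingSummarizer(nn.Module):
    """Hypothetical sketch: a fixed set of summary tokens absorbs each clip."""
    def __init__(self, dim=768, num_summary=64):
        super().__init__()
        self.summary = nn.Parameter(torch.randn(num_summary, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, clips):
        state = self.summary.unsqueeze(0)              # (1, S, dim)
        for clip in clips:                             # clip: (n_tokens, dim)
            # Attend from the running summary into [summary; clip] tokens.
            kv = torch.cat([state, clip.unsqueeze(0)], dim=1)
            state, _ = self.attn(state, kv, kv)        # stays (1, S, dim)
        return state.squeeze(0)                        # always 64 tokens

model = StreamingSummarizer()
clips = [torch.randn(256, 768) for _ in range(10)]     # ten 256-token clips
print(model(clips).shape)  # torch.Size([64, 768]) for any video length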