- Academic Search

Streaming long video understanding with large language models

R Qian, X Dong, P Zhang, Y Zang… - Advances in …, 2025 - proceedings.neurips.cc

This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding, that capably understands arbitrary-length video with a constant …

Cited by 21 · All 5 versions

[PDF] arxiv.org

A simple LLM framework for long-range video question-answering

C Zhang, T Lu, MM Islam, Z Wang, S Yu… - arXiv preprint arXiv …, 2023 - arxiv.org

We present LLoVi, a language-based framework for long-range video question-answering
(LVQA). Unlike prior long-range video understanding methods, which are often costly and …

Cited by 71 · All 5 versions

[PDF] arxiv.org

VideoAgent: Long-Form Video Understanding with Large Language Model as Agent

X Wang, Y Zhang, O Zohar, S Yeung-Levy - European Conference on …, 2024 - Springer

Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …

Cited by 46 · All 9 versions

[PDF] arxiv.org

Towards generalist robot learning from internet video: A survey

R McCarthy, DCH Tan, D Schmidt, F Acero… - arXiv preprint arXiv …, 2024 - arxiv.org

Scaling deep learning to massive, diverse internet data has yielded remarkably general
capabilities in visual and natural language understanding and generation. However, data …

Cited by 9 · All 3 versions

[PDF] aclanthology.org

AnyMAL: An efficient and scalable any-modality augmented language model

S Moon, A Madotto, Z Lin, T Nagarajan… - Proceedings of the …, 2024 - aclanthology.org

We present Any-Modality Augmented Language Model (AnyMAL), a unified model
that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion …

Cited by 79 · All 3 versions

[PDF] arxiv.org

Language repository for long video understanding

K Kahatapitiya, K Ranasinghe, J Park… - arXiv preprint arXiv …, 2024 - arxiv.org

Language has become a prominent modality in computer vision with the rise of LLMs.
Despite supporting long context-lengths, their effectiveness in handling long-term …

Cited by 17 · All 5 versions

[PDF] arxiv.org

Memory consolidation enables long-context video understanding

I Balažević, Y Shi, P Papalampidi, R Chaabouni… - arXiv preprint arXiv …, 2024 - arxiv.org

Most transformer-based video encoders are limited to short temporal contexts due to their
quadratic complexity. While various attempts have been made to extend this context, this …

Cited by 19 · All 6 versions

[PDF] neurips.cc

TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment

W Li, H Fan, Y Wong… - Advances in Neural …, 2025 - proceedings.neurips.cc

Recent advancements in image understanding have benefited from the extensive use of
web image-text pairs. However, video understanding remains a challenge despite the …

Cited by 3 · All 4 versions

[PDF] arxiv.org

VideoLLaMB: Long-context video understanding with recurrent memory bridges

Y Wang, C Xie, Y Liu, Z Zheng - arXiv preprint arXiv:2409.01071, 2024 - arxiv.org

Recent advancements in large-scale video-language models have shown significant
potential for real-time planning and detailed interactions. However, their high computational …

Cited by 4 · All 2 versions

[PDF] arxiv.org

DrVideo: Document retrieval based long video understanding

Z Ma, C Gou, H Shi, B Sun, S Li, H Rezatofighi… - arXiv preprint arXiv …, 2024 - arxiv.org

Existing methods for long video understanding primarily focus on videos only lasting tens of
seconds, with limited exploration of techniques for handling longer videos. The increased …

Cited by 6 · All 2 versions

A simple recipe for contrastively pre-training video-first encoders beyond 16 frames

Streaming long video understanding with large language models
