LongVILA: Scaling long-context visual language models for long videos
Long-context capability is critical for multi-modal foundation models, especially for long
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …
Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution
Visual data comes in various forms, ranging from small icons of just a few pixels to long
videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual …
LVBench: An extreme long video understanding benchmark
Recent progress in multimodal large language models has markedly enhanced the
understanding of short videos (typically under one minute), and several evaluation datasets …
Apollo: An exploration of video understanding in large multimodal models
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …
TimeMarker: A versatile video-LLM for long and short video understanding with superior temporal localization ability
The rapid development of large language models (LLMs) has significantly advanced multimodal
large language models (LMMs), particularly in vision-language tasks. However, existing …
xGen-MM-Vid (BLIP-3-Video): You only need 32 tokens to represent a video even in VLMs
We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos,
particularly designed to efficiently capture temporal information over multiple frames. BLIP-3 …
InternLM-XComposer2.5-OmniLive: A comprehensive multimodal system for long-term streaming video and audio interactions
Creating AI systems that can interact with environments over long periods, similar to human
cognition, has been a longstanding research goal. Recent advancements in multimodal …
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro,
and Reka Core, have expanded their capabilities to include vision and audio modalities …
ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models
Despite the recent breakthroughs achieved by Large Vision Language Models (LVLMs) in
understanding and responding to complex visual-textual contexts, their inherent …
Do Language Models Understand Time?
Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video summarization …