Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

SlowFast-LLaVA: A strong training-free baseline for video large language models

M Xu, M Gao, Z Gan, HY Chen, Z Lai, H Gang… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language
model (LLM) that can jointly capture detailed spatial semantics and long-range temporal …

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

AMEGO: Active Memory from long EGOcentric videos

G Goletto, T Nagarajan, G Averta, D Damen - European Conference on …, 2024 - Springer
Egocentric videos provide a unique perspective into individuals' daily experiences, yet their
unstructured nature presents challenges for perception. In this paper, we introduce AMEGO …

Tarsier: Recipes for training and evaluating large video description models

J Wang, L Yuan, Y Zhang, H Sun - arXiv preprint arXiv:2407.00634, 2024 - arxiv.org
Generating fine-grained video descriptions is a fundamental challenge in video
understanding. In this work, we introduce Tarsier, a family of large-scale video-language …

VideoLLaMB: Long-context video understanding with recurrent memory bridges

Y Wang, C Xie, Y Liu, Z Zheng - arXiv preprint arXiv:2409.01071, 2024 - arxiv.org
Recent advancements in large-scale video-language models have shown significant
potential for real-time planning and detailed interactions. However, their high computational …

VLM-Grounder: A VLM agent for zero-shot 3D visual grounding

R Xu, Z Huang, T Wang, Y Chen, J Pang… - arXiv preprint arXiv …, 2024 - arxiv.org
3D visual grounding is crucial for robots, requiring integration of natural language and 3D
scene understanding. Traditional methods depending on supervised learning with 3D point …

Multi-modal generative AI: Multi-modal LLM, diffusion and beyond

H Chen, X Wang, Y Zhou, B Huang, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal generative AI has received increasing attention in both academia and industry.
Particularly, two dominant families of techniques are: i) The multi-modal large language …

Large language models for mobility in transportation systems: A survey on forecasting tasks

Z Zhang, Y Sun, Z Wang, Y Nie, X Ma, P Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
Mobility analysis is a crucial element in the research area of transportation systems.
Forecasting traffic information offers a viable solution to address the conflict between …

Episodic memory verbalization using hierarchical representations of life-long robot experience

L Bärmann, C DeChant, J Plewnia… - arXiv preprint arXiv …, 2024 - arxiv.org
Verbalization of robot experience, i.e., summarization of and question answering about a
robot's past, is a crucial ability for improving human-robot interaction. Previous works …