Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

SlowFast-LLaVA: A strong training-free baseline for video large language models

M Xu, M Gao, Z Gan, HY Chen, Z Lai, H Gang… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language
model (LLM) that can jointly capture detailed spatial semantics and long-range temporal …

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

AMEGO: Active Memory from long EGOcentric videos

G Goletto, T Nagarajan, G Averta, D Damen - European Conference on …, 2024 - Springer
Egocentric videos provide a unique perspective into individuals' daily experiences, yet their
unstructured nature presents challenges for perception. In this paper, we introduce AMEGO …

Tarsier: Recipes for training and evaluating large video description models

J Wang, L Yuan, Y Zhang, H Sun - arXiv preprint arXiv:2407.00634, 2024 - arxiv.org
Generating fine-grained video descriptions is a fundamental challenge in video
understanding. In this work, we introduce Tarsier, a family of large-scale video-language …

VideoLLaMB: Long-context video understanding with recurrent memory bridges

Y Wang, C Xie, Y Liu, Z Zheng - arXiv preprint arXiv:2409.01071, 2024 - arxiv.org
Recent advancements in large-scale video-language models have shown significant
potential for real-time planning and detailed interactions. However, their high computational …

VLM-Grounder: A VLM agent for zero-shot 3D visual grounding

R Xu, Z Huang, T Wang, Y Chen, J Pang… - arXiv preprint arXiv …, 2024 - arxiv.org
3D visual grounding is crucial for robots, requiring integration of natural language and 3D
scene understanding. Traditional methods depending on supervised learning with 3D point …

Multi-modal generative AI: Multi-modal LLM, diffusion and beyond

H Chen, X Wang, Y Zhou, B Huang, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal generative AI has received increasing attention in both academia and industry.
Particularly, two dominant families of techniques are: i) The multi-modal large language …

Large language models for mobility in transportation systems: A survey on forecasting tasks

Z Zhang, Y Sun, Z Wang, Y Nie, X Ma, P Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
Mobility analysis is a crucial element in the research area of transportation systems.
Forecasting traffic information offers a viable solution to address the conflict between …

Episodic memory verbalization using hierarchical representations of life-long robot experience

L Bärmann, C DeChant, J Plewnia… - arXiv preprint arXiv …, 2024 - arxiv.org
Verbalization of robot experience, i.e., summarization of and question answering about a
robot's past, is a crucial ability for improving human-robot interaction. Previous works …