Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
Slowfast-llava: A strong training-free baseline for video large language models
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language
model (LLM) that can jointly capture detailed spatial semantics and long-range temporal …
Apollo: An exploration of video understanding in large multimodal models
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …
AMEGO: Active Memory from long EGOcentric videos
Egocentric videos provide a unique perspective into individuals' daily experiences, yet their
unstructured nature presents challenges for perception. In this paper, we introduce AMEGO …
Tarsier: Recipes for training and evaluating large video description models
Generating fine-grained video descriptions is a fundamental challenge in video
understanding. In this work, we introduce Tarsier, a family of large-scale video-language …
Videollamb: Long-context video understanding with recurrent memory bridges
Recent advancements in large-scale video-language models have shown significant
potential for real-time planning and detailed interactions. However, their high computational …
Vlm-grounder: A vlm agent for zero-shot 3d visual grounding
3D visual grounding is crucial for robots, requiring integration of natural language and 3D
scene understanding. Traditional methods depending on supervised learning with 3D point …
Multi-modal generative ai: Multi-modal llm, diffusion and beyond
Multi-modal generative AI has received increasing attention in both academia and industry.
Particularly, two dominant families of techniques are: i) The multi-modal large language …
Large language models for mobility in transportation systems: A survey on forecasting tasks
Mobility analysis is a crucial element in the research area of transportation systems.
Forecasting traffic information offers a viable solution to address the conflict between …
Episodic memory verbalization using hierarchical representations of life-long robot experience
Verbalization of robot experience, i.e., summarization of and question answering about a
robot's past, is a crucial ability for improving human-robot interaction. Previous works …