Benchmark evaluations, applications, and challenges of large vision language models: A survey

Z Li, X Wu, H Du, H Nghiem, G Shi - arxiv preprint arxiv:2501.02189, 2025 - arxiv.org
Multimodal Vision Language Models (VLMs) have emerged as a transformative technology
at the intersection of computer vision and natural language processing, enabling machines …

π0: A Vision-Language-Action Flow Model for General Robot Control

K Black, N Brown, D Driess, A Esmail, M Equi… - arxiv preprint arxiv …, 2024 - arxiv.org
Robot learning holds tremendous promise to unlock the full potential of flexible, general, and
dexterous robot systems, as well as to address some of the deepest questions in artificial …

Open-Sora Plan: Open-source large video generation model

B Lin, Y Ge, X Cheng, Z Li, B Zhu, S Wang, X He… - arxiv preprint arxiv …, 2024 - arxiv.org
We introduce Open-Sora Plan, an open-source project that aims to contribute a large
generation model for generating desired high-resolution videos with long durations based …

Provably optimal memory capacity for modern hopfield models: Transformer-compatible dense associative memories as spherical codes

JYC Hu, D Wu, H Liu - arxiv preprint arxiv:2410.23126, 2024 - arxiv.org
We study the optimal memorization capacity of modern Hopfield models and Kernelized
Hopfield Models (KHMs), a transformer-compatible class of Dense Associative Memories …

Inference-time scaling for diffusion models beyond scaling denoising steps

N Ma, S Tong, H Jia, H Hu, YC Su, M Zhang… - arxiv preprint arxiv …, 2025 - arxiv.org
Generative models have made significant impacts across various domains, largely due to
their ability to scale during training by increasing data, computational resources, and model …

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

S Yuan, J Huang, X He, Y Ge, Y Shi, L Chen… - arxiv preprint arxiv …, 2024 - arxiv.org
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with
consistent human identity. It is an important task in video generation but remains an open …

ReCapture: Generative video camera controls for user-provided videos using masked video fine-tuning

DJ Zhang, R Paiss, S Zada, N Karnad… - arxiv preprint arxiv …, 2024 - arxiv.org
Recently, breakthroughs in video modeling have allowed for controllable camera trajectories
in generated videos. However, these methods cannot be directly applied to user-provided …

Motion Prompting: Controlling Video Generation with Motion Trajectories

D Geng, C Herrmann, J Hur, F Cole, S Zhang… - arxiv preprint arxiv …, 2024 - arxiv.org
Motion control is crucial for generating expressive and compelling video content; however,
most existing video generation models rely mainly on text prompts for control, which struggle …

Navigation world models

A Bar, G Zhou, D Tran, T Darrell, Y LeCun - arxiv preprint arxiv …, 2024 - arxiv.org
Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a
Navigation World Model (NWM), a controllable video generation model that predicts future …

LTX-Video: Realtime video latent diffusion

Y HaCohen, N Chiprut, B Brazowski, D Shalem… - arxiv preprint arxiv …, 2024 - arxiv.org
We introduce LTX-Video, a transformer-based latent diffusion model that adopts a holistic
approach to video generation by seamlessly integrating the responsibilities of the Video …