Open-MAGVIT2: An open-source project toward democratizing auto-regressive visual generation

Z Luo, F Shi, Y Ge, Y Yang, L Wang, Y Shan - arxiv preprint arxiv …, 2024 - arxiv.org
We present Open-MAGVIT2, a family of auto-regressive image generation models ranging
from 300M to 1.5B. The Open-MAGVIT2 project produces an open-source replication of …

Do generative video models learn physical principles from watching videos?

S Motamed, L Culp, K Swersky, P Jaini… - arxiv preprint arxiv …, 2025 - arxiv.org
AI video generation is undergoing a revolution, with quality and realism advancing rapidly.
These advances have led to a passionate scientific debate: Do video models learn "world …

A Survey of Embodied AI in Healthcare: Techniques, Applications, and Opportunities

Y Liu, X Cao, T Chen, Y Jiang, J You, M Wu… - arxiv preprint arxiv …, 2025 - arxiv.org
Healthcare systems worldwide face persistent challenges in efficiency, accessibility, and
personalization. Powered by modern AI technologies such as multimodal large language …

Multimodal Medical Code Tokenizer

X Su, S Messica, Y Huang, R Johnson, L Fesser… - arxiv preprint arxiv …, 2025 - arxiv.org
Foundation models trained on patient electronic health records (EHRs) require tokenizing
medical data into sequences of discrete vocabulary items. Existing tokenizers treat medical …

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Y Wang, X Li, Z Yan, Y He, J Yu, X Zeng… - arxiv preprint arxiv …, 2025 - arxiv.org
This paper aims to improve the performance of video multimodal large language models
(MLLM) via long and rich context (LRC) modeling. As a result, we develop a new version of …

DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models

R Liang, Z Gojcic, H Ling, J Munkberg… - arxiv preprint arxiv …, 2025 - arxiv.org
Understanding and modeling lighting effects are fundamental tasks in computer vision and
graphics. Classic physically-based rendering (PBR) accurately simulates the light transport …

Improving the Diffusability of Autoencoders

I Skorokhodov, S Girish, B Hu, W Menapace… - arxiv preprint arxiv …, 2025 - arxiv.org
Latent diffusion models have emerged as the leading approach for generating high-quality
images and videos, utilizing compressed latent representations to reduce the computational …

Goku: Flow Based Video Generative Foundation Models

S Chen, C Ge, Y Zhang, Y Zhang, F Zhu… - arxiv preprint arxiv …, 2025 - arxiv.org
This paper introduces Goku, a state-of-the-art family of joint image-and-video generation
models leveraging rectified flow Transformers to achieve industry-leading performance. We …

Trajectory World Models for Heterogeneous Environments

S Yin, J Wu, S Huang, X Su, X He, J Hao… - arxiv preprint arxiv …, 2025 - arxiv.org
Heterogeneity in sensors and actuators across environments poses a significant challenge
to building large-scale pre-trained world models on top of this low-dimensional sensor …

DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation

Z Yuan, S Wang, R Xie, H Zhang, T Fang… - arxiv preprint arxiv …, 2025 - arxiv.org
In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a training-free
paradigm that can make use of adaptive temporal compression in latent space. While …