InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Prospective role of foundation models in advancing autonomous vehicles

J Wu, B Gao, J Gao, J Yu, H Chu, Q Yu, X Gong… - Research, 2024 - spj.science.org
With the development of artificial intelligence and breakthroughs in deep learning, large-
scale foundation models (FMs), such as generative pre-trained transformer (GPT), Sora, etc …

WorldGPT: Empowering LLM as multimodal world model

Z Ge, H Huang, M Zhou, J Li, G Wang, S Tang… - Proceedings of the …, 2024 - dl.acm.org
World models are progressively being employed across diverse fields, extending from basic
environment simulation to complex scenario construction. However, existing models are …

Pyramidal flow matching for efficient video generative modeling

Y Jin, Z Sun, N Li, K Xu, H Jiang, N Zhuang… - arxiv preprint arxiv …, 2024 - arxiv.org
Video generation requires modeling a vast spatiotemporal space, which demands
significant computational resources and data usage. To reduce the complexity, the …

Diffusion policy policy optimization

AZ Ren, J Lidard, LL Ankile, A Simeonov… - arxiv preprint arxiv …, 2024 - arxiv.org
We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework
including best practices for fine-tuning diffusion-based policies (e.g., Diffusion Policy) in …

Diffusion models are real-time game engines

D Valevski, Y Leviathan, M Arar, S Fruchter - arxiv preprint arxiv …, 2024 - arxiv.org
We present GameNGen, the first game engine powered entirely by a neural model that
enables real-time interaction with a complex environment over long trajectories at high …

Latent action pretraining from videos

S Ye, J Jang, B Jeon, S Joo, J Yang, B Peng… - arxiv preprint arxiv …, 2024 - arxiv.org
We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised
method for pretraining Vision-Language-Action (VLA) models without ground-truth robot …

A review of multimodal explainable artificial intelligence: Past, present and future

S Sun, W An, F Tian, F Nan, Q Liu, J Liu, N Shah… - arxiv preprint arxiv …, 2024 - arxiv.org
Artificial intelligence (AI) has rapidly developed through advancements in computational
power and the growth of massive datasets. However, this progress has also heightened …

Retrieval-augmented decision transformer: External memory for in-context rl

T Schmied, F Paischer, V Patil, M Hofmarcher… - arxiv preprint arxiv …, 2024 - arxiv.org
In-context learning (ICL) is the ability of a model to learn a new task by observing a few
exemplars in its context. While prevalent in NLP, this capability has recently also been …

Diffusion models trained with large data are transferable visual models

G Xu, Y Ge, M Liu, C Fan, K Xie, Z Zhao… - arxiv preprint arxiv …, 2024 - arxiv.org
We show that, by simply initializing image understanding models with a pre-trained UNet (or
transformer) from diffusion models, it is possible to achieve remarkable transferable …