InternVideo2: Scaling foundation models for multimodal video understanding
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …
Prospective role of foundation models in advancing autonomous vehicles
J Wu, B Gao, J Gao, J Yu, H Chu, Q Yu, X Gong… - Research, 2024 - spj.science.org
With the development of artificial intelligence and breakthroughs in deep learning, large-
scale foundation models (FMs), such as generative pre-trained transformer (GPT), Sora, etc …
Worldgpt: Empowering llm as multimodal world model
World models are progressively being employed across diverse fields, extending from basic
environment simulation to complex scenario construction. However, existing models are …
Pyramidal flow matching for efficient video generative modeling
Video generation requires modeling a vast spatiotemporal space, which demands
significant computational resources and data usage. To reduce the complexity, the …
Diffusion policy policy optimization
We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework
including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in …
Diffusion models are real-time game engines
We present GameNGen, the first game engine powered entirely by a neural model that
enables real-time interaction with a complex environment over long trajectories at high …
Latent action pretraining from videos
We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised
method for pretraining Vision-Language-Action (VLA) models without ground-truth robot …
A review of multimodal explainable artificial intelligence: Past, present and future
Artificial intelligence (AI) has rapidly developed through advancements in computational
power and the growth of massive datasets. However, this progress has also heightened …
Retrieval-augmented decision transformer: External memory for in-context rl
In-context learning (ICL) is the ability of a model to learn a new task by observing a few
exemplars in its context. While prevalent in NLP, this capability has recently also been …
Diffusion models trained with large data are transferable visual models
We show that, simply initializing image understanding models using a pre-trained UNet (or
transformer) of diffusion models, it is possible to achieve remarkable transferable …