InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Prospective role of foundation models in advancing autonomous vehicles

J Wu, B Gao, J Gao, J Yu, H Chu, Q Yu, X Gong… - Research, 2024 - spj.science.org
With the development of artificial intelligence and breakthroughs in deep learning, large-
scale foundation models (FMs), such as generative pre-trained transformer (GPT), Sora, etc …

WorldGPT: Empowering LLM as multimodal world model

Z Ge, H Huang, M Zhou, J Li, G Wang, S Tang… - Proceedings of the …, 2024 - dl.acm.org
World models are progressively being employed across diverse fields, extending from basic
environment simulation to complex scenario construction. However, existing models are …

Pyramidal flow matching for efficient video generative modeling

Y Jin, Z Sun, N Li, K Xu, H Jiang, N Zhuang… - arxiv preprint arxiv …, 2024 - arxiv.org
Video generation requires modeling a vast spatiotemporal space, which demands
significant computational resources and data usage. To reduce the complexity, the …

Diffusion policy policy optimization

AZ Ren, J Lidard, LL Ankile, A Simeonov… - arxiv preprint arxiv …, 2024 - arxiv.org
We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework
including best practices for fine-tuning diffusion-based policies (e.g., Diffusion Policy) in …

Diffusion models are real-time game engines

D Valevski, Y Leviathan, M Arar, S Fruchter - arxiv preprint arxiv …, 2024 - arxiv.org
We present GameNGen, the first game engine powered entirely by a neural model that
enables real-time interaction with a complex environment over long trajectories at high …

Latent action pretraining from videos

S Ye, J Jang, B Jeon, S Joo, J Yang, B Peng… - arxiv preprint arxiv …, 2024 - arxiv.org
We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised
method for pretraining Vision-Language-Action (VLA) models without ground-truth robot …

A review of multimodal explainable artificial intelligence: Past, present and future

S Sun, W An, F Tian, F Nan, Q Liu, J Liu, N Shah… - arxiv preprint arxiv …, 2024 - arxiv.org
Artificial intelligence (AI) has rapidly developed through advancements in computational
power and the growth of massive datasets. However, this progress has also heightened …

Retrieval-augmented decision transformer: External memory for in-context rl

T Schmied, F Paischer, V Patil, M Hofmarcher… - arxiv preprint arxiv …, 2024 - arxiv.org
In-context learning (ICL) is the ability of a model to learn a new task by observing a few
exemplars in its context. While prevalent in NLP, this capability has recently also been …

Diffusion models trained with large data are transferable visual models

G Xu, Y Ge, M Liu, C Fan, K Xie, Z Zhao… - arxiv preprint arxiv …, 2024 - arxiv.org
We show that, by simply initializing image understanding models with a pre-trained UNet (or
transformer) from diffusion models, it is possible to achieve remarkable transferable …