InternVideo2: Scaling foundation models for multimodal video understanding
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …
When does Sora show: The beginning of TAO to imaginative intelligence and scenarios engineering
During our discussion at workshops for writing “What Does ChatGPT Say: The DAO from
Algorithmic Intelligence to Linguistic Intelligence”[1], we had expected the next milestone for …
Prospective role of foundation models in advancing autonomous vehicles
J Wu, B Gao, J Gao, J Yu, H Chu, Q Yu, X Gong… - Research, 2024 - spj.science.org
With the development of artificial intelligence and breakthroughs in deep learning, large-
scale foundation models (FMs), such as generative pre-trained transformer (GPT), Sora, etc …
Modeling caption diversity in contrastive vision-language pretraining
S Lavoie, P Kirichenko, M Ibrahim, M Assran…
… an image and its caption to a single vector--limiting …
Apollo: An exploration of video understanding in large multimodal models
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …
DINO-WM: World models on pre-trained visual features enable zero-shot planning
The ability to predict future outcomes given control actions is fundamental for physical
reasoning. However, such predictive models, often called world models, have proven …
World models for autonomous driving: An initial survey
In the rapidly evolving landscape of autonomous driving, the capability to accurately predict
future events and assess their implications is paramount for both safety and efficiency …
ViC-MAE: Self-supervised representation learning from images and video with contrastive masked autoencoders
We propose ViC-MAE, a model that combines both Masked AutoEncoders (MAE) and
contrastive learning. ViC-MAE is trained using a global representation obtained by pooling …
Lexicon3D: Probing visual foundation models for complex 3D scene understanding
Complex 3D scene understanding has gained increasing attention, with scene encoding
strategies playing a crucial role in this success. However, the optimal scene encoding …
IGOR: Image-goal representations are the atomic control units for foundation models in embodied AI
We introduce Image-GOal Representations (IGOR), aiming to learn a unified, semantically
consistent action space across human and various robots. Through this unified latent action …