- Academic Search

InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Cited by 118

When does Sora show: The beginning of TAO to imaginative intelligence and scenarios engineering

FY Wang, Q Miao, L Li, Q Ni, X Li, J Li… - IEEE/CAA Journal of …, 2024 - ieeexplore.ieee.org

During our discussion at workshops for writing “What Does ChatGPT Say: The DAO from
Algorithmic Intelligence to Linguistic Intelligence”[1], we had expected the next milestone for …

Cited by 57

Prospective role of foundation models in advancing autonomous vehicles

J Wu, B Gao, J Gao, J Yu, H Chu, Q Yu, X Gong… - Research, 2024 - spj.science.org

With the development of artificial intelligence and breakthroughs in deep learning, large-scale
foundation models (FMs), such as generative pre-trained transformer (GPT), Sora, etc …

Cited by 3

Modeling caption diversity in contrastive vision-language pretraining

S Lavoie, P Kirichenko, M Ibrahim, M Assran… - arxiv.org

… mapping an image and its caption to a single vector--limiting …

Cited by 15

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arxiv preprint arxiv …, 2024 - arxiv.org

Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

Cited by 6

DINO-WM: World models on pre-trained visual features enable zero-shot planning

G Zhou, H Pan, Y LeCun, L Pinto - arxiv preprint arxiv:2411.04983, 2024 - arxiv.org

The ability to predict future outcomes given control actions is fundamental for physical
reasoning. However, such predictive models, often called world models, have proven …

Cited by 4

World models for autonomous driving: An initial survey

Y Guan, H Liao, Z Li, J Hu, R Yuan, Y Li… - IEEE Transactions …, 2024 - ieeexplore.ieee.org

In the rapidly evolving landscape of autonomous driving, the capability to accurately predict
future events and assess their implications is paramount for both safety and efficiency …

Cited by 23

ViC-MAE: Self-supervised representation learning from images and video with contrastive masked autoencoders

J Hernandez, R Villegas, V Ordonez - European Conference on Computer …, 2024 - Springer

We propose ViC-MAE, a model that combines both Masked AutoEncoders (MAE) and
contrastive learning. ViC-MAE is trained using a global representation obtained by pooling …

Cited by 8

Lexicon3D: Probing visual foundation models for complex 3D scene understanding

Y Man, S Zheng, Z Bao, M Hebert, LY Gui… - arxiv preprint arxiv …, 2024 - arxiv.org

Complex 3D scene understanding has gained increasing attention, with scene encoding
strategies playing a crucial role in this success. However, the optimal scene encoding …

Cited by 4

IGOR: Image-goal representations are the atomic control units for foundation models in embodied AI

X Chen, J Guo, T He, C Zhang, P Zhang… - arxiv preprint arxiv …, 2024 - arxiv.org

We introduce Image-GOal Representations (IGOR), aiming to learn a unified, semantically
consistent action space across human and various robots. Through this unified latent action …

Cited by 4
