Review of large vision models and visual prompt engineering

J Wang, Z Liu, L Zhao, Z Wu, C Ma, S Yu, H Dai… - Meta-Radiology, 2023 - Elsevier

Visual prompt engineering is a fundamental methodology in the field of visual and image
artificial general intelligence. As the development of large vision models progresses, the …

Cited by 146, 4 versions

VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer

In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Cited by 218, 10 versions

ImageBind: One embedding space to bind them all

R Girdhar, A El-Nouby, Z Liu, M Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com

We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …

Cited by 836, 7 versions

VideoMamba: State space model for efficient video understanding

K Li, X Li, Y Wang, Y He, Y Wang, L Wang… - European Conference on …, 2024 - Springer

Addressing the dual challenges of local redundancy and global dependencies in video
understanding, this work innovatively adapts the Mamba to the video domain. The proposed …

Cited by 147, 2 versions

InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Cited by 118, 3 versions

Long-CLIP: Unlocking the long-text capability of CLIP

B Zhang, P Zhang, X Dong, Y Zang, J Wang - European Conference on …, 2024 - Springer

Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot
classification, text-image retrieval, and text-image generation by aligning image and …

Cited by 80, 2 versions

InternVideo: General video foundation models via generative and discriminative learning

Y Wang, K Li, Y Li, Y He, B Huang, Z Zhao… - arXiv preprint arXiv …, 2022 - arxiv.org

The foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …

Cited by 327, 2 versions

VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset

S Chen, H Li, Q Wang, Z Zhao… - Advances in Neural …, 2023 - proceedings.neurips.cc

Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …

Cited by 102, 6 versions

mPLUG-2: A modularized multi-modal foundation model across text, image and video

H Xu, Q Ye, M Yan, Y Shi, J Ye, Y Xu… - International …, 2023 - proceedings.mlr.press

Recent years have witnessed a big convergence of language, vision, and multi-modal
pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized …

Cited by 129, 6 versions

Unmasked teacher: Towards training-efficient video foundation models

K Li, Y Wang, Y Li, Y Wang, Y He… - Proceedings of the …, 2023 - openaccess.thecvf.com

Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation Models …

Cited by 143, 5 versions

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

Review of large vision models and visual prompt engineering
