A survey on video diffusion models

Z Xing, Q Feng, H Chen, Q Dai, H Hu, H Xu… - ACM Computing …, 2024 - dl.acm.org
The recent wave of AI-generated content (AIGC) has witnessed substantial success in
computer vision, with the diffusion model playing a crucial role in this achievement. Due to …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset

S Chen, H Li, Q Wang, Z Zhao… - Advances in Neural …, 2023 - proceedings.neurips.cc
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …

Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model

S Smith, M Patwary, B Norick, P LeGresley… - arXiv preprint arXiv …, 2022 - arxiv.org
Pretrained general-purpose language models can achieve state-of-the-art accuracies in
various natural language processing domains by adapting to downstream tasks via zero …

Unmasked teacher: Towards training-efficient video foundation models

K Li, Y Wang, Y Li, Y Wang, Y He… - Proceedings of the …, 2023 - openaccess.thecvf.com
Abstract Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation Models …

MVImgNet: A large-scale dataset of multi-view images

X Yu, M Xu, Y Zhang, H Liu, C Ye… - Proceedings of the …, 2023 - openaccess.thecvf.com
Being data-driven is one of the most iconic properties of deep learning algorithms. The birth
of ImageNet drives a remarkable trend of "learning from large-scale data" in computer vision …

PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …