A comprehensive survey on applications of transformers for deep learning tasks
Transformers are Deep Neural Networks (DNNs) that utilize a self-attention mechanism to capture contextual relationships within sequential data. Unlike traditional …
Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
MVBench: A comprehensive multi-modal video understanding benchmark
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …
Visual ChatGPT: Talking, drawing and editing with visual foundation models
ChatGPT is attracting cross-field interest as it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains …
VideoMamba: State space model for efficient video understanding
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts Mamba to the video domain. The proposed …
InternVideo2: Scaling foundation models for multimodal video understanding
We introduce InternVideo2, a new family of video foundation models (ViFMs) that achieve state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …
Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
InternVideo: General video foundation models via generative and discriminative learning
Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models …
InternVid: A large-scale video-text dataset for multimodal understanding and generation
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables
learning powerful and transferable video-text representations for multimodal understanding …