Google Scholar

Any-to-any generation via composable diffusion

Z Tang, Z Yang, C Zhu, M Zeng… - Advances in Neural …, 2023 - proceedings.neurips.cc

We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image, video, or audio …

Cited by 146

Vision transformers are parameter-efficient audio-visual learners

YB Lin, YL Sung, J Lei, M Bansal… - Proceedings of the …, 2023 - openaccess.thecvf.com

Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …

Cited by 78

Mavil: Masked audio-video learners

PY Huang, V Sharma, H Xu, C Ryali… - Advances in …, 2023 - proceedings.neurips.cc

We present Masked Audio-Video Learners (MAViL) to learn audio-visual
representations with three complementary forms of self-supervision: (1) reconstructing …

Cited by 67

Vidchapters-7m: Video chapters at scale

A Yang, A Nagrani, I Laptev, J Sivic… - Advances in Neural …, 2023 - proceedings.neurips.cc

Segmenting untrimmed videos into chapters enables users to quickly navigate to the
information of their interest. This important topic has been understudied due to the lack of …

Cited by 27

Clippo: Image-and-language understanding from pixels only

M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …

Cited by 50

Valor: Vision-audio-language omni-perception pretraining model and dataset

J Liu, S Chen, X He, L Guo, X Zhu… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

In this paper, we propose the Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multimodal understanding and generation. Unlike widely-studied vision …

Cited by 7

Crema: Multimodal compositional video reasoning via efficient modular adaptation and fusion

S Yu, J Yoon, M Bansal - ar**

S Sastry, S Khanal, A Dhakal… - Proceedings of the …, 2024 - openaccess.thecvf.com

We propose a metadata-aware self-supervised learning (SSL) framework useful for
fine-grained classification and ecological mapping of bird species around the world. Our …

Cited by 8
