Vision transformers for action recognition: A survey

A Ulhaq, N Akhtar, G Pogrebna, A Mian - arXiv preprint arXiv:2209.05700, 2022 - arxiv.org
Vision transformers are emerging as a powerful tool to solve computer vision problems.
Recent techniques have also proven the efficacy of transformers beyond the image domain …

Multimodal fusion on low-quality data: A comprehensive survey

Q Zhang, Y Wei, Z Han, H Fu, X Peng, C Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal fusion focuses on integrating information from multiple modalities with the goal of
more accurate prediction, which has achieved remarkable progress in a wide range of …

ImageBind: One embedding space to bind them all

R Girdhar, A El-Nouby, Z Liu, M Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …

Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives

K Grauman, A Westbury, L Torresani… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Ego-Exo4D, a diverse, large-scale multimodal, multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously captured egocentric …

OneLLM: One framework to align all modalities with language

J Han, K Gong, Y Zhang, J Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However, existing works rely heavily on modality …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner that has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Learning video representations from large language models

Y Zhao, I Misra, P Krähenbühl… - Proceedings of the …, 2023 - openaccess.thecvf.com
We introduce LAVILA, a new approach to learning video-language representations by
leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …

ST-Adapter: Parameter-efficient image-to-video transfer learning

J Pan, Z Lin, X Zhu, J Shao, H Li - Advances in Neural …, 2022 - proceedings.neurips.cc
Capitalizing on large pre-trained models for various downstream tasks of interest has
recently emerged as a promising approach. Due to the ever-growing model size, the …

Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning

VW Liang, Y Zhang, Y Kwon… - Advances in Neural …, 2022 - proceedings.neurips.cc
We present the modality gap, an intriguing geometric phenomenon of the representation space
of multi-modal models. Specifically, we show that different data modalities (e.g., images and …

Frozen CLIP models are efficient video learners

Z Lin, S Geng, R Zhang, P Gao, G De Melo… - … on Computer Vision, 2022 - Springer
Video recognition has been dominated by the end-to-end learning paradigm: first initializing
a video recognition model with weights of a pretrained image model and then conducting …