Self-supervised representation learning: Introduction, advances, and challenges

L Ericsson, H Gouk, CC Loy… - IEEE Signal Processing …, 2022 - ieeexplore.ieee.org
Self-supervised representation learning (SSRL) methods aim to provide powerful, deep
feature learning without the requirement of large annotated data sets, thus alleviating the …

Self-supervised learning for videos: A survey

MC Schiappa, YS Rawat, M Shah - ACM Computing Surveys, 2023 - dl.acm.org
The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …

ImageBind: One embedding space to bind them all

R Girdhar, A El-Nouby, Z Liu, M Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Ego4D: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

CrossPoint: Self-supervised cross-modal contrastive learning for 3D point cloud understanding

M Afham, I Dissanayake… - Proceedings of the …, 2022 - openaccess.thecvf.com
Manual annotation of large-scale point cloud datasets for varying tasks such as 3D object
classification, segmentation, and detection is often laborious owing to the irregular structure …

MultiMAE: Multi-modal multi-task masked autoencoders

R Bachmann, D Mizrahi, A Atanov, A Zamir - European Conference on …, 2022 - Springer
We propose a pre-training strategy called Multi-modal Multi-task Masked Autoencoders
(MultiMAE). It differs from standard Masked Autoencoding in two key aspects: I) it can …

Attention bottlenecks for multimodal fusion

A Nagrani, S Yang, A Arnab, A Jansen… - Advances in neural …, 2021 - proceedings.neurips.cc
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …

Learning audio-visual speech representation by masked multimodal cluster prediction

B Shi, WN Hsu, K Lakhotia, A Mohamed - arXiv preprint arXiv:2201.02184, 2022 - arxiv.org
Video recordings of speech contain correlated audio and visual information, providing a
strong signal for speech representation learning from the speaker's lip movements and the …

Omnivore: A single model for many visual modalities

R Girdhar, M Singh, N Ravi… - Proceedings of the …, 2022 - openaccess.thecvf.com
Prior work has studied different visual modalities in isolation and developed separate
architectures for recognition of images, videos, and 3D data. Instead, in this paper, we …