A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends

J Gui, T Chen, J Zhang, Q Cao, Z Sun… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Deep supervised learning algorithms typically require a large volume of labeled data to
achieve satisfactory performance. However, the process of collecting and labeling such data …

Audio self-supervised learning: A survey

S Liu, A Mallol-Ragolta, E Parada-Cabaleiro, K Qian… - Patterns, 2022 - cell.com
Similar to humans' cognitive ability to generalize knowledge and skills, self-supervised
learning (SSL) targets discovering general representations from large-scale data. This …

Masked autoencoders as spatiotemporal learners

C Feichtenhofer, Y Li, K He - Advances in neural …, 2022 - proceedings.neurips.cc
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to
spatiotemporal representation learning from videos. We randomly mask out spacetime …
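The core idea stated in the snippet, randomly masking a large fraction of spacetime patches and encoding only the visible ones, can be illustrated with a minimal PyTorch sketch. The patch sizes, masking ratio, and function name below are assumptions for illustration, not the paper's exact configuration.

    import torch

    def random_spacetime_masking(video, patch_t=2, patch_hw=16, mask_ratio=0.9):
        """Tokenize a video into spacetime patches and keep a random subset.

        video: (B, C, T, H, W). Patch sizes and the masking ratio here are
        illustrative assumptions, not the paper's exact configuration.
        """
        # Cut the clip into non-overlapping spacetime patches:
        # (B, C, T', H', W', pt, p, p) with T' = T/pt, H' = H/p, W' = W/p
        patches = (video.unfold(2, patch_t, patch_t)
                        .unfold(3, patch_hw, patch_hw)
                        .unfold(4, patch_hw, patch_hw))
        B, C, Tp, Hp, Wp, pt, ph, pw = patches.shape
        # Flatten each patch into a token: (B, N, C * pt * p * p)
        patches = patches.permute(0, 2, 3, 4, 1, 5, 6, 7).reshape(B, Tp * Hp * Wp, -1)

        N = patches.shape[1]
        n_keep = int(N * (1 - mask_ratio))
        # Uniform random masking over all spacetime positions
        noise = torch.rand(B, N)
        keep_idx = noise.argsort(dim=1)[:, :n_keep]
        visible = torch.gather(
            patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
        return visible, keep_idx  # the encoder sees only the visible tokens

    # Toy usage: a 16-frame 224x224 RGB clip
    clip = torch.randn(2, 3, 16, 224, 224)
    visible, keep_idx = random_spacetime_masking(clip)
    print(visible.shape)  # (2, N_keep, patch_dim)

With a high masking ratio the encoder processes only a small fraction of the tokens, which is what keeps MAE-style pre-training tractable on video.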

VATT: Transformers for multimodal self-supervised learning from raw video, audio and text

H Akbari, L Yuan, R Qian… - Advances in …, 2021 - proceedings.neurips.cc
We present a framework for learning multimodal representations from unlabeled data using
convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer …

Siamese masked autoencoders

A Gupta, J Wu, J Deng, FF Li - Advances in Neural …, 2023 - proceedings.neurips.cc
Establishing correspondence between images or scenes is a significant challenge in
computer vision, especially given occlusions, viewpoint changes, and varying object …

Learning to exploit temporal structure for biomedical vision-language processing

S Bannur, S Hyland, Q Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Self-supervised learning in vision-language processing (VLP) exploits semantic alignment
between imaging and text modalities. Prior work in biomedical VLP has mostly relied on the …

Wav2CLIP: Learning robust audio representations from CLIP

HH Wu, P Seetharaman, K Kumar… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
We propose Wav2CLIP, a robust audio representation learning method by distilling from
Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on …
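As a rough illustration of the distillation idea described in the snippet, the sketch below aligns a trainable audio encoder's embeddings with frozen CLIP image embeddings of the paired video frames through a symmetric contrastive loss. The temperature value and the symmetric cross-entropy form are assumptions of this sketch rather than the paper's exact objective.

    import torch
    import torch.nn.functional as F

    def audio_clip_distillation_loss(audio_emb, image_emb, temperature=0.07):
        """Contrastive distillation sketch: pull each audio embedding toward the
        frozen CLIP image embedding of its paired video frames, and push it away
        from the other pairs in the batch. Temperature and loss form are
        assumptions for illustration.

        audio_emb: (B, D) output of the trainable audio encoder
        image_emb: (B, D) frozen CLIP image embedding of the paired frames
        """
        audio_emb = F.normalize(audio_emb, dim=-1)
        image_emb = F.normalize(image_emb, dim=-1)
        logits = audio_emb @ image_emb.t() / temperature   # (B, B) similarities
        targets = torch.arange(logits.size(0), device=logits.device)
        # Matched pairs lie on the diagonal; score them in both directions
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))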

Verbs in action: Improving verb understanding in video-language models

L Momeni, M Caron, A Nagrani… - Proceedings of the …, 2023 - openaccess.thecvf.com
Understanding verbs is crucial to modelling how people and objects interact with each other
and the environment through space and time. Recently, state-of-the-art video-language …

BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition

Y Zhang, DS Park, W Han, J Qin… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
We summarize the results of a host of efforts using giant automatic speech recognition (ASR)
models pre-trained using large, diverse unlabeled datasets containing approximately a …

Contrastive audio-visual masked autoencoder

Y Gong, A Rouditchenko, AH Liu, D Harwath… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single
modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio …
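The combination suggested by the title, masked reconstruction on both modalities plus a cross-modal contrastive term on pooled representations, can be sketched as a joint training objective. The loss weighting, temperature, and tensor shapes below are illustrative assumptions, not the paper's reported values.

    import torch
    import torch.nn.functional as F

    def cav_mae_objective(audio_pred, audio_target, video_pred, video_target,
                          audio_repr, video_repr, contrast_weight=0.01,
                          temperature=0.05):
        """Joint objective sketch: masked reconstruction for each modality plus a
        cross-modal contrastive term. Weight, temperature, and shapes are
        assumptions for illustration.

        *_pred / *_target: (B, N_masked, D) decoder outputs and ground-truth
        patches at the masked positions of each modality.
        audio_repr / video_repr: (B, D) pooled encoder outputs per modality.
        """
        # Masked reconstruction (mean squared error) for both modalities
        recon = (F.mse_loss(audio_pred, audio_target) +
                 F.mse_loss(video_pred, video_target))

        # Cross-modal contrastive term: paired clips are positives,
        # the rest of the batch are negatives
        a = F.normalize(audio_repr, dim=-1)
        v = F.normalize(video_repr, dim=-1)
        logits = a @ v.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        contrast = 0.5 * (F.cross_entropy(logits, targets) +
                          F.cross_entropy(logits.t(), targets))
        return recon + contrast_weight * contrast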