Audio self-supervised learning: A survey

S Liu, A Mallol-Ragolta, E Parada-Cabaleiro, K Qian… - Patterns, 2022 - cell.com
Similar to humans' cognitive ability to generalize knowledge and skills, self-supervised
learning (SSL) targets discovering general representations from large-scale data. This …

Emotion recognition using different sensors, emotion models, methods and datasets: A comprehensive review

Y Cai, X Li, J Li - Sensors, 2023 - mdpi.com
In recent years, the rapid development of sensors and information technology has made it
possible for machines to recognize and analyze human emotions. Emotion recognition is an …

Survey of deep representation learning for speech emotion recognition

S Latif, R Rana, S Khalifa, R Jurdak… - IEEE Transactions …, 2021 - ieeexplore.ieee.org
Traditionally, speech emotion recognition (SER) research has relied on manually
handcrafted acoustic features using feature engineering. However, the design of …

Self supervised adversarial domain adaptation for cross-corpus and cross-language speech emotion recognition

S Latif, R Rana, S Khalifa, R Jurdak… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Despite the recent advancement in speech emotion recognition (SER) within a single corpus
setting, the performance of these SER systems degrades significantly for cross-corpus and …

Machine learning for stuttering identification: Review, challenges and future directions

SA Sheikh, M Sahidullah, F Hirsch, S Ouni - Neurocomputing, 2022 - Elsevier
Stuttering is a speech disorder during which the flow of speech is interrupted by involuntary
pauses and repetition of sounds. Stuttering identification is an interesting interdisciplinary …

Leveraging unimodal self-supervised learning for multimodal audio-visual speech recognition

X Pan, P Chen, Y Gong, H Zhou, X Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Training Transformer-based models demands a large amount of data, while obtaining
aligned and labelled data in multimodality is rather cost-demanding, especially for audio …

SepTr: Separable transformer for audio spectrogram processing

NC Ristea, RT Ionescu, FS Khan - arXiv preprint arXiv:2203.09581, 2022 - arxiv.org
Following the successful application of vision transformers in multiple computer vision tasks,
these models have drawn the attention of the signal processing community. This is because …

Universal facial encoding of codec avatars from VR headsets

S Bai, TL Wang, C Li, A Venkatesh, T Simon… - arXiv preprint arXiv …, 2024 - arxiv.org
Faithful real-time facial animation is essential for avatar-mediated telepresence in Virtual
Reality (VR). To emulate authentic communication, avatar animation needs to be efficient …

Similarity analysis of self-supervised speech representations

YA Chung, Y Belinkov, J Glass - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org
Self-supervised speech representation learning has recently been a prosperous research
topic. Many algorithms have been proposed for learning useful representations from large …

Self-paced ensemble learning for speech and audio classification

NC Ristea, RT Ionescu - arXiv preprint arXiv:2103.11988, 2021 - arxiv.org
Combining multiple machine learning models into an ensemble is known to provide superior
performance levels compared to the individual components forming the ensemble. This is …