CWCL: Cross-modal transfer with continuously weighted contrastive loss

RS Srinivasa, J Cho, C Yang… - Advances in …, 2023 - proceedings.neurips.cc
This paper considers contrastive training for cross-modal 0-shot transfer wherein a pre-
trained model in one modality is used for representation learning in another domain using …

EAT: Self-supervised pre-training with efficient audio transformer

W Chen, Y Liang, Z Ma, Z Zheng, X Chen - arXiv preprint arXiv …, 2024 - arxiv.org
Audio self-supervised learning (SSL) pre-training, which aims to learn good representations
from unlabeled audio, has made remarkable progress. However, the extensive …

Towards open respiratory acoustic foundation models: Pretraining and benchmarking

Y Zhang, T Xia, J Han, Y Wu, G Rizos, Y Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Respiratory audio, such as coughing and breathing sounds, has predictive power for a wide
range of healthcare applications, yet is currently under-explored. The main problem for …

Self-supervised audio teacher-student transformer for both clip-level and frame-level tasks

X Li, N Shao, X Li - IEEE/ACM Transactions on Audio, Speech …, 2024 - ieeexplore.ieee.org
Self-supervised learning (SSL) has emerged as a popular approach for learning audio
representations. One goal of audio self-supervised pre-training is to transfer knowledge to …

SAIC: Integration of speech anonymization and identity classification

M Cheng, X Diao, S Cheng, W Liu - AI for Health Equity and Fairness …, 2024 - Springer
Speech anonymization and de-identification have garnered significant attention recently,
especially in the healthcare area including telehealth consultations, patient voiceprint …

Perceptual musical features for interpretable audio tagging

V Lyberatos, S Kantarelis, E Dervakos… - … on Acoustics, Speech …, 2024 - ieeexplore.ieee.org
In the age of music streaming platforms, the task of automatically tagging music audio has
garnered significant attention, driving researchers to devise methods aimed at enhancing …

Audio-Language Models for Audio-Centric Tasks: A survey

Y Su, J Bai, Q Xu, K Xu, Y Dou - arXiv preprint arXiv:2501.15177, 2025 - arxiv.org
Audio-Language Models (ALMs), which are trained on audio-text data, focus on the
processing, understanding, and reasoning of sounds. Unlike traditional supervised learning …

Masked modeling duo for speech: Specializing general-purpose audio representation to speech using denoising distillation

D Niizumi, D Takeuchi, Y Ohishi, N Harada… - arXiv preprint arXiv …, 2023 - arxiv.org
Self-supervised learning general-purpose audio representations have demonstrated high
performance in a variety of tasks. Although they can be optimized for application by fine …

MDRT: Multi-domain synthetic speech localization

AKS Yadav, K Bhagtani, S Baireddy… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
With recent advancements in generating synthetic speech, tools to generate high-quality
synthetic speech impersonating any human speaker are easily available. Several incidents …

SynthAX: A fast modular synthesizer in JAX

M Cherep, N Singh - Audio Engineering Society Convention 155, 2023 - aes.org
Modern audio production relies heavily on realtime audio synthesis. However, accelerating
audio synthesis far beyond realtime speeds has a significant role to play in advancing …