Sparks of large audio models: A survey and outlook

S Latif, M Shoukat, F Shamshad, M Usama… - arxiv preprint arxiv …, 2023 - arxiv.org
This survey paper provides a comprehensive overview of the recent advancements and
challenges in applying large language models to the field of audio signal processing. Audio …

Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research

X Mei, C Meng, H Liu, Q Kong, T Ko… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …

Audiobox: Unified audio generation with natural language prompts

A Vyas, B Shi, M Le, A Tjandra, YC Wu, B Guo… - arxiv preprint arxiv …, 2023 - arxiv.org
Audio is an essential part of our life, but creating it often requires expertise and is time-consuming.
Research communities have made great progress over the past year advancing …

Video and audio deepfake datasets and open issues in deepfake technology: being ahead of the curve

Z Akhtar, TL Pendyala, VS Athmakuri - Forensic Sciences, 2024 - mdpi.com
The revolutionary breakthroughs in Machine Learning (ML) and Artificial Intelligence (AI) are
extensively being harnessed across a diverse range of domains, e.g., forensic science …

Masked generative video-to-audio transformers with enhanced synchronicity

S Pascual, C Yeh, I Tsiamas, J Serrà - European Conference on Computer …, 2024 - Springer
Video-to-audio (V2A) generation leverages visual-only video features to render
plausible sounds that match the scene. Importantly, the generated sound onsets should …

Improving text-to-audio models with synthetic captions

Z Kong, S Lee, D Ghosal, N Majumder… - arxiv preprint arxiv …, 2024 - arxiv.org
It is an open challenge to obtain high quality training data, especially captions, for text-to-audio
models. Although prior methods have leveraged text-only language models to …

Ditto: Diffusion inference-time t-optimization for music generation

Z Novack, J McAuley, T Berg-Kirkpatrick… - arxiv preprint arxiv …, 2024 - arxiv.org
We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose framework
for controlling pre-trained text-to-music diffusion models at inference-time via …

Picoaudio: Enabling precise timestamp and frequency controllability of audio events in text-to-audio generation

Z Xie, X Xu, Z Wu, M Wu - arxiv preprint arxiv:2407.02869, 2024 - arxiv.org
Recently, audio generation tasks have attracted considerable research interests. Precise
temporal controllability is essential to integrate audio generation with real applications. In …

Lcfed: An efficient clustered federated learning framework for heterogeneous data

Y Zhang, H Chen, Z Lin, Z Chen, J Zhao - arxiv preprint arxiv:2501.01850, 2025 - arxiv.org
Clustered federated learning (CFL) addresses the performance challenges posed by data
heterogeneity in federated learning (FL) by organizing edge devices with similar data …

Musicflow: Cascaded flow matching for text guided music generation

KR Prajwal, B Shi, M Lee, A Vyas, A Tjandra… - arxiv preprint arxiv …, 2024 - arxiv.org
We introduce MusicFlow, a cascaded text-to-music generation model based on flow
matching. Based on self-supervised representations to bridge between text descriptions and …