Sparks of large audio models: A survey and outlook
This survey paper provides a comprehensive overview of the recent advancements and
challenges in applying large language models to the field of audio signal processing. Audio …
Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research
The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …
Audiobox: Unified audio generation with natural language prompts
Audio is an essential part of our life, but creating it often requires expertise and is time-
consuming. Research communities have made great progress over the past year advancing …
Video and audio deepfake datasets and open issues in deepfake technology: being ahead of the curve
Z Akhtar, TL Pendyala, VS Athmakuri - Forensic Sciences, 2024 - mdpi.com
The revolutionary breakthroughs in Machine Learning (ML) and Artificial Intelligence (AI) are
extensively being harnessed across a diverse range of domains, e.g., forensic science …
Masked generative video-to-audio transformers with enhanced synchronicity
Video-to-audio (V2A) generation leverages visual-only video features to render
plausible sounds that match the scene. Importantly, the generated sound onsets should …
Improving text-to-audio models with synthetic captions
It is an open challenge to obtain high quality training data, especially captions, for text-to-
audio models. Although prior methods have leveraged text-only language models to …
Ditto: Diffusion inference-time t-optimization for music generation
We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose framework
for controlling pre-trained text-to-music diffusion models at inference-time via …
Picoaudio: Enabling precise timestamp and frequency controllability of audio events in text-to-audio generation
Recently, audio generation tasks have attracted considerable research interests. Precise
temporal controllability is essential to integrate audio generation with real applications. In …
Lcfed: An efficient clustered federated learning framework for heterogeneous data
Y Zhang, H Chen, Z Lin, Z Chen, J Zhao - arXiv preprint arXiv:2501.01850, 2025 - arxiv.org
Clustered federated learning (CFL) addresses the performance challenges posed by data
heterogeneity in federated learning (FL) by organizing edge devices with similar data …
Musicflow: Cascaded flow matching for text guided music generation
We introduce MusicFlow, a cascaded text-to-music generation model based on flow
matching. Based on self-supervised representations to bridge between text descriptions and …