SpecMaskGIT: Masked generative modeling of audio spectrograms for efficient audio synthesis and beyond

M Comunità, Z Zhong, A Takahashi, S Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in generative models that iteratively synthesize audio clips sparked great
success to text-to-audio synthesis (TTA), but with the cost of slow synthesis speed and heavy …

Audiobox TTA-RAG: Improving zero-shot and few-shot text-to-audio with retrieval-augmented generation

M Yang, B Shi, M Le, WN Hsu, A Tjandra - arXiv preprint arXiv:2411.05141, 2024 - arxiv.org
Current leading Text-To-Audio (TTA) generation models suffer from degraded performance
on zero-shot and few-shot settings. It is often challenging to generate high-quality audio for …

FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation

H Liu, J Wang, R Huang, Y Liu, H Lu, W Xue… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in latent diffusion models (LDMs) have markedly enhanced text-to-audio
generation, yet their iterative sampling processes impose substantial computational …

FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation

J Im, J Nam - arXiv preprint arXiv:2501.10807, 2025 - arxiv.org
Versatile audio super-resolution (SR) is the challenging task of restoring high-frequency
components from low-resolution audio with sampling rates between 4kHz and 32kHz in …

Generative and parametric models for interactive neural synthesis in speech and audio

MJC Largo - 2024 - oa.upm.es
Speech synthesis is a multifaceted process that encompasses both acoustic signals and
articulatory dynamics. Traditional neural audio synthesis methods often rely exclusively on …