- Academic Search

G Wijngaard, E Formisano, M Esposito… - IEEE …, 2025 - ieeexplore.ieee.org

Audio-language models (ALMs) generate linguistic descriptions of sound-producing events
and scenes. Advances in dataset creation and computational power have led to significant …

Cited by 2

[PDF] arxiv.org

EzAudio: Enhancing text-to-audio generation with efficient diffusion transformer

J Hai, Y Xu, H Zhang, C Li, H Wang, M Elhilali… - arXiv preprint arXiv …, 2024 - arxiv.org

Latent diffusion models have shown promising results in text-to-audio (T2A) generation
tasks, yet previous models have encountered difficulties in generation quality, computational …

Cited by 3

[PDF] arxiv.org

Challenge on sound scene synthesis: Evaluating text-to-audio generation

J Lee, M Tailleur, LM Heller, K Choi… - arXiv preprint arXiv …, 2024 - arxiv.org

Despite significant advancements in neural text-to-audio generation, challenges persist in
controllability and evaluation. This paper addresses these issues through the Sound Scene …

Cited by 1

[PDF] arxiv.org

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

CY Hung, N Majumder, Z Kong, A Mehrish… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M
parameters, capable of generating up to 30 seconds of 44.1 kHz audio in just 3.7 seconds …


[PDF] arxiv.org

ETTA: Elucidating the Design Space of Text-to-Audio Models

S Lee, Z Kong, A Goel, S Kim, R Valle… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling
users to enrich their creative workflows with synthetic audio generated from natural …


[PDF] arxiv.org

SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

H Wang, J Hai, YJ Lu, K Thakkar, M Elhilali… - arXiv preprint arXiv …, 2024 - arxiv.org

In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target
sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the …


[PDF] arxiv.org

Sound Scene Synthesis at the DCASE 2024 Challenge

M Lagrange, J Lee, M Tailleur, LM Heller… - arXiv preprint arXiv …, 2025 - arxiv.org

This paper presents Task 7 at the DCASE 2024 Challenge: sound scene synthesis. Recent
advances in sound synthesis and generative models have enabled the creation of realistic …


[PDF] openreview.net

Fugatto 1: Foundational Generative Audio Transformer Opus 1

R Valle, R Badlani, Z Kong, S Lee, A Goel… - … Conference on Learning … - openreview.net

Fugatto is a versatile audio synthesis and transformation model capable of following
free-form text instructions with optional audio inputs. While large language models (LLMs) …


[PDF] preprints.org

[PDF] Continuous or Discrete, That Is the Question: A Survey on Large Multi-Modal Models from the Perspective of Input-Output Space Extension

Z Li, J Zhang, D Wang, Y Wang, X Huang, Z Wei - 2024 - preprints.org

With the success of large language models (LLMs) driving progress towards
general-purpose AI, there has been a growing focus on extending these models to multi-modal …


Improving text-to-audio models with synthetic captions

Audio-Language Datasets of Scenes and Events: A Survey
