Foundation models for music: A survey
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …
Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization
Generative multimodal content is increasingly prevalent in much of the content creation
arena, as it has the potential to allow artists and media personnel to create pre-production …
Lauragpt: Listen, attend, understand, and regenerate audio with gpt
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance
on various natural language processing tasks, and have shown great potential as …
Autoregressive speech synthesis without vector quantization
We present MELLE, a novel continuous-valued tokens based language modeling approach
for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel …
Audio-Language Datasets of Scenes and Events: A Survey
Audio-language models (ALMs) generate linguistic descriptions of sound-producing events
and scenes. Advances in dataset creation and computational power have led to significant …
Flashspeech: Efficient zero-shot speech synthesis
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced
by language models and diffusion models. However, the generation process of both …
Picoaudio: Enabling precise timestamp and frequency controllability of audio events in text-to-audio generation
Recently, audio generation tasks have attracted considerable research interests. Precise
temporal controllability is essential to integrate audio generation with real applications. In …
Recommendation with generative models
Generative models are a class of AI models capable of creating new instances of data by
learning and sampling from their statistical distributions. In recent years, these models have …
Controlspeech: Towards simultaneous zero-shot speaker cloning and zero-shot language style control with decoupled codec
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …
Unistyle: Unified style modeling for speaking style captioning and stylistic speech synthesis
Understanding the speaking style, such as the emotion of the interlocutor's speech, and
responding with speech in an appropriate style is a natural occurrence in human …