Towards audio language modeling - an overview

H Wu, X Chen, YC Lin, K Chang, HL Chung… - arXiv preprint arXiv …, 2024 - arxiv.org
Neural audio codecs were initially introduced to compress audio data into compact codes to
reduce transmission latency. Researchers recently discovered the potential of codecs as …

Sparks of large audio models: A survey and outlook

S Latif, M Shoukat, F Shamshad, M Usama… - arXiv preprint arXiv …, 2023 - arxiv.org
This survey paper provides a comprehensive overview of the recent advancements and
challenges in applying large language models to the field of audio signal processing. Audio …

Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models

Y Chu, J Xu, X Zhou, Q Yang, S Zhang, Z Yan… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, instruction-following audio-language models have received broad attention for
audio interaction with humans. However, the absence of pre-trained audio models capable …

UniAudio: An audio foundation model toward universal audio generation

D Yang, J Tian, X Tan, R Huang, S Liu, X Chang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have demonstrated the capability to handle a variety of
generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific …

VALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers

S Chen, S Liu, L Zhou, Y Liu, X Tan, J Li, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces VALL-E 2, the latest advancement in neural codec language models
that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity …

UniAudio: Towards universal audio generation with large language models

D Yang, J Tian, X Tan, R Huang, S Liu… - … on Machine Learning, 2024 - openreview.net
Audio generation is a major branch of generative AI research. Compared with prior works in
this area that are commonly task-specific with heavy domain knowledge, this paper …

ELLA-V: Stable neural codec language modeling with alignment-guided sequence reordering

Y Song, Z Chen, X Wang, Z Ma, X Chen - arXiv preprint arXiv:2401.07333, 2024 - arxiv.org
The language model (LM) approach based on acoustic and linguistic prompts, such as
VALL-E, has achieved remarkable progress in the field of zero-shot audio generation …

BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100k hours of data

M Łajszczak, G Cámbara, Y Li, F Beyhan… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig
$\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities …

E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS

SE Eskimez, X Wang, M Thakker, C Li… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully
non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and …

Codec-SUPERB: An in-depth analysis of sound codec models

H Wu, HL Chung, YC Lin, YK Wu, X Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The sound codec's dual roles in minimizing data transmission latency and serving as
tokenizers underscore its critical importance. Recent years have witnessed significant …