Towards audio language modeling--an overview

H Wu, X Chen, YC Lin, K Chang, HL Chung… - arXiv preprint arXiv …, 2024 - arxiv.org
Neural audio codecs are initially introduced to compress audio data into compact codes to
reduce transmission latency. Researchers recently discovered the potential of codecs as …

VALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers

S Chen, S Liu, L Zhou, Y Liu, X Tan, J Li, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces VALL-E 2, the latest advancement in neural codec language models
that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity …

FunCodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec

Z Du, S Zhang, K Hu, S Zheng - ICASSP 2024 - 2024 IEEE …, 2024 - ieeexplore.ieee.org
This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an
extension of the open-source speech processing toolkit FunASR. FunCodec provides …

WavTokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling

S Ji, Z Jiang, W Wang, Y Chen, M Fang, J Zuo… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models have been effectively applied to modeling natural signals, such as
images, video, speech, and audio. A crucial component of these models is the codec …

Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models

H Wu, X Chen, YC Lin, K Chang, J Du… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Neural audio codec models are becoming increasingly important as they serve as
tokenizers for audio, enabling efficient transmission or facilitating speech language …

Advancing large language models to capture varied speaking styles and respond properly in spoken conversations

GT Lin, CH Chiang, H Lee - arXiv preprint arXiv:2402.12786, 2024 - arxiv.org
In spoken dialogue, even if two current turns are the same sentence, their responses might
still differ when they are spoken in different styles. The spoken styles, containing …

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching

Y Chen, Z Niu, Z Ma, K Deng, C Wang, J Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on
flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as …

RepCodec: A speech representation codec for speech tokenization

Z Huang, C Meng, T Ko - arXiv preprint arXiv:2309.00169, 2023 - arxiv.org
With recent rapid growth of large language models (LLMs), discrete speech tokenization has
played an important role for injecting speech into LLMs. However, this discretization gives …

Codec-SUPERB: An in-depth analysis of sound codec models

H Wu, HL Chung, YC Lin, YK Wu, X Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The sound codec's dual roles in minimizing data transmission latency and serving as
tokenizers underscore its critical importance. Recent years have witnessed significant …

APCodec: A neural audio codec with parallel amplitude and phase spectrum encoding and decoding

Y Ai, XH Jiang, YX Lu, HP Du… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
This paper introduces a novel neural audio codec targeting high waveform sampling rates
and low bitrates named APCodec, which seamlessly integrates the strengths of parametric …