WavChat: A Survey of Spoken Dialogue Models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

Recent Advances in Discrete Speech Tokens: A Review

Y Guo, Z Li, H Wang, B Li, C Shao, H Zhang… - arXiv preprint arXiv …, 2025 - arxiv.org
The rapid advancement of speech generation technologies in the era of large language
models (LLMs) has established discrete speech tokens as a foundational paradigm for …

Neural Codec Source Tracing: Toward Comprehensive Attribution in Open-Set Condition

Y Xie, X Wang, Z Wang, R Fu, Z Wen, S Cao… - arXiv preprint arXiv …, 2025 - arxiv.org
Current research in audio deepfake detection is gradually transitioning from binary
classification to multi-class tasks, referred as audio deepfake source tracing task. However …

FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

L Della Libera, F Paissan, C Subakan… - arXiv preprint arXiv …, 2025 - arxiv.org
Large language models have revolutionized natural language processing through self-
supervised pretraining on massive datasets. Inspired by this success, researchers have …

The ICME 2025 Audio Encoder Capability Challenge

J Zhang, H Dinkel, Q Song, H Wang, Y Niu… - arXiv preprint arXiv …, 2025 - arxiv.org
This challenge aims to evaluate the capabilities of audio encoders, especially in the context
of multi-task learning and real-world applications. Participants are invited to submit pre …

LUCY: Linguistic Understanding and Control Yielding Early Stage of Her

H Gao, H Shao, X Wang, C Qiu, Y Shen, S Cai… - arXiv preprint arXiv …, 2025 - arxiv.org
The film Her features Samantha, a sophisticated AI audio agent who is capable of
understanding both linguistic and paralinguistic information in human speech and delivering …

CodecFake-Omni: A Large-Scale Codec-based Deepfake Speech Dataset

J Du, X Chen, H Wu, L Zhang, I Lin, I Chiu… - arXiv preprint arXiv …, 2025 - arxiv.org
With the rapid advancement of codec-based speech generation (CoSG) systems, creating
fake speech that mimics an individual's identity and spreads misinformation has become …

Artificial Intelligence in Creative Industries: Advances Prior to 2025

N Anantrasirichai, F Zhang, D Bull - arXiv preprint arXiv:2501.02725, 2025 - arxiv.org
The rapid advancements in artificial intelligence (AI), particularly in generative AI and large
language models (LLMs), have profoundly impacted the creative industries by enabling …

MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model

M Baas, P Scholtz, A Mehta, E Dyson… - arXiv preprint arXiv …, 2025 - arxiv.org
Codec-based text-to-speech (TTS) models have shown impressive quality with zero-shot
voice cloning abilities. However, they often struggle with more expressive references or …

DAC-JAX: A JAX Implementation of the Descript Audio Codec

D Braun - arXiv preprint arXiv:2405.11554, 2024 - arxiv.org
We present an open-source implementation of the Descript Audio Codec (DAC) using
Google's JAX ecosystem of Flax, Optax, Orbax, AUX, and CLU. Our codebase enables the …