Moshi: a speech-text foundation model for real-time dialogue

A Défossez, L Mazaré, M Orsini, A Royer… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue
framework. Current systems for spoken dialogue rely on pipelines of independent …

Video and audio deepfake datasets and open issues in deepfake technology: being ahead of the curve

Z Akhtar, TL Pendyala, VS Athmakuri - Forensic Sciences, 2024 - mdpi.com
The revolutionary breakthroughs in Machine Learning (ML) and Artificial Intelligence (AI) are
extensively being harnessed across a diverse range of domains, e.g., forensic science …

Masked generative video-to-audio transformers with enhanced synchronicity

S Pascual, C Yeh, I Tsiamas, J Serrà - European Conference on Computer …, 2024 - Springer
Video-to-audio (V2A) generation leverages visual-only video features to render
plausible sounds that match the scene. Importantly, the generated sound onsets should …

FireRedTTS: A foundation text-to-speech framework for industry-level generative speech applications

HH Guo, K Liu, FY Shen, YC Wu, FL Xie, K Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the
growing demands for personalized and diverse generative speech applications. The …

HierSpeech++: Bridging the gap between semantic and acoustic representation of speech by hierarchical variational inference for zero-shot speech synthesis

SH Lee, HY Choi, SB Kim, SW Lee - arXiv preprint arXiv:2311.12454, 2023 - arxiv.org
Large language models (LLM)-based speech synthesis has been widely adopted in zero-
shot speech synthesis. However, they require a large-scale data and possess the same …

Audio Super-Resolution With Robust Speech Representation Learning of Masked Autoencoder

SB Kim, SH Lee, HY Choi… - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
This paper proposes Fre-Painter, a high-fidelity audio super-resolution system that utilizes
robust speech representation learning with various masking strategies. Recently, masked …

SpecMaskGIT: Masked generative modeling of audio spectrograms for efficient audio synthesis and beyond

M Comunità, Z Zhong, A Takahashi, S Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in generative models that iteratively synthesize audio clips sparked great
success to text-to-audio synthesis (TTA), but with the cost of slow synthesis speed and heavy …

MusicHiFi: Fast high-fidelity stereo vocoding

G Zhu, JP Caceres, Z Duan, NJ Bryan - arXiv preprint arXiv:2403.10493, 2024 - arxiv.org
Diffusion-based audio and music generation models commonly generate music by
constructing an image representation of audio (e.g., a mel-spectrogram) and then converting …

Wave-U-Mamba: an end-to-end framework for high-quality and efficient speech super resolution

Y Lee, C Kim - arXiv preprint arXiv:2409.09337, 2024 - arxiv.org
Speech Super-Resolution (SSR) is a task of enhancing low-resolution speech signals by
restoring missing high-frequency components. Conventional approaches typically …

Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors

J Hauret, M Olivier, T Joubaud, C Langrenne… - arXiv preprint arXiv …, 2024 - arxiv.org
Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR)
containing audio recordings using five different body-conduction audio sensors: two in-ear …