- Academic Search

A Défossez, L Mazaré, M Orsini, A Royer… - arxiv preprint arxiv …, 2024 - arxiv.org

We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue
framework. Current systems for spoken dialogue rely on pipelines of independent …

Cited by 45 · All 5 versions

Video and audio deepfake datasets and open issues in deepfake technology: being ahead of the curve

Z Akhtar, TL Pendyala, VS Athmakuri - Forensic Sciences, 2024 - mdpi.com

The revolutionary breakthroughs in Machine Learning (ML) and Artificial Intelligence (AI) are
extensively being harnessed across a diverse range of domains, e.g., forensic science …

Cited by 7

Masked generative video-to-audio transformers with enhanced synchronicity

S Pascual, C Yeh, I Tsiamas, J Serrà - European Conference on Computer …, 2024 - Springer

Abstract Video-to-audio (V2A) generation leverages visual-only video features to render
plausible sounds that match the scene. Importantly, the generated sound onsets should …

Cited by 8 · All 6 versions

FireRedTTS: A foundation text-to-speech framework for industry-level generative speech applications

HH Guo, K Liu, FY Shen, YC Wu, FL Xie, K Xie… - arxiv preprint arxiv …, 2024 - arxiv.org

This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the
growing demands for personalized and diverse generative speech applications. The …

Cited by 12 · All 2 versions

HierSpeech++: Bridging the gap between semantic and acoustic representation of speech by hierarchical variational inference for zero-shot speech synthesis

SH Lee, HY Choi, SB Kim, SW Lee - arxiv preprint arxiv:2311.12454, 2023 - arxiv.org

Large language models (LLM)-based speech synthesis has been widely adopted in
zero-shot speech synthesis. However, they require a large-scale data and possess the same …

Cited by 28 · All 2 versions

Audio Super-Resolution With Robust Speech Representation Learning of Masked Autoencoder

SB Kim, SH Lee, HY Choi… - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org

This paper proposes Fre-Painter, a high-fidelity audio super-resolution system that utilizes
robust speech representation learning with various masking strategies. Recently, masked …

Cited by 12 · All 3 versions

SpecMaskGIT: Masked generative modeling of audio spectrograms for efficient audio synthesis and beyond

M Comunità, Z Zhong, A Takahashi, S Yang… - arxiv preprint arxiv …, 2024 - arxiv.org

Recent advances in generative models that iteratively synthesize audio clips sparked great
success to text-to-audio synthesis (TTA), but with the cost of slow synthesis speed and heavy …

Cited by 2 · All 5 versions

MusicHiFi: Fast high-fidelity stereo vocoding

G Zhu, JP Caceres, Z Duan, NJ Bryan - arxiv preprint arxiv:2403.10493, 2024 - arxiv.org

Diffusion-based audio and music generation models commonly generate music by
constructing an image representation of audio (e.g., a mel-spectrogram) and then converting …

Cited by 4 · All 2 versions

Wave-U-Mamba: an end-to-end framework for high-quality and efficient speech super resolution

Y Lee, C Kim - arxiv preprint arxiv:2409.09337, 2024 - arxiv.org

Speech Super-Resolution (SSR) is a task of enhancing low-resolution speech signals by
restoring missing high-frequency components. Conventional approaches typically …

Cited by 2 · All 2 versions

Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors

J Hauret, M Olivier, T Joubaud, C Langrenne… - arxiv preprint arxiv …, 2024 - arxiv.org

Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR)
containing audio recordings using five different body-conduction audio sensors: two in-ear …

Cited by 1 · All 5 versions

AudioSR: Versatile audio super-resolution at scale

Moshi: a speech-text foundation model for real-time dialogue