WavLLM: Towards robust and adaptive speech large language model

S Hu, L Zhou, S Liu, S Chen, L Meng, H Hao… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent advancements in large language models (LLMs) have revolutionized the field of
natural language processing, progressively broadening their scope to multimodal …

Autoregressive speech synthesis without vector quantization

L Meng, L Zhou, S Liu, S Chen, B Han, S Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MELLE, a novel continuous-valued token-based language modeling approach
for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel …
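As a rough illustration of the continuous-valued approach sketched in this snippet, the toy decoder below regresses the next mel-spectrogram frame directly with an MSE loss instead of predicting discrete codec indices with a softmax. The module names, dimensions, and architecture are illustrative assumptions, not MELLE's actual design.

```python
# Minimal sketch of autoregressive continuous-valued token modeling for TTS:
# the decoder regresses the next mel frame rather than classifying a codebook index.
# All names and sizes are illustrative assumptions, not MELLE's actual design.
import torch
import torch.nn as nn

class ContinuousARDecoder(nn.Module):
    def __init__(self, n_mels=80, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d_model)   # embed the previous mel frame
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, n_mels)  # regression head, no codebook

    def forward(self, mel_prev):                    # (B, T, n_mels)
        T = mel_prev.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(self.in_proj(mel_prev), mask=causal)
        return self.out_proj(h)                     # predicted next frames

# Teacher-forced training step: regress frame t+1 from frames <= t.
model = ContinuousARDecoder()
mel = torch.randn(2, 100, 80)                       # dummy mel spectrograms
pred = model(mel[:, :-1])                           # inputs: frames 0..T-2
loss = nn.functional.mse_loss(pred, mel[:, 1:])     # targets: frames 1..T-1
loss.backward()
```

Replacing the codebook and cross-entropy loss with a regression head is the essential difference from discrete-token TTS language models; the causal backbone itself stays standard.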

E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS

SE Eskimez, X Wang, M Thakker, C Li… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive
zero-shot text-to-speech system that offers human-level naturalness and …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …
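For readers unfamiliar with the objective this survey generalizes, the snippet below shows vanilla next-token prediction: shift the targets by one position and minimize cross-entropy. The toy embedding-plus-linear model is a placeholder assumption; a real NTP model would use a causal Transformer in its place.

```python
# Minimal sketch of the next-token prediction (NTP) objective: maximize the
# likelihood of token t+1 given tokens <= t. Vocabulary size and the toy model
# are assumptions; a real NTP model would be a causal Transformer.
import torch
import torch.nn as nn

vocab, d_model = 1000, 64
model = nn.Sequential(nn.Embedding(vocab, d_model), nn.Linear(d_model, vocab))

tokens = torch.randint(0, vocab, (2, 32))        # (batch, sequence) of token ids
logits = model(tokens[:, :-1])                   # a distribution per position
loss = nn.functional.cross_entropy(              # targets shifted by one position
    logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1)
)
loss.backward()
```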

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching

Y Chen, Z Niu, Z Ma, K Deng, C Wang, J Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on
flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as …
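Both E2 TTS and F5-TTS train a non-autoregressive model with a flow-matching objective. The sketch below shows a generic form of that loss on mel frames: sample a point on the straight path between noise and data and regress the model output onto the path's velocity. The tiny MLP stands in for the Diffusion Transformer, and the shapes and conditioning details are assumptions rather than either paper's exact recipe.

```python
# Minimal sketch of a conditional flow-matching objective of the kind used by
# fully non-autoregressive TTS systems. The MLP is a stand-in for the Diffusion
# Transformer; shapes and conditioning are illustrative assumptions.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, n_mels=80, d_cond=80, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels + d_cond + 1, d_hidden), nn.SiLU(),
            nn.Linear(d_hidden, n_mels),
        )

    def forward(self, x_t, t, cond):
        # Predict the velocity field v(x_t, t, cond) frame by frame.
        t = t[:, None, None].expand(-1, x_t.size(1), 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    """x1: clean mel frames (B, T, n_mels); cond: conditioning features."""
    x0 = torch.randn_like(x1)                 # noise endpoint of the path
    t = torch.rand(x1.size(0))                # random time in [0, 1]
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1
    target_v = x1 - x0                        # velocity of the straight path
    return nn.functional.mse_loss(model(x_t, t, cond), target_v)

model = VelocityNet()
loss = flow_matching_loss(model, torch.randn(2, 50, 80), torch.randn(2, 50, 80))
loss.backward()
```

At inference, speech would be synthesized by integrating the learned velocity field from noise with an ODE solver; that step is omitted here.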

UniStyle: Unified style modeling for speaking style captioning and stylistic speech synthesis

X Zhu, W Tian, X Wang, L He, Y Xiao, X Wang… - Proceedings of the …, 2024 - dl.acm.org
Understanding the speaking style, such as the emotion of the interlocutor's speech, and
responding with speech in an appropriate style is a natural occurrence in human …

SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation

W Yu, S Wang, X Yang, X Chen, X Tian… - arXiv preprint arXiv …, 2024 - arxiv.org
Full-duplex multimodal large language models (LLMs) provide a unified framework for
addressing diverse speech understanding and generation tasks, enabling more natural and …

Enhancing automatic speech recognition with personalized models: Improving accuracy through individualized fine-tuning

V Brydinskyi, D Sabodashko, Y Khoma… - IEEE …, 2024 - ieeexplore.ieee.org
Automatic speech recognition (ASR) systems have become increasingly popular in recent
years due to their ability to convert spoken language into text. Nonetheless, despite their …

BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data

M Łajszczak, G Cámbara, Y Li, F Beyhan… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig
$\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities …

Multimodal Latent Language Modeling with Next-Token Diffusion

Y Sun, H Bao, W Wang, Z Peng, L Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal generative models require a unified approach to handle both discrete data (e.g.,
text and code) and continuous data (e.g., image, audio, video). In this work, we propose …
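As a heavily simplified sketch of the next-token diffusion idea, the code below lets an autoregressive backbone emit a hidden state per position and trains a small denoising head to recover the noise added to the next continuous latent, conditioned on that state. The GRU backbone, the linear noising schedule, and all dimensions are illustrative assumptions, not the paper's actual parameterization.

```python
# Heavily simplified sketch of next-token diffusion over continuous latents:
# an AR backbone conditions a small per-token denoising head on its hidden state.
# Backbone choice, noise schedule, and sizes are illustrative assumptions.
import torch
import torch.nn as nn

d_latent, d_model = 16, 64
backbone = nn.GRU(d_latent, d_model, batch_first=True)          # stand-in AR backbone
denoiser = nn.Sequential(nn.Linear(d_latent + d_model + 1, 128),
                         nn.SiLU(), nn.Linear(128, d_latent))   # per-token diffusion head

latents = torch.randn(2, 20, d_latent)                          # continuous "tokens"
h, _ = backbone(latents[:, :-1])                                # states for positions 0..T-2
target = latents[:, 1:]                                         # next latent at each position

t = torch.rand(2, 19, 1)                                        # per-token noise level in [0, 1]
noise = torch.randn_like(target)
noisy = (1 - t) * target + t * noise                            # simple linear noising schedule
pred_noise = denoiser(torch.cat([noisy, h, t], dim=-1))
loss = nn.functional.mse_loss(pred_noise, noise)                # denoising objective per token
loss.backward()
```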