WavLLM: Towards robust and adaptive speech large language model
The recent advancements in large language models (LLMs) have revolutionized the field of
natural language processing, progressively broadening their scope to multimodal …
Autoregressive speech synthesis without vector quantization
We present MELLE, a novel continuous-valued tokens based language modeling approach
for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel …
E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-
autoregressive zero-shot text-to-speech system that offers human-level naturalness and …
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …
F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on
flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as …
Unistyle: Unified style modeling for speaking style captioning and stylistic speech synthesis
Understanding the speaking style, such as the emotion of the interlocutor's speech, and
responding with speech in an appropriate style is a natural occurrence in human …
SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation
Full-duplex multimodal large language models (LLMs) provide a unified framework for
addressing diverse speech understanding and generation tasks, enabling more natural and …
Enhancing automatic speech recognition with personalized models: Improving accuracy through individualized fine-tuning
V Brydinskyi, D Sabodashko, Y Khoma… - IEEE …, 2024 - ieeexplore.ieee.org
Automatic speech recognition (ASR) systems have become increasingly popular in recent
years due to their ability to convert spoken language into text. Nonetheless, despite their …
BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big
Adaptive Streamable TTS with Emergent abilities …
Multimodal Latent Language Modeling with Next-Token Diffusion
Multimodal generative models require a unified approach to handle both discrete data (e.g.,
text and code) and continuous data (e.g., image, audio, video). In this work, we propose …