Advancements in Generative AI: A Comprehensive Review of GANs, GPT, Autoencoders, Diffusion Model, and Transformers.

S Bengesi, H El-Sayed, MK Sarker, Y Houkpati… - IEEE …, 2024 - ieeexplore.ieee.org
The launch of ChatGPT in 2022 garnered global attention, marking a significant milestone in
the Generative Artificial Intelligence (GAI) field. While GAI has been in effect for the past …

Promises and challenges of generative artificial intelligence for human learning

L Yan, S Greiff, Z Teuber, D Gašević - Nature Human Behaviour, 2024 - nature.com
Generative artificial intelligence (GenAI) holds the potential to transform the delivery,
cultivation and evaluation of human learning. Here the authors examine the integration of …

The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models

Y Chu, J Xu, X Zhou, Q Yang, S Zhang, Z Yan… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, instruction-following audio-language models have received broad attention for
audio interaction with humans. However, the absence of pre-trained audio models capable …

NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

Z Ju, Y Wang, K Shen, X Tan, D Xin, D Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …

Understanding diffusion objectives as the ELBO with simple data augmentation

D Kingma, R Gao - Advances in Neural Information …, 2023 - proceedings.neurips.cc
To achieve the highest perceptual quality, state-of-the-art diffusion models are optimized
with objectives that typically look very different from the maximum likelihood and the …

Seamless: Multilingual Expressive and Streaming Speech Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org
Large-scale automatic speech translation systems today lack key features that help machine-
mediated communication feel seamless when compared to human-to-human dialogue. In …

XTTS: a massively multilingual zero-shot text-to-speech model

E Casanova, K Davis, E Gölge, G Göknar… - arXiv preprint arXiv …, 2024 - arxiv.org
Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language.
Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox explored Multilingual …

VALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers

S Chen, S Liu, L Zhou, Y Liu, X Tan, J Li, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces VALL-E 2, the latest advancement in neural codec language models
that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity …