Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models

R Huang, J Huang, D Yang, Y Ren… - International …, 2023 - proceedings.mlr.press
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …

AudioLDM 2: Learning holistic audio generation with self-supervised pretraining

H Liu, Y Yuan, X Liu, X Mei, Q Kong… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Although audio generation shares commonalities across different types of audio, such as
speech, music, and sound effects, designing models for each type requires careful …

Diffsound: Discrete diffusion model for text-to-sound generation

D Yang, J Yu, H Wang, W Wang… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
Generating sound effects that people want is an important topic. However, there are limited
studies in this area for sound generation. In this study, we investigate generating sound …

Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners

Y Xing, Y He, Z Tian, X Wang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Video and audio content creation serves as the core technique for the movie industry and
professional users. Recently, existing diffusion-based methods tackle video and audio …

Diff-Foley: Synchronized video-to-audio synthesis with latent diffusion models

S Luo, C Yan, C Hu, H Zhao - Advances in Neural …, 2023 - proceedings.neurips.cc
Abstract The Video-to-Audio (V2A) model has recently gained attention for its practical
application in generating audio directly from silent videos, particularly in video/film …

HiFi-Codec: Group-residual vector quantization for high fidelity audio codec

D Yang, S Liu, R Huang, J Tian, C Weng… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio codec models are widely used in audio communication as a crucial technique for
compressing audio into discrete representations. Nowadays, audio codec models are …

A survey on audio diffusion models: Text to speech synthesis and enhancement in generative AI

C Zhang, C Zhang, S Zheng, M Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative AI has demonstrated impressive performance in various fields, among which
speech synthesis is an interesting direction. With the diffusion model as the most popular …

InstructTTS: Modelling expressive TTS in discrete latent space with natural language style prompt

D Yang, S Liu, R Huang, C Weng… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
Expressive text-to-speech (TTS) aims to synthesize speech with varying speaking styles to
better reflect human speech patterns. In this study, we attempt to use natural language as a …

Conditional generation of audio from video via foley analogies

Y Du, Z Chen, J Salamon, B Russell… - Proceedings of the …, 2023 - openaccess.thecvf.com
The sound effects that designers add to videos are designed to convey a particular artistic
effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges …