Learning in audio-visual context: A review, analysis, and new perspective
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …
Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …
AudioLDM 2: Learning holistic audio generation with self-supervised pretraining
Although audio generation shares commonalities across different types of audio, such as
speech, music, and sound effects, designing models for each type requires careful …
Diffsound: Discrete diffusion model for text-to-sound generation
Generating sound effects that people want is an important topic. However, there are limited
studies in this area for sound generation. In this study, we investigate generating sound …
Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners
Video and audio content creation serves as the core technique for the movie industry and
professional users. Recently existing diffusion-based methods tackle video and audio …
Diff-Foley: Synchronized video-to-audio synthesis with latent diffusion models
The Video-to-Audio (V2A) model has recently gained attention for its practical
application in generating audio directly from silent videos, particularly in video/film …
HiFi-Codec: Group-residual vector quantization for high fidelity audio codec
Audio codec models are widely used in audio communication as a crucial technique for
compressing audio into discrete representations. Nowadays, audio codec models are …
A survey on audio diffusion models: Text to speech synthesis and enhancement in generative AI
Generative AI has demonstrated impressive performance in various fields, among which
speech synthesis is an interesting direction. With the diffusion model as the most popular …
InstructTTS: Modelling expressive TTS in discrete latent space with natural language style prompt
Expressive text-to-speech (TTS) aims to synthesize speech with varying speaking styles to
better reflect human speech patterns. In this study, we attempt to use natural language as a …
Conditional generation of audio from video via foley analogies
The sound effects that designers add to videos are designed to convey a particular artistic
effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges …