- Academic Search

Knowledge editing for large language models: A survey

S Wang, Y Zhu, H Liu, Z Zheng, C Chen, J Li - ACM Computing Surveys, 2024 - dl.acm.org

Large Language Models (LLMs) have recently transformed both the academic and industrial
landscapes due to their remarkable capacity to understand, analyze, and generate texts …

Cited by: 103; versions: 2

[PDF] arxiv.org

Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org

Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Cited by: 136; versions: 2

[PDF] thecvf.com

Imagebind: One embedding space to bind them all

R Girdhar, A El-Nouby, Z Liu, M Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com

We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …

Cited by: 840; versions: 7

[PDF] arxiv.org

Audioldm: Text-to-audio generation with latent diffusion models

H Liu, Z Chen, Y Yuan, X Mei, X Liu, D Mandic… - arXiv preprint arXiv …, 2023 - arxiv.org

Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general
audio based on text descriptions. However, previous studies in TTA have limited generation …

Cited by: 538; versions: 7

[PDF] mlr.press

Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models

R Huang, J Huang, D Yang, Y Ren… - International …, 2023 - proceedings.mlr.press

Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …

Cited by: 317; versions: 7

[PDF] arxiv.org

High fidelity neural audio compression

A Défossez, J Copet, G Synnaeve, Y Adi - arXiv preprint arXiv:2210.13438, 2022 - arxiv.org

We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural
networks. It consists in a streaming encoder-decoder architecture with quantized latent …

Cited by: 702; versions: 3

[PDF] neurips.cc

High-fidelity audio compression with improved rvqgan

R Kumar, P Seetharaman, A Luebs… - Advances in Neural …, 2024 - proceedings.neurips.cc

Language models have been successfully used to model natural signals, such as
images, speech, and music. A key component of these models is a high quality neural …

Cited by: 243; versions: 5

[PDF] neurips.cc

Any-to-any generation via composable diffusion

Z Tang, Z Yang, C Zhu, M Zeng… - Advances in Neural …, 2024 - proceedings.neurips.cc

We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image, video, or audio …

Cited by: 148; versions: 8

[PDF] mlr.press

Data2vec: A general framework for self-supervised learning in speech, vision and language

A Baevski, WN Hsu, Q Xu, A Babu… - … on Machine Learning, 2022 - proceedings.mlr.press

While the general idea of self-supervised learning is identical across modalities, the actual
algorithms and objectives differ widely because they were developed with a single modality …

Cited by: 934; versions: 6

[PDF] arxiv.org

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Y Wu, K Chen, T Zhang, Y Hui… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

Contrastive learning has shown remarkable success in the field of multimodal
representation learning. In this paper, we propose a pipeline of contrastive language-audio …

Cited by: 520; versions: 5

Saved to My Library

Audio set: An ontology and human-labeled dataset for audio events

Knowledge editing for large language models: A survey

Foundation Models Defining a New Era in Vision: a Survey and Outlook

Imagebind: One embedding space to bind them all

Audioldm: Text-to-audio generation with latent diffusion models

Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models

High fidelity neural audio compression

High-fidelity audio compression with improved rvqgan

Any-to-any generation via composable diffusion

Data2vec: A general framework for self-supervised learning in speech, vision and language

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation