- Academic Search

D Stowell - PeerJ, 2022 - peerj.com

Animal vocalisations and natural soundscapes are fascinating objects of study, and contain
valuable evidence about animal behaviours, populations and ecosystems. They are studied …

保存引用被引用数: 272 関連記事全 10 バージョンキャッシュ

[Free GPT-4]

[PDF] arxiv.org

Self-supervised learning for videos: A survey

MC Schiappa, YS Rawat, M Shah - ACM Computing Surveys, 2023 - dl.acm.org

The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …

保存引用被引用数: 153 関連記事全 4 バージョン

[Free GPT-4]

[PDF] thecvf.com

Imagebind: One embedding space to bind them all

R Girdhar, A El-Nouby, Z Liu, M Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com

We present ImageBind, an approach to learn a joint embedding across six different
modalities-images, text, audio, depth, thermal, and IMU data. We show that all combinations …

保存引用被引用数: 836 関連記事全 7 バージョン HTMLバージョン

[Free GPT-4]

[PDF] mlr.press

Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models

R Huang, J Huang, D Yang, Y Ren… - International …, 2023 - proceedings.mlr.press

Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …

保存引用被引用数: 315 関連記事全 7 バージョン HTMLバージョン

[Free GPT-4]

[PDF] arxiv.org

Poisoning web-scale training datasets is practical

N Carlini, M Jagielski… - … IEEE Symposium on …, 2024 - ieeexplore.ieee.org

Deep learning models are often trained on distributed, web-scale datasets crawled from the
internet. In this paper, we introduce two new dataset poisoning attacks that intentionally …

保存引用被引用数: 191 関連記事全 6 バージョン

[Free GPT-4]

[PDF] arxiv.org

Socratic models: Composing zero-shot multimodal reasoning with language

A Zeng, M Attarian, B Ichter, K Choromanski… - arxiv preprint arxiv …, 2022 - arxiv.org

Large pretrained (eg," foundation") models exhibit distinct capabilities depending on the
domain of data they are trained on. While these domains are generic, they may only barely …

保存引用被引用数: 524 関連記事全 6 バージョン HTMLバージョン

[Free GPT-4]

[PDF] arxiv.org

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Y Wu, K Chen, T Zhang, Y Hui… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

Contrastive learning has shown remarkable success in the field of multimodal
representation learning. In this paper, we propose a pipeline of contrastive language-audio …

保存引用被引用数: 518 関連記事全 5 バージョン

[Free GPT-4]

[PDF] thecvf.com

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present Unified-IO 2 a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text images audio and/or videos as input and can …

保存引用被引用数: 103 関連記事全 3 バージョン HTMLバージョン

[Free GPT-4]

[PDF] arxiv.org

Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research

X Mei, C Meng, H Liu, Q Kong, T Ko… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org

The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …

保存引用被引用数: 156 関連記事全 3 バージョン

[Free GPT-4]

[PDF] arxiv.org

Audiogen: Textually guided audio generation

F Kreuk, G Synnaeve, A Polyak, U Singer… - arxiv preprint arxiv …, 2022 - arxiv.org

We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AaudioGen, an auto-regressive generative model that generates …

保存引用被引用数: 342 関連記事全 3 バージョン HTMLバージョン

アラートを作成

引用

検索オプション

マイライブラリに保存しました

Vggsound: A large-scale audio-visual dataset

Computational bioacoustics with deep learning: a review and roadmap

Self-supervised learning for videos: A survey

Imagebind: One embedding space to bind them all

Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models

Poisoning web-scale training datasets is practical

Socratic models: Composing zero-shot multimodal reasoning with language

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research

Audiogen: Textually guided audio generation