- Academic Search

Human action recognition from various data modalities: A review

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE transactions on …, 2022 - ieeexplore.ieee.org

Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

Cited by 639

Computational bioacoustics with deep learning: a review and roadmap

D Stowell - PeerJ, 2022 - peerj.com

Animal vocalisations and natural soundscapes are fascinating objects of study, and contain
valuable evidence about animal behaviours, populations and ecosystems. They are studied …

Cited by 273

AudioLDM: Text-to-audio generation with latent diffusion models

H Liu, Z Chen, Y Yuan, X Mei, X Liu, D Mandic… - arXiv preprint arXiv …, 2023 - arxiv.org

Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general
audio based on text descriptions. However, previous studies in TTA have limited generation …

Cited by 535

VideoPoet: A large language model for zero-shot video generation

D Kondratyuk, L Yu, X Gu, J Lezama, J Huang… - arXiv preprint arXiv …, 2023 - arxiv.org

We present VideoPoet, a language model capable of synthesizing high-quality video, with
matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder …

Cited by 177

Masked autoencoders that listen

PY Huang, H Xu, J Li, A Baevski… - Advances in …, 2022 - proceedings.neurips.cc

This paper studies a simple extension of image-based Masked Autoencoders (MAE) to
self-supervised representation learning from audio spectrograms. Following the Transformer …

Cited by 254

Attention bottlenecks for multimodal fusion

A Nagrani, S Yang, A Arnab, A Jansen… - Advances in neural …, 2021 - proceedings.neurips.cc

Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …

Cited by 643

Noise2Music: Text-conditioned music generation with diffusion models

Q Huang, DS Park, T Wang, TI Denk, A Ly… - arXiv preprint arXiv …, 2023 - arxiv.org

We introduce Noise2Music, where a series of diffusion models is trained to generate
high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator …

Cited by 188

Contrastive learning for cold-start recommendation

Y Wei, X Wang, Q Li, L Nie, Y Li, X Li… - Proceedings of the 29th …, 2021 - dl.acm.org

Recommending purely cold-start items is a long-standing and fundamental challenge in the
recommender systems. Without any historical interaction on cold-start items, the …

Cited by 303

FSD50K: an open dataset of human-labeled sound events

E Fonseca, X Favory, J Pons, F Font… - IEEE/ACM Transactions …, 2021 - ieeexplore.ieee.org

Most existing datasets for sound event recognition (SER) are relatively small and/or
domain-specific, with the exception of AudioSet, based on over 2 M tracks from YouTube videos and …

Cited by 524

Multi-modal transformer for video retrieval

V Gabeur, C Sun, K Alahari, C Schmid - … 28, 2020, Proceedings, Part IV 16, 2020 - Springer

The task of retrieving video content relevant to natural language queries plays a critical role
in effectively handling internet-scale datasets. Most of the existing methods for this caption-to …

Cited by 743

CNN architectures for large-scale audio classification

Human action recognition from various data modalities: A review
