Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Cited by 95

Artificial intelligence in the creative industries: a review

N Anantrasirichai, D Bull - Artificial intelligence review, 2022 - Springer

This paper reviews the current state of the art in artificial intelligence (AI) technologies and
applications in the context of the creative industries. A brief background of AI, and …

Cited by 674 · All 11 versions

Ego4d: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

Cited by 1015 · All 20 versions

Pose-controllable talking face generation by implicitly modularized audio-visual representation

H Zhou, Y Sun, W Wu, CC Loy… - Proceedings of the …, 2021 - openaccess.thecvf.com

While accurate lip synchronization has been achieved for arbitrary-subject audio-driven
talking face generation, the problem of how to efficiently drive the head pose remains …

Cited by 401 · All 10 versions

Localizing visual sounds the hard way

H Chen, W Xie, T Afouras, A Nagrani… - Proceedings of the …, 2021 - openaccess.thecvf.com

The objective of this work is to localize sound sources that are visible in a video without
using manual annotations. Our key technical contribution is to show that, by training the …

Cited by 221 · All 8 versions

Learning hierarchical cross-modal association for co-speech gesture generation

X Liu, Q Wu, H Zhou, Y Xu, R Qian… - Proceedings of the …, 2022 - openaccess.thecvf.com

Generating speech-consistent body and gesture movements is a long-standing problem in
virtual avatar creation. Previous studies often synthesize pose movement in a holistic …

Cited by 120 · All 7 versions

An overview of deep-learning-based audio-visual speech enhancement and separation

D Michelsanti, ZH Tan, SX Zhang, Y Xu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org

Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …

Cited by 306 · All 6 versions

Epic-fusion: Audio-visual temporal binding for egocentric action recognition

E Kazakos, A Nagrani, A Zisserman… - Proceedings of the …, 2019 - openaccess.thecvf.com

We focus on multi-modal fusion for egocentric action recognition, and propose a novel
architecture for multi-modal temporal-binding, i.e., the combination of modalities within a …

Cited by 428 · All 15 versions

Self-supervised learning of audio-visual objects from video

T Afouras, A Owens, JS Chung, A Zisserman - Computer Vision–ECCV …, 2020 - Springer

Our objective is to transform a video into a set of discrete audio-visual objects using self-
supervised learning. To this end, we introduce a model that uses attention to localize and …

Cited by 289 · All 8 versions

Imvotenet: Boosting 3d object detection in point clouds with image votes

CR Qi, X Chen, O Litany… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com

3D object detection has seen quick progress thanks to advances in deep learning
on point clouds. A few recent works have even shown state-of-the-art performance with just …

Cited by 332 · All 10 versions
