Audio-visual segmentation by exploring cross-modal mutual semantics

C Liu, PP Li, X Qi, H Zhang, L Li, D Wang… - Proceedings of the 31st …, 2023 - dl.acm.org
The audio-visual segmentation (AVS) task aims to segment sounding objects from a given
video. Existing works mainly focus on fusing audio and visual features of a given video to …

BAVS: Bootstrapping audio-visual segmentation by integrating foundation knowledge

C Liu, P Li, H Zhang, L Li, Z Huang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate sounding
sources by predicting pixel-wise maps. Previous methods assume that each sound …

Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation

X Qi, J Pan, P Li, R Yuan, X Chi, M Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation
in human-machine interaction applications. While the existing methods enable generating …

Chain of generation: Multi-modal gesture synthesis via cascaded conditional control

Z Xu, Y Zhang, S Yang, R Li, X Li - … of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
This study aims to improve the generation of 3D gestures by utilizing multimodal information
from human speech. Previous studies have focused on incorporating additional modalities …

The DiffuseStyleGesture+ entry to the GENEA Challenge 2023

S Yang, H Xue, Z Zhang, M Li, Z Wu, X Wu… - Proceedings of the 25th …, 2023 - dl.acm.org
In this paper, we introduce the DiffuseStyleGesture+, our solution for the Generation and
Evaluation of Non-verbal Behavior for Embodied Agents (GENEA) Challenge 2023, which …

Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis

Z Zhang, T Ao, Y Zhang, Q Gao, C Lin… - ACM Transactions on …, 2024 - dl.acm.org
In this work, we present Semantic Gesticulator, a novel framework designed to synthesize
realistic gestures accompanying speech with strong semantic correspondence. Semantically …

BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics

W Zhang, M Huang, Y Zhou, J Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
The recently emerging text-to-motion advances have inspired numerous attempts for
convenient and interactive human motion generation. Yet existing methods are largely …

MambaTalk: Efficient holistic gesture synthesis with selective state space models

Z Xu, Y Lin, H Han, S Yang, R Li, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Gesture synthesis is a vital realm of human-computer interaction, with wide-ranging
applications across various fields like film, robotics, and virtual reality. Recent advancements …

EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

H Liu, Z Zhu, G Becherini, Y Peng… - Proceedings of the …, 2024 - openaccess.thecvf.com
We propose EMAGE, a framework to generate full-body human gestures from audio and
masked gestures, encompassing facial, local body, hands, and global movements. To achieve …

Learning Transferable Compound Expressions from Masked AutoEncoder Pretraining

F Qiu, H Du, W Zhang, C Liu, L Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
Video-based Compound Expression Recognition (CER) aims to identify compound
expressions in everyday interactions per frame. Unlike rapid progress in Facial Expression …