Audio-visual segmentation by exploring cross-modal mutual semantics
The audio-visual segmentation (AVS) task aims to segment sounding objects from a given
video. Existing works mainly focus on fusing audio and visual features of a given video to …
BAVS: Bootstrapping audio-visual segmentation by integrating foundation knowledge
Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate sounding
sources by predicting pixel-wise maps. Previous methods assume that each sound …
Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation
Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation
in human-machine interaction applications. While the existing methods enable generating …
Chain of generation: Multi-modal gesture synthesis via cascaded conditional control
This study aims to improve the generation of 3D gestures by utilizing multimodal information
from human speech. Previous studies have focused on incorporating additional modalities …
The DiffuseStyleGesture+ entry to the GENEA Challenge 2023
In this paper, we introduce the DiffuseStyleGesture+, our solution for the Generation and
Evaluation of Non-verbal Behavior for Embodied Agents (GENEA) Challenge 2023, which …
Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis
In this work, we present Semantic Gesticulator, a novel framework designed to synthesize
realistic gestures accompanying speech with strong semantic correspondence. Semantically …
BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics
The recently emerging text-to-motion advances have inspired numerous attempts for
convenient and interactive human motion generation. Yet existing methods are largely …
MambaTalk: Efficient holistic gesture synthesis with selective state space models
Gesture synthesis is a vital realm of human-computer interaction, with wide-ranging
applications across various fields like film, robotics, and virtual reality. Recent advancements …
EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling
We propose EMAGE, a framework to generate full-body human gestures from audio and
masked gestures, encompassing facial, local body, hand, and global movements. To achieve …
Learning Transferable Compound Expressions from Masked AutoEncoder Pretraining
Video-based Compound Expression Recognition (CER) aims to identify compound
expressions in everyday interactions per frame. Unlike rapid progress in Facial Expression …