Simda: Simple diffusion adapter for efficient video generation
The recent wave of AI-generated content has witnessed the great development and success
of Text-to-Image (T2I) technologies. By contrast Text-to-Video (T2V) still falls short of …
Svformer: Semi-supervised video transformer for action recognition
Semi-supervised action recognition is a challenging but critical task due to the high cost of
video annotations. Existing approaches mainly use convolutional neural networks, yet …
XVO: Generalized visual odometry via cross-modal self-training
We propose XVO, a semi-supervised learning method for training generalized monocular
Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and …
Panoswin: a pano-style swin transformer for panorama understanding
In panorama understanding, the widely used equirectangular projection (ERP) entails
boundary discontinuity and spatial distortion. It severely deteriorates the conventional CNNs …
Few-shot single-view 3D reconstruction with memory prior contrastive network
3D reconstruction of novel categories based on few-shot learning is appealing in
real-world applications and attracts increasing research interest. Previous approaches …
Chasing consistency in text-to-3D generation from a single image
Text-to-3D generation from a single-view image is a popular but challenging task in 3D
vision. Although numerous methods have been proposed, existing works still suffer from the …
Vidiff: Translating videos via multi-modal instructions with diffusion models
Diffusion models have achieved significant success in image and video generation. This
motivates a growing interest in video editing tasks, where videos are edited according to …
Garnet: Global-aware multi-view 3D reconstruction network and the cost-performance tradeoff
Deep learning technology has made great progress in multi-view 3D reconstruction tasks. At
present, the mainstream solutions adopt different ways to fuse the features from several …
Umiformer: Mining the correlations between similar tokens for multi-view 3D reconstruction
In recent years, many video tasks have achieved breakthroughs by utilizing the vision
transformer and establishing spatial-temporal decoupling for feature extraction. Although …
FDGaussian: Fast Gaussian splatting from single image via geometric-aware diffusion model
Reconstructing detailed 3D objects from single-view images remains a challenging task due
to the limited information available. In this paper, we introduce FDGaussian, a novel two …