Improving 2D feature representations by 3D-aware fine-tuning
Current visual foundation models are trained purely on unstructured 2D data, limiting their
understanding of 3D structure of objects and scenes. In this work, we show that fine-tuning …
StableNormal: Reducing diffusion variance for stable and sharp normal
This work addresses the challenge of high-quality surface normal estimation from monocular
colored inputs (i.e., images and videos), a field which has recently been revolutionized by …
Real-time 4k super-resolution of compressed AVIF images. AIS 2024 challenge survey
This paper introduces a novel benchmark for efficient image upscaling as part of the AIS
2024 Real-Time Image Super-Resolution (RTSR) Challenge which aims to upscale …
LiFT: A surprisingly simple lightweight feature transform for dense ViT descriptors
We present a simple self-supervised method to enhance the performance of ViT features for
dense downstream tasks. Our Lightweight Feature Transform (LiFT) is a straightforward and …
SegEarth-OV: Towards training-free open-vocabulary segmentation for remote sensing images
Remote sensing images play an irreplaceable role in fields such as agriculture, water
resources, military, and disaster relief. Pixel-level interpretation is a critical aspect of remote …
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
Y Zhang, Y Liu, Z Guo, Y Zhang, X Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
In multimodal large language models (MLLMs), vision transformers (ViTs) are widely
employed for visual encoding. However, their performance in solving universal MLLM tasks …
Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation
While recent foundational video generators produce visually rich output, they still struggle
with appearance drift, where objects gradually degrade or change inconsistently across …
Keypoint Abstraction using Large Models for Object-Relative Imitation Learning
Generalization to novel object configurations and instances across diverse tasks and
environments is a critical challenge in robotics. Keypoint-based representations have been …
SAMPart3D: Segment any part in 3D objects
3D part segmentation is a crucial and challenging task in 3D perception, playing a vital role
in applications such as robotics, 3D generation, and 3D editing. Recent methods harness …
A refreshed similarity-based upsampler for direct high-ratio feature upsampling
Feature upsampling is a fundamental and indispensable ingredient of almost all current
network structures for image segmentation tasks. Recently, a popular similarity-based …