Improving 2D feature representations by 3D-aware fine-tuning

Y Yue, A Das, F Engelmann, S Tang… - European Conference on …, 2024 - Springer
Current visual foundation models are trained purely on unstructured 2D data, limiting their
understanding of 3D structure of objects and scenes. In this work, we show that fine-tuning …

StableNormal: Reducing diffusion variance for stable and sharp normal

C Ye, L Qiu, X Gu, Q Zuo, Y Wu, Z Dong, L Bo… - ACM Transactions on …, 2024 - dl.acm.org
This work addresses the challenge of high-quality surface normal estimation from monocular
colored inputs (i.e., images and videos), a field which has recently been revolutionized by …

Real-time 4K super-resolution of compressed AVIF images. AIS 2024 challenge survey

MV Conde, Z Lei, W Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
This paper introduces a novel benchmark for efficient image upscaling as part of the AIS
2024 Real-Time Image Super-Resolution (RTSR) Challenge which aims to upscale …

LiFT: A surprisingly simple lightweight feature transform for dense ViT descriptors

S Suri, M Walmer, K Gupta, A Shrivastava - European Conference on …, 2024 - Springer
We present a simple self-supervised method to enhance the performance of ViT features for
dense downstream tasks. Our Lightweight Feature Transform (LiFT) is a straightforward and …

SegEarth-OV: Towards training-free open-vocabulary segmentation for remote sensing images

K Li, R Liu, X Cao, X Bai, F Zhou, D Meng… - arXiv preprint arXiv …, 2024 - arxiv.org
Remote sensing images play an irreplaceable role in fields such as agriculture, water
resources, military, and disaster relief. Pixel-level interpretation is a critical aspect of remote …

LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer

Y Zhang, Y Liu, Z Guo, Y Zhang, X Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
In multimodal large language models (MLLMs), vision transformers (ViTs) are widely
employed for visual encoding. However, their performance in solving universal MLLM tasks …

Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation

H Jeong, CHP Huang, JC Ye, N Mitra… - arXiv preprint arXiv …, 2024 - arxiv.org
While recent foundational video generators produce visually rich output, they still struggle
with appearance drift, where objects gradually degrade or change inconsistently across …

Keypoint Abstraction using Large Models for Object-Relative Imitation Learning

X Fang, BR Huang, J Mao, J Shone… - arXiv preprint arXiv …, 2024 - arxiv.org
Generalization to novel object configurations and instances across diverse tasks and
environments is a critical challenge in robotics. Keypoint-based representations have been …

SAMPart3D: Segment any part in 3D objects

Y Yang, Y Huang, YC Guo, L Lu, X Wu, EY Lam… - arXiv preprint arXiv …, 2024 - arxiv.org
3D part segmentation is a crucial and challenging task in 3D perception, playing a vital role
in applications such as robotics, 3D generation, and 3D editing. Recent methods harness …

A refreshed similarity-based upsampler for direct high-ratio feature upsampling

M Zhou, H Wang, Y Zheng, D Meng - arXiv preprint arXiv:2407.02283, 2024 - arxiv.org
Feature upsampling is a fundamental and indispensable ingredient of almost all current
network structures for image segmentation tasks. Recently, a popular similarity-based …