ViTamin: Designing scalable vision models in the vision-language era

J Chen, Q Yu, X Shen, A Yuille… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent breakthroughs in vision-language models (VLMs) start a new page in the vision
community. The VLMs provide stronger and more generalizable feature embeddings …

Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation

M Hu, W Yin, C Zhang, Z Cai, X Long… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
We introduce Metric3D v2, a geometric foundation model designed for zero-shot metric
depth and surface normal estimation from single images, critical for accurate 3D recovery …

COCONut: Modernizing COCO segmentation

X Deng, Q Yu, P Wang, X Shen… - Proceedings of the …, 2024 - openaccess.thecvf.com
In recent decades, the vision community has witnessed remarkable progress in visual
recognition, partially owing to advancements in dataset benchmarks. Notably, the established …

GeminiFusion: Efficient pixel-wise multimodal fusion for vision transformer

D Jia, J Guo, K Han, H Wu, C Zhang, C Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Cross-modal transformers have demonstrated superiority in various vision tasks by
effectively integrating different modalities. This paper first critiques prior token exchange …

InvPT++: Inverted pyramid multi-task transformer for visual scene understanding

H Ye, D Xu - IEEE Transactions on Pattern Analysis and …, 2024 - ieeexplore.ieee.org
Multi-task scene understanding aims to design models that can simultaneously predict
several scene understanding tasks with one versatile model. Previous studies typically …

HAPNet: Toward superior RGB-thermal scene parsing via hybrid, asymmetric, and progressive heterogeneous feature fusion

J Li, P Yun, Q Chen, R Fan - arXiv preprint arXiv:2404.03527, 2024 - arxiv.org
Data-fusion networks have shown significant promise for RGB-thermal scene parsing.
However, the majority of existing studies have relied on symmetric duplex encoders for …

3D human reconstruction in the wild with synthetic data using generative models

Y Ge, W Wang, Y Chen, H Chen, C Shen - arXiv preprint arXiv:2403.11111, 2024 - arxiv.org
In this work, we show that synthetic data created by generative models is complementary to
computer graphics (CG) rendered data for achieving remarkable generalization …

Uni-EPM: A Unified Extensible Perception Model Without Labeling Everything

Y Gao, S Mu, S Xu - IEEE Transactions on Intelligent …, 2024 - ieeexplore.ieee.org
A multi-task perception system that simultaneously perceives various kinds of objects is essential
for autonomous driving. Existing perception frameworks always rely on multi-labeled …

Fine-tuned depth-augmented U-Net for enhanced semantic segmentation in indoor autonomous vision systems

HN Tran, TAN Le, NV Nguyen, NT Nguyen… - Journal of Real-Time …, 2025 - Springer
Recent technological advancements have significantly improved indoor autonomous vision
systems (IAVSs), underscoring the critical need to enhance their capability to interpret real …

Enhancing Monocular Depth Estimation with Multi-Source Auxiliary Tasks

A Quercia, E Yildiz, Z Cao, K Krajsek… - arXiv preprint arXiv …, 2025 - arxiv.org
Monocular depth estimation (MDE) is a challenging task in computer vision, often hindered
by the cost and scarcity of high-quality labeled datasets. We tackle this challenge using …