Tracking and mapping in medical computer vision: A review

A Schmidt, O Mohareri, S DiMaio, MC Yip… - Medical Image …, 2024 - Elsevier
As computer vision algorithms increase in capability, their applications in clinical systems
will become more pervasive. These applications include: diagnostics, such as colonoscopy …

Deep learning-based depth estimation methods from monocular image and videos: A comprehensive survey

U Rajapaksha, F Sohel, H Laga, D Diepeveen… - ACM Computing …, 2024 - dl.acm.org
Estimating depth from single RGB images and videos is of widespread interest due to its
applications in many areas, including autonomous driving, 3D reconstruction, digital …

Blink: Multimodal large language models can see but not perceive

X Fu, Y Hu, B Li, Y Feng, H Wang, X Lin, D Roth… - … on Computer Vision, 2024 - Springer
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses
on core visual perception abilities not found in other evaluations. Most of the Blink tasks can …

SpatialRGPT: Grounded spatial reasoning in vision-language models

AC Cheng, H Yin, Y Fu, Q Guo… - Advances in …, 2025 - proceedings.neurips.cc
Vision Language Models (VLMs) have demonstrated remarkable performance in
2D vision and language tasks. However, their ability to reason about spatial arrangements …

FSGS: Real-time few-shot view synthesis using Gaussian splatting

Z Zhu, Z Fan, Y Jiang, Z Wang - European conference on computer vision, 2024 - Springer
Novel view synthesis from limited observations remains a crucial and ongoing challenge. In
the realm of NeRF-based few-shot view synthesis, there is often a trade-off between the …

GeoWizard: Unleashing the diffusion priors for 3D geometry estimation from a single image

X Fu, W Yin, M Hu, K Wang, Y Ma, P Tan… - … on Computer Vision, 2024 - Springer
We introduce GeoWizard, a new generative foundation model designed for estimating
geometric attributes, e.g., depth and normals, from single images. While significant research …

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

Y Hu, W Shi, X Fu, D Roth… - Advances in …, 2025 - proceedings.neurips.cc
Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry
problems; we mark and circle when reasoning on maps; we use sketches to amplify our …

Zero-shot image editing with reference imitation

X Chen, Y Feng, M Chen, Y Wang… - Advances in …, 2025 - proceedings.neurips.cc
Image editing serves as a practical yet challenging task considering the diverse demands
from users, where one of the hardest parts is to precisely describe how the edited image …

DreamScene4D: Dynamic multi-object scene generation from monocular videos

WH Chu, L Ke, K Fragkiadaki - Advances in Neural …, 2025 - proceedings.neurips.cc
View-predictive generative models provide strong priors for lifting object-centric images and
videos into 3D and 4D through rendering and score distillation objectives. A question then …

Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation

M Hu, W Yin, C Zhang, Z Cai, X Long… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
We introduce Metric3D v2, a geometric foundation model designed for zero-shot metric
depth and surface normal estimation from single images, critical for accurate 3D recovery …