LERF: Language embedded radiance fields

J Kerr, CM Kim, K Goldberg… - Proceedings of the …, 2023 - openaccess.thecvf.com
Humans describe the physical world using natural language to refer to specific 3D locations
based on a vast range of properties: visual appearance, semantics, abstract associations, or …

LangSplat: 3D language Gaussian splatting

M Qin, W Li, J Zhou, H Wang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Humans live in a 3D world and commonly use natural language to interact with a 3D scene.
Modeling a 3D language field to support open-ended language queries in 3D has gained …

Task me anything

J Zhang, W Huang, Z Ma, O Michel, D He… - arXiv preprint arXiv …, 2024 - arxiv.org
Benchmarks for large multimodal language models (MLMs) now serve to simultaneously
assess the general capabilities of models instead of evaluating for a specific capability. As a …

Going beyond nouns with vision & language models using synthetic data

P Cascante-Bonilla, K Shehada… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale pre-trained Vision & Language (VL) models have shown remarkable
performance in many applications, enabling replacing a fixed set of supported classes with …

Language-driven grasp detection

AD Vuong, MN Vu, B Huang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Grasp detection is a persistent and intricate challenge with various industrial applications.
Recently many methods and datasets have been proposed to tackle the grasp detection …

FMGS: Foundation model embedded 3D Gaussian splatting for holistic 3D scene understanding

X Zuo, P Samangouei, Y Zhou, Y Di, M Li - International Journal of …, 2024 - Springer
Precisely perceiving the geometric and semantic properties of real-world 3D objects is
crucial for the continued evolution of augmented reality and robotic applications. To this end …

Prompt-guided zero-shot anomaly action recognition using pretrained deep skeleton features

F Sato, R Hachiuma, T Sekii - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
This study investigates unsupervised anomaly action recognition, which identifies video-level
abnormal-human-behavior events in an unsupervised manner without abnormal …

SwapMix: Diagnosing and regularizing the over-reliance on visual context in visual question answering

V Gupta, Z Li, A Kortylewski, C Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
While Visual Question Answering (VQA) has progressed rapidly, previous works
raise concerns about robustness of current VQA models. In this work, we study the …

Dual learning with dynamic knowledge distillation for partially relevant video retrieval

J Dong, M Zhang, Z Zhang, X Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Almost all previous text-to-video retrieval works assume that videos are pre-trimmed with
short durations. However, in practice, videos are generally untrimmed containing much …

EarthVQA: Towards queryable earth via relational reasoning-based remote sensing visual question answering

J Wang, Z Zheng, Z Chen, A Ma, Y Zhong - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Earth vision research typically focuses on extracting geospatial object locations and
categories but neglects the exploration of relations between objects and comprehensive …