[HTML][HTML] Review of large vision models and visual prompt engineering

J Wang, Z Liu, L Zhao, Z Wu, C Ma, S Yu, H Dai… - Meta-Radiology, 2023 - Elsevier
Visual prompt engineering is a fundamental methodology in the field of visual and image
artificial general intelligence. As the development of large vision models progresses, the …

A comprehensive survey on segment anything model for vision and beyond

C Zhang, L Liu, Y Cui, G Huang, W Lin, Y Yang… - arxiv preprint arxiv …, 2023 - arxiv.org
Artificial intelligence (AI) is evolving towards artificial general intelligence, which refers to the
ability of an AI system to perform a wide range of tasks and exhibit a level of intelligence …

Lisa: Reasoning segmentation via large language model

X Lai, Z Tian, Y Chen, Y Li, Y Yuan… - Proceedings of the …, 2024 - openaccess.thecvf.com
Although perception systems have made remarkable advancements in recent years they still
rely on explicit human instruction or pre-defined categories to identify the target objects …

Lerf: Language embedded radiance fields

J Kerr, CM Kim, K Goldberg… - Proceedings of the …, 2023 - openaccess.thecvf.com
Humans describe the physical world using natural language to refer to specific 3D locations
based on a vast range of properties: visual appearance, semantics, abstract associations, or …

Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip

Q Yu, J He, X Deng, X Shen… - Advances in Neural …, 2024 - proceedings.neurips.cc
Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing
objects from an open set of categories in diverse environments. One way to address this …

Video-chatgpt: Towards detailed video understanding via large vision and language models

M Maaz, H Rasheed, S Khan, FS Khan - arxiv preprint arxiv:2306.05424, 2023 - arxiv.org
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to
interact with visual data. While there have been initial attempts for image-based …

Side adapter network for open-vocabulary semantic segmentation

M Xu, Z Zhang, F Wei, H Hu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper presents a new framework for open-vocabulary semantic segmentation with the
pre-trained vision-language model, named SAN. Our approach models the semantic …

Vision-language models for vision tasks: A survey

J Zhang, J Huang, S **, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks
(DNNs) training, and they usually train a DNN for each single visual recognition task …

Openscene: 3d scene understanding with open vocabularies

S Peng, K Genova, C Jiang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a
model for a single task with supervision. We propose OpenScene, an alternative approach …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …