A systematic survey of prompt engineering in large language models: Techniques and applications
Prompt engineering has emerged as an indispensable technique for extending the
capabilities of large language models (LLMs) and vision-language models (VLMs). This …
Vision-language models for vision tasks: A survey
Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks
(DNNs) training, and they usually train a DNN for each single visual recognition task …
Open-vocabulary semantic segmentation with mask-adapted CLIP
Open-vocabulary semantic segmentation aims to segment an image into semantic regions
according to text descriptions, which may not have been seen during training. Recent two …
Visual prompt multi-modal tracking
Visible-modal object tracking gives rise to a series of downstream multi-modal tracking
tributaries. To inherit the powerful representations of the foundation model, a natural modus …
Vision transformer adapter for dense predictions
This work investigates a simple yet powerful adapter for Vision Transformer (ViT). Unlike
recent visual transformers that introduce vision-specific inductive biases into their …
Repurposing diffusion-based image generators for monocular depth estimation
Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth
from a single image is geometrically ill-posed and requires scene understanding so it is not …
SimDA: Simple diffusion adapter for efficient video generation
The recent wave of AI-generated content has witnessed the great development and success
of Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of …
CORA: Adapting CLIP for open-vocabulary detection with region prompting and anchor pre-matching
Open-vocabulary detection (OVD) is an object detection task aiming at detecting objects
from novel categories beyond the base categories on which the detector is trained. Recent …
Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization
The promising zero-shot generalization of vision-language models such as CLIP has led to
their adoption using prompt learning for numerous downstream tasks. Previous works have …
Towards large-scale 3D representation learning with multi-dataset point prompt training
The rapid advancement of deep learning models is often attributed to their ability to leverage
massive training data. In contrast, such privilege has not yet fully benefited 3D deep learning …