SAI3D: Segment any instance in 3D scenes
Advancements in 3D instance segmentation have traditionally been tethered to the
availability of annotated datasets, limiting their application to a narrow spectrum of object …
CLIP4STR: a simple baseline for scene text recognition with pre-trained vision-language model
Pre-trained vision-language models (VLMs) are the de-facto foundation models for various
downstream tasks. However, scene text recognition methods still prefer backbones pre …
Retrieving multimodal information for augmented generation: A survey
As Large Language Models (LLMs) become popular, there emerged an important trend of
using multimodality to augment the LLMs' generation ability, which enables LLMs to better …
Guiding image captioning models toward more specific captions
S Kornblith, L Li, Z Wang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Image captioning is conventionally formulated as the task of generating captions that match
the conditional distribution of reference image-caption pairs. However, reference captions in …
Fusing pre-trained language models with multimodal prompts through reinforcement learning
Language models are capable of commonsense reasoning: while domain-specific
models can learn from explicit knowledge (e.g., commonsense graphs [6], ethical norms [25]) …
Zero-shot visual relation detection via composite visual cues from large language models
Pretrained vision-language models, such as CLIP, have demonstrated strong generalization
capabilities, making them promising tools in the realm of zero-shot visual recognition. Visual …