Compositional chain-of-thought prompting for large multimodal models
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …
Teaching structured vision & language concepts to vision & language models
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …
a variety of tasks. However, some aspects of complex language understanding still remain a …
Dense and aligned captions (dac) promote compositional reasoning in vl models
Vision and Language (VL) models offer an effective method for aligning representation
spaces of images and text allowing for numerous applications such as cross-modal retrieval …
spaces of images and text allowing for numerous applications such as cross-modal retrieval …
Object-region video transformers
Recently, video transformers have shown great success in video understanding, exceeding
CNN performance; yet existing video transformer models do not explicitly model objects …
CNN performance; yet existing video transformer models do not explicitly model objects …
Learning to generate scene graph from natural language supervision
Learning from image-text data has demonstrated recent success for many recognition tasks,
yet is currently limited to visual features or individual visual concepts such as objects. In this …
yet is currently limited to visual features or individual visual concepts such as objects. In this …
Incorporating structured representations into pretrained vision & language models using scene graphs
Vision and language models (VLMs) have demonstrated remarkable zero-shot (ZS)
performance in a variety of tasks. However, recent works have shown that even the best …
performance in a variety of tasks. However, recent works have shown that even the best …
Linguistic structures as weak supervision for visual scene graph generation
Prior work in scene graph generation requires categorical supervision at the level of triplets---
subjects and objects, and predicates that relate them, either with or without bounding box …
subjects and objects, and predicates that relate them, either with or without bounding box …
Promptonomyvit: Multi-task prompt learning improves video transformers using synthetic scene data
Action recognition models have achieved impressive results by incorporating scene-level
annotations, such as objects, their relations, 3D structure, and more. However, obtaining …
annotations, such as objects, their relations, 3D structure, and more. However, obtaining …
FETA: Towards specializing foundational models for expert task applications
Abstract Foundational Models (FMs) have demonstrated unprecedented capabilities
including zero-shot learning, high fidelity data synthesis, and out of domain generalization …
including zero-shot learning, high fidelity data synthesis, and out of domain generalization …
Bringing image scene structure to video via frame-clip consistency of object tokens
Recent action recognition models have achieved impressive results by integrating objects,
their locations and interactions. However, obtaining dense structured annotations for each …
their locations and interactions. However, obtaining dense structured annotations for each …