Compositional chain-of-thought prompting for large multimodal models
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …
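As a rough illustration of the idea the title points at, the sketch below shows a two-stage prompt: the model first emits a scene graph of the image, then answers conditioned on it. The lmm callable is a hypothetical stand-in for any image-plus-text generation call, not a real library API.

# Minimal sketch of two-stage compositional chain-of-thought prompting.
# `lmm` is a hypothetical stand-in for any (image, prompt) -> text model call.

def compositional_cot(lmm, image, question):
    # Stage 1: ask the model to describe the image as a scene graph
    # (objects, attributes, relations) before answering.
    graph_prompt = (
        "List the objects in the image, their attributes, and the "
        "relationships between them as a scene graph."
    )
    scene_graph = lmm(image, graph_prompt)

    # Stage 2: condition the final answer on the generated scene graph,
    # grounding the reasoning in an explicit compositional structure.
    answer_prompt = (
        f"Scene graph:\n{scene_graph}\n\n"
        f"Using the scene graph as context, answer: {question}"
    )
    return lmm(image, answer_prompt)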
Teaching structured vision & language concepts to vision & language models
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …
Dense and aligned captions (DAC) promote compositional reasoning in VL models
Vision and Language (VL) models offer an effective method for aligning representation
spaces of images and text, allowing for numerous applications such as cross-modal retrieval …
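Once the two representation spaces are aligned, cross-modal retrieval reduces to nearest-neighbour search under cosine similarity. A minimal sketch, assuming precomputed embeddings (the random tensors in the usage line stand in for real encoder outputs):

import torch
import torch.nn.functional as F

def retrieve(image_emb: torch.Tensor, text_emb: torch.Tensor, k: int = 5):
    """Rank texts for each image by cosine similarity in the shared space.

    image_emb: (N, D) image embeddings, text_emb: (M, D) text embeddings,
    both produced by a VL model whose spaces have been aligned.
    """
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    sims = img @ txt.T                   # (N, M) cosine similarities
    scores, idx = sims.topk(k, dim=-1)   # top-k captions per image
    return scores, idx

# Toy usage with random features in place of real encoder outputs.
scores, idx = retrieve(torch.randn(4, 512), torch.randn(100, 512))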
Helping hands: An object-aware ego-centric video recognition model
We introduce an object-aware decoder for improving the performance of spatio-temporal
representations on ego-centric videos. The key idea is to enhance object-awareness during …
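One generic way to add object-awareness to a video decoder, shown below purely as a sketch and not as the paper's exact design, is to let a small set of learned object queries cross-attend to the spatio-temporal feature tokens:

import torch
import torch.nn as nn

class ObjectAwareDecoder(nn.Module):
    """Generic sketch: learned object queries cross-attend to flattened
    spatio-temporal video features, yielding object-centric tokens that
    can be fused with the clip-level representation."""

    def __init__(self, dim: int = 256, num_objects: int = 8, heads: int = 8):
        super().__init__()
        self.object_queries = nn.Parameter(torch.randn(num_objects, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (B, T*H*W, dim) flattened spatio-temporal tokens.
        B = video_feats.size(0)
        q = self.object_queries.unsqueeze(0).expand(B, -1, -1)
        obj, _ = self.cross_attn(q, video_feats, video_feats)
        return self.norm(obj)  # (B, num_objects, dim) object-centric tokens

obj_tokens = ObjectAwareDecoder()(torch.randn(2, 8 * 14 * 14, 256))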
Incorporating structured representations into pretrained vision & language models using scene graphs
Vision and language models (VLMs) have demonstrated remarkable zero-shot (ZS)
performance in a variety of tasks. However, recent works have shown that even the best …
Learning correlation structures for vision transformers
We introduce a new attention mechanism dubbed structural self-attention (StructSA) that
leverages rich correlation patterns naturally emerging in key-query interactions of attention …
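A simplified sketch of the idea: each query's row of key correlations is viewed as a 2-D map over the spatial grid and passed through a small convolution, so that local correlation structure informs the attention weights. The dimensions and single-channel conv are illustrative simplifications of the published mechanism.

import torch
import torch.nn as nn

class StructuralSelfAttention(nn.Module):
    """Sketch of the StructSA idea: treat each query's key-correlation map
    as a 2-D pattern and convolve over it to expose local correlation
    structure before aggregation."""

    def __init__(self, dim: int, height: int, width: int):
        super().__init__()
        self.h, self.w = height, width
        self.qkv = nn.Linear(dim, 3 * dim)
        # Small conv that scores local structures in the correlation map.
        self.struct = nn.Conv2d(1, 1, kernel_size=3, padding=1)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape            # N == height * width spatial tokens
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        corr = (q @ k.transpose(-2, -1)) * self.scale   # (B, N, N)
        # View each query's row of correlations as an HxW map and convolve.
        maps = corr.reshape(B * N, 1, self.h, self.w)
        corr = self.struct(maps).reshape(B, N, N)
        attn = corr.softmax(dim=-1)
        return attn @ v              # (B, N, D)

out = StructuralSelfAttention(64, 14, 14)(torch.randn(2, 196, 64))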
PromptonomyViT: Multi-task prompt learning improves video transformers using synthetic scene data
Action recognition models have achieved impressive results by incorporating scene-level
annotations, such as objects, their relations, 3D structure, and more. However, obtaining …
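The prompt-learning idea can be sketched as per-task learnable tokens prepended to a shared transformer's patch tokens; the task names, depth, and widths below are illustrative placeholders, not the paper's configuration.

import torch
import torch.nn as nn

class MultiTaskPromptViT(nn.Module):
    """Sketch of multi-task prompt learning: each task owns a few learnable
    prompt tokens prepended to the patch tokens of a shared transformer."""

    def __init__(self, dim=384, depth=4, tasks=("depth", "objects"), n_prompts=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.prompts = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(n_prompts, dim) * 0.02) for t in tasks}
        )

    def forward(self, patch_tokens: torch.Tensor, task: str) -> torch.Tensor:
        B = patch_tokens.size(0)
        p = self.prompts[task].unsqueeze(0).expand(B, -1, -1)
        tokens = torch.cat([p, patch_tokens], dim=1)   # prompts + patches
        out = self.encoder(tokens)
        return out[:, : p.size(1)]  # task-prompt outputs feed a task head

feats = MultiTaskPromptViT()(torch.randn(2, 196, 384), task="depth")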
Multimodal task vectors enable many-shot multimodal in-context learning
The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning
suggests that in-context learning (ICL) with many examples can be promising for learning …
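In broad strokes, a task vector compresses many in-context examples into a single activation that can be patched into a zero-shot forward pass. The sketch below elides the method's key design choices (which layer, token, and scaling to use) and is only meant to convey the mechanics:

import torch

def extract_task_vector(hidden_states: torch.Tensor) -> torch.Tensor:
    """Simplified task-vector extraction: average a chosen layer's
    last-token hidden state across many-shot prompts.
    hidden_states: (num_examples, dim), gathered from forward passes."""
    return hidden_states.mean(dim=0)

def apply_task_vector(h: torch.Tensor, task_vec: torch.Tensor, alpha: float = 1.0):
    """Patch the compressed task context into a zero-shot forward pass by
    adding it to the hidden state at the same layer and position."""
    return h + alpha * task_vec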
Egocentric video task translation
Different video understanding tasks are typically treated in isolation, and even with distinct
types of curated data (e.g., classifying sports in one dataset, tracking animals in another) …
Incorporating Scene Graphs into Pre-trained Vision-Language Models for Multimodal Open-vocabulary Action Recognition
C Wei, Z Deng - 2024 IEEE International Conference on …, 2024 - ieeexplore.ieee.org
This paper presents Action-SGFA, a novel action feature alignment approach to learn unified
joint embeddings across four action modalities incorporating scene graph (SG) …
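A common recipe for this kind of joint-embedding alignment, offered here as a generic sketch rather than Action-SGFA's actual objective, is a symmetric InfoNCE-style contrastive loss between paired embeddings from two modalities:

import torch
import torch.nn.functional as F

def alignment_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, tau: float = 0.07):
    """Symmetric contrastive loss aligning two modalities in a shared
    embedding space; row i of each input is a positive pair."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.T / tau                       # (N, N) similarity matrix
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

loss = alignment_loss(torch.randn(8, 256), torch.randn(8, 256))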