Compositional chain-of-thought prompting for large multimodal models

C Mitra, B Huang, T Darrell… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …

Teaching structured vision & language concepts to vision & language models

S Doveh, A Arbelle, S Harary… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …

Dense and aligned captions (DAC) promote compositional reasoning in VL models

S Doveh, A Arbelle, S Harary… - Advances in …, 2023 - proceedings.neurips.cc
Vision and Language (VL) models offer an effective method for aligning representation
spaces of images and text, allowing for numerous applications such as cross-modal retrieval …

Helping hands: An object-aware ego-centric video recognition model

C Zhang, A Gupta, A Zisserman - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We introduce an object-aware decoder for improving the performance of spatio-temporal
representations on ego-centric videos. The key idea is to enhance object-awareness during …

Incorporating structured representations into pretrained vision & language models using scene graphs

R Herzig, A Mendelson, L Karlinsky, A Arbelle… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision and language models (VLMs) have demonstrated remarkable zero-shot (ZS)
performance in a variety of tasks. However, recent works have shown that even the best …

Learning correlation structures for vision transformers

M Kim, PH Seo, C Schmid… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
We introduce a new attention mechanism dubbed structural self-attention (StructSA) that
leverages rich correlation patterns naturally emerging in key-query interactions of attention …

PromptonomyViT: Multi-task prompt learning improves video transformers using synthetic scene data

R Herzig, O Abramovich… - Proceedings of the …, 2024 - openaccess.thecvf.com
Action recognition models have achieved impressive results by incorporating scene-level
annotations, such as objects, their relations, 3D structure, and more. However, obtaining …

Multimodal task vectors enable many-shot multimodal in-context learning

B Huang, C Mitra, A Arbelle, L Karlinsky… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning
suggests that in-context learning (ICL) with many examples can be promising for learning …

Egocentric video task translation

Z Xue, Y Song, K Grauman… - Proceedings of the …, 2023 - openaccess.thecvf.com
Different video understanding tasks are typically treated in isolation, and even with distinct
types of curated data (e.g., classifying sports in one dataset, tracking animals in another) …

Incorporating Scene Graphs into Pre-trained Vision-Language Models for Multimodal Open-vocabulary Action Recognition

C Wei, Z Deng - 2024 IEEE International Conference on …, 2024 - ieeexplore.ieee.org
This paper presents Action-SGFA, a novel action feature alignment approach to learn unified
joint embeddings across four action modalities incorporating scene graph (SG) …