Compositional chain-of-thought prompting for large multimodal models

C Mitra, B Huang, T Darrell… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …

Teaching structured vision & language concepts to vision & language models

S Doveh, A Arbelle, S Harary… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …

Dense and aligned captions (DAC) promote compositional reasoning in VL models

S Doveh, A Arbelle, S Harary… - Advances in …, 2023 - proceedings.neurips.cc
Vision and Language (VL) models offer an effective method for aligning representation
spaces of images and text allowing for numerous applications such as cross-modal retrieval …

Object-region video transformers

R Herzig, E Ben-Avraham… - Proceedings of the …, 2022 - openaccess.thecvf.com
Recently, video transformers have shown great success in video understanding, exceeding
CNN performance; yet existing video transformer models do not explicitly model objects …

Learning to generate scene graph from natural language supervision

Y Zhong, J Shi, J Yang, C Xu… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Learning from image-text data has demonstrated recent success for many recognition tasks,
yet is currently limited to visual features or individual visual concepts such as objects. In this …

Incorporating structured representations into pretrained vision & language models using scene graphs

R Herzig, A Mendelson, L Karlinsky, A Arbelle… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision and language models (VLMs) have demonstrated remarkable zero-shot (ZS)
performance in a variety of tasks. However, recent works have shown that even the best …

Linguistic structures as weak supervision for visual scene graph generation

K Ye, A Kovashka - … of the IEEE/CVF Conference on …, 2021 - openaccess.thecvf.com
Prior work in scene graph generation requires categorical supervision at the level of triplets---
subjects and objects, and predicates that relate them, either with or without bounding box …

PromptonomyViT: Multi-task prompt learning improves video transformers using synthetic scene data

R Herzig, O Abramovich… - Proceedings of the …, 2024 - openaccess.thecvf.com
Action recognition models have achieved impressive results by incorporating scene-level
annotations, such as objects, their relations, 3D structure, and more. However, obtaining …

FETA: Towards specializing foundational models for expert task applications

A Alfassy, A Arbelle, O Halimi… - Advances in …, 2022 - proceedings.neurips.cc
Foundational Models (FMs) have demonstrated unprecedented capabilities
including zero-shot learning, high fidelity data synthesis, and out of domain generalization …

Bringing image scene structure to video via frame-clip consistency of object tokens

E Ben Avraham, R Herzig… - Advances in …, 2022 - proceedings.neurips.cc
Recent action recognition models have achieved impressive results by integrating objects,
their locations and interactions. However, obtaining dense structured annotations for each …