Self-chained image-language model for video localization and question answering
Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …
Language models with image descriptors are strong few-shot video-language learners
The goal of this work is to build flexible video-language models that can generalize to
various video-to-text tasks from few examples. Existing few-shot video-language learners …
Exploring the reasoning abilities of multimodal large language models (MLLMs): A comprehensive survey on emerging trends in multimodal reasoning
Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract
reasoning ability is the goal of next-generation AI. Recent advancements in Large Language …
Multimodal Pre-training for Sequential Recommendation via Contrastive Learning
Sequential recommendation systems often suffer from data sparsity, leading to suboptimal
performance. While multimodal content, such as images and text, has been utilized to …
Language models are free boosters for biomedical imaging tasks
In this study, we uncover the unexpected efficacy of residual-based large language models
(LLMs) as part of encoders for biomedical imaging tasks, a domain traditionally devoid of …
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
This paper addresses the task of video question answering (videoQA) via a decomposed
multi-stage modular reasoning framework. Previous modular methods have shown promise …
Learning to decompose visual features with latent textual prompts
Recent advances in pre-training vision-language models like CLIP have shown great
potential in learning transferable visual representations. Nonetheless, for downstream …
Referring atomic video action recognition
We introduce a new task called Referring Atomic Video Action Recognition (RAVAR),
aimed at identifying atomic actions of a particular person based on a textual description and …
Defining a new NLP playground
The recent explosion of performance of large language models (LLMs) has changed the
field of Natural Language Processing (NLP) more abruptly and seismically than any other …
Vurf: A general-purpose reasoning and self-refinement framework for video understanding
Recent studies have demonstrated the effectiveness of Large Language Models (LLMs) as
reasoning modules that can deconstruct complex tasks into more manageable sub-tasks …