Self-chained image-language model for video localization and question answering

S Yu, J Cho, P Yadav, M Bansal - Advances in Neural …, 2024 - proceedings.neurips.cc
Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …

Language models with image descriptors are strong few-shot video-language learners

Z Wang, M Li, R Xu, L Zhou, J Lei… - Advances in …, 2022 - proceedings.neurips.cc
The goal of this work is to build flexible video-language models that can generalize to
various video-to-text tasks from few examples. Existing few-shot video-language learners …

Exploring the reasoning abilities of multimodal large language models (MLLMs): A comprehensive survey on emerging trends in multimodal reasoning

Y Wang, W Chen, X Han, X Lin, H Zhao, Y Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract
reasoning ability is the goal of next-generation AI. Recent advancements in Large Language …

Multimodal Pre-training for Sequential Recommendation via Contrastive Learning

L Zhang, X Zhou, Z Zeng, Z Shen - ACM Transactions on Recommender …, 2024 - dl.acm.org
Sequential recommendation systems often suffer from data sparsity, leading to suboptimal
performance. While multimodal content, such as images and text, has been utilized to …

Language models are free boosters for biomedical imaging tasks

Z Lai, J Wu, S Chen, Y Zhou, A Hovakimyan… - arXiv preprint arXiv …, 2024 - arxiv.org
In this study, we uncover the unexpected efficacy of residual-based large language models
(LLMs) as part of encoders for biomedical imaging tasks, a domain traditionally devoid of …

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

J Min, S Buch, A Nagrani, M Cho… - Proceedings of the …, 2024 - openaccess.thecvf.com
This paper addresses the task of video question answering (videoQA) via a decomposed
multi-stage modular reasoning framework. Previous modular methods have shown promise …

Learning to decompose visual features with latent textual prompts

F Wang, M Li, X Lin, H Lv, AG Schwing, H Ji - arXiv preprint arXiv …, 2022 - arxiv.org
Recent advances in pre-training vision-language models like CLIP have shown great
potential in learning transferable visual representations. Nonetheless, for downstream …

Referring atomic video action recognition

K Peng, J Fu, K Yang, D Wen, Y Chen, R Liu… - … on Computer Vision, 2024 - Springer
We introduce a new task called Referring Atomic Video Action Recognition (RAVAR),
aimed at identifying atomic actions of a particular person based on a textual description and …

Defining a new NLP playground

S Li, C Han, P Yu, C Edwards, M Li, X Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
The recent explosion of performance of large language models (LLMs) has changed the
field of Natural Language Processing (NLP) more abruptly and seismically than any other …

VURF: A general-purpose reasoning and self-refinement framework for video understanding

A Mahmood, A Vayani, M Naseer, S Khan… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies have demonstrated the effectiveness of Large Language Models (LLMs) as
reasoning modules that can deconstruct complex tasks into more manageable sub-tasks …