Videomamba: State space model for efficient video understanding
Addressing the dual challenges of local redundancy and global dependencies in video
understanding, this work innovatively adapts the Mamba to the video domain. The proposed …
Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives
We present Ego-Exo4D, a diverse, large-scale, multimodal, multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …
Mamba-nd: Selective state space modeling for multi-dimensional data
In recent years, Transformers have become the de-facto architecture for sequence modeling
on text and multi-dimensional data, such as images and video. However, the use of self …
Language models with image descriptors are strong few-shot video-language learners
The goal of this work is to build flexible video-language models that can generalize to
various video-to-text tasks from few examples. Existing few-shot video-language learners …
Verbs in action: Improving verb understanding in video-language models
Understanding verbs is crucial to modeling how people and objects interact with each other
and the environment through space and time. Recently, state-of-the-art video-language …
How to index item ids for recommendation foundation models
Recommendation foundation models utilize large language models (LLMs) for
recommendation by converting recommendation tasks into natural language tasks. It …
Long-form video-language pre-training with multimodal temporal contrastive learning
Large-scale video-language pre-training has shown significant improvement in video-
language understanding tasks. Previous studies of video-language pre-training mainly focus …
Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges
Sequence modeling is a crucial area across various domains, including Natural Language
Processing (NLP), speech recognition, time series forecasting, music generation, and …
Video-mined task graphs for keystep recognition in instructional videos
Procedural activity understanding requires perceiving human actions in terms of a broader
task, where multiple keysteps are performed in sequence across a long video to reach a …
Long movie clip classification with state-space video models
Most modern video recognition models are designed to operate on short video clips (e.g., 5–
10 s in length). Thus, it is challenging to apply such models to long movie understanding …