A review of modern recommender systems using generative models (Gen-RecSys)

Y Deldjoo, Z He, J McAuley, A Korikov… - Proceedings of the 30th …, 2024 - dl.acm.org
Traditional recommender systems typically use user-item rating histories as their main data
source. However, deep generative models now have the capability to model and sample …

TabPedia: Towards comprehensive visual table understanding with concept synergy

W Zhao, H Feng, Q Liu, J Tang, B Wu… - Advances in …, 2025 - proceedings.neurips.cc
Tables contain factual and quantitative data accompanied by various structures and
contents that pose challenges for machine comprehension. Previous methods generally …

Recommendation with generative models

Y Deldjoo, Z He, J McAuley, A Korikov… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative models are a class of AI models capable of creating new instances of data by
learning and sampling from their statistical distributions. In recent years, these models have …

Leveraging temporal contextualization for video action recognition

M Kim, D Han, T Kim, B Han - European Conference on Computer Vision, 2024 - Springer
We propose a novel framework for video understanding, called Temporally Contextualized
CLIP (TC-CLIP), which leverages essential temporal information through global interactions …

Rethinking clip-based video learners in cross-domain open-vocabulary action recognition

KY Lin, H Ding, J Zhou, YM Tang, YX Peng… - arXiv preprint arXiv …, 2024 - arxiv.org
Building upon the impressive success of CLIP (Contrastive Language-Image Pretraining),
recent pioneering works have proposed to adapt the powerful CLIP to video data, leading to …

Foundation models for video understanding: A survey

N Madan, A Møgelmose, R Modi, YS Rawat… - Authorea …, 2024 - techrxiv.org
Video Foundation Models (ViFMs) aim to develop general-purpose representations for
various video understanding tasks by leveraging large-scale datasets and powerful models …

AWT: Transferring vision-language models via augmentation, weighting, and transportation

Y Zhu, Y Ji, Z Zhao, G Wu, L Wang - arXiv preprint arXiv:2407.04603, 2024 - arxiv.org
Pre-trained vision-language models (VLMs) have shown impressive results in various visual
classification tasks. However, we often fail to fully unleash their potential when adapting …

LLAVIDAL: Benchmarking large language vision models for daily activities of living

R Chakraborty, A Sinha, D Reilly, MK Govind… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Vision Models (LLVMs) have demonstrated effectiveness in processing
internet videos, yet they struggle with the visually perplexing dynamics present in Activities …

MoTE: Reconciling generalization with specialization for visual-language to video knowledge transfer

M Zhu, Z Wang, M Hu, R Dang, X Lin, X Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Transferring visual-language knowledge from large-scale foundation models for video
recognition has proved to be effective. To bridge the domain gap, additional parametric …

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

H Wang, C Ju, W Lin, S Xiao, M Chen, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
In the rapidly evolving field of vision-language models (VLMs), contrastive language-image
pre-training (CLIP) has made significant strides, becoming the foundation for various downstream …