Towards open vocabulary learning: A survey

J Wu, X Li, S Xu, H Yuan, H Ding… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …

Implicit temporal modeling with learnable alignment for video recognition

S Tu, Q Dai, Z Wu, ZQ Cheng, H Hu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive language-image pretraining (CLIP) has demonstrated remarkable success in
various image tasks. However, how to extend CLIP with effective temporal modeling is still …

Open-vocabulary video anomaly detection

P Wu, X Zhou, G Pang, Y Sun, J Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Current video anomaly detection (VAD) approaches with weak supervision are inherently
limited to a closed-set setting and may struggle in open-world applications where there can …

Improving adversarial robustness of masked autoencoders via test-time frequency-domain prompting

Q Huang, X Dong, D Chen, Y Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we investigate the adversarial robustness of vision transformers that are
equipped with BERT pretraining (e.g., BEiT, MAE). A surprising observation is that MAE has …

HairCLIPv2: Unifying Hair Editing via Proxy Feature Blending

T Wei, D Chen, W Zhou, J Liao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Hair editing has made tremendous progress in recent years. Early hair editing methods use
well-drawn sketches or masks to specify the editing conditions. Even though they can …

Building an open-vocabulary video CLIP model with better architectures, optimization and data

Z Wu, Z Weng, W Peng, X Yang, A Li… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in
zero-shot image recognition, limited effort has been made exploring its potential for zero …

Learning from rich semantics and coarse locations for long-tailed object detection

L Meng, X Dai, J Yang, D Chen… - Advances in …, 2024 - proceedings.neurips.cc
Long-tailed object detection (LTOD) aims to handle the extreme data imbalance in real-world
datasets, where many tail classes have scarce instances. One popular strategy is to …

ChartReader: A unified framework for chart derendering and comprehension without heuristic rules

ZQ Cheng, Q Dai… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Charts are a powerful tool for visually conveying complex data, but their comprehension
poses a challenge due to the diverse chart types and intricate components. Existing chart …

3DStyle-Diffusion: Pursuing fine-grained text-driven 3D stylization with 2D diffusion models

H Yang, Y Chen, Y Pan, T Yao, Z Chen… - Proceedings of the 31st …, 2023 - dl.acm.org
3D content creation via text-driven stylization has posed a fundamental challenge to
the multimedia and graphics community. Recent advances in cross-modal foundation models …

Leveraging temporal contextualization for video action recognition

M Kim, D Han, T Kim, B Han - European Conference on Computer Vision, 2024 - Springer
We propose a novel framework for video understanding, called Temporally Contextualized
CLIP (TC-CLIP), which leverages essential temporal information through global interactions …