Towards open vocabulary learning: A survey
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …
Implicit temporal modeling with learnable alignment for video recognition
Contrastive language-image pretraining (CLIP) has demonstrated remarkable success in
various image tasks. However, how to extend CLIP with effective temporal modeling is still …
Open-vocabulary video anomaly detection
Current video anomaly detection (VAD) approaches with weak supervisions are inherently
limited to a closed-set setting and may struggle in open-world applications where there can …
Improving adversarial robustness of masked autoencoders via test-time frequency-domain prompting
In this paper, we investigate the adversarial robustness of vision transformers that are
equipped with BERT pretraining (e.g., BEiT, MAE). A surprising observation is that MAE has …
HairCLIPv2: Unifying Hair Editing via Proxy Feature Blending
Hair editing has made tremendous progress in recent years. Early hair editing methods use
well-drawn sketches or masks to specify the editing conditions. Even though they can …
Building an open-vocabulary video CLIP model with better architectures, optimization and data
Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in
zero-shot image recognition, limited effort has been made exploring its potential for zero …
Learning from rich semantics and coarse locations for long-tailed object detection
Long-tailed object detection (LTOD) aims to handle the extreme data imbalance in real-
world datasets, where many tail classes have scarce instances. One popular strategy is to …
ChartReader: A unified framework for chart derendering and comprehension without heuristic rules
Charts are a powerful tool for visually conveying complex data, but their comprehension
poses a challenge due to the diverse chart types and intricate components. Existing chart …
3DStyle-Diffusion: Pursuing fine-grained text-driven 3D stylization with 2D diffusion models
3D content creation via text-driven stylization has played a fundamental challenge to
multimedia and graphics community. Recent advances of cross-modal foundation models …
Leveraging temporal contextualization for video action recognition
We propose a novel framework for video understanding, called Temporally Contextualized
CLIP (TC-CLIP), which leverages essential temporal information through global interactions …