Video-text as game players: Hierarchical Banzhaf interaction for cross-modal representation learning
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …
DiffusionRet: Generative text-video retrieval with diffusion model
Existing text-video retrieval solutions are, in essence, discriminant models focused on
maximizing the conditional likelihood, i.e., p(candidates | query). While straightforward, this …
M-mix: Generating hard negatives via multi-sample mixing for contrastive learning
Negative pairs, especially hard negatives as combined with common negatives (easy to
discriminate), are essential in contrastive learning, which plays a role of avoiding …
Text-video retrieval with disentangled conceptualization and set-to-set alignment
Text-video retrieval is a challenging cross-modal task, which aims to align visual entities with
natural language descriptions. Current methods either fail to leverage the local details or are …
Out-of-distributed semantic pruning for robust semi-supervised learning
Recent advances in robust semi-supervised learning (SSL) typically filter out-of-distribution
(OOD) information at the sample level. We argue that an overlooked problem of robust SSL …
Patch-level contrastive learning via positional query for visual pre-training
Dense contrastive learning (DCL) has been recently explored for learning localized
information for dense prediction tasks (e.g., detection and segmentation). It still suffers the …
Know your self-supervised learning: A survey on image-based generative and discriminative training
Although supervised learning has been highly successful in improving the state-of-the-art in
the domain of image-based computer vision in the past, the margin of improvement has …
Invariant Graph Learning Meets Information Bottleneck for Out-of-Distribution Generalization
Graph out-of-distribution (OOD) generalization remains a major challenge in graph learning
since graph neural networks (GNNs) often suffer from severe performance degradation …
QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition
Audiovisual segmentation (AVS) is a challenging task that aims to segment visual objects in
videos according to their associated acoustic cues. With multiple sound sources and …
PCP-MAE: Learning to predict centers for point masked autoencoders
Masked autoencoder has been widely explored in point cloud self-supervised learning,
whereby the point cloud is generally divided into visible and masked parts. These methods …