Turning a clip model into a scene text detector
The recent large-scale Contrastive Language-Image Pretraining (CLIP) model has shown
great potential in various downstream tasks via leveraging the pretrained vision and …
great potential in various downstream tasks via leveraging the pretrained vision and …
Towards robust real-time scene text detection: From semantic to instance representation learning
Due to the flexible representation of arbitrary-shaped scene text and simple pipeline, bottom-
up segmentation-based methods begin to be mainstream in real-time scene text detection …
up segmentation-based methods begin to be mainstream in real-time scene text detection …
Turning a clip model into a scene text spotter
We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP)
model to enhance scene text detection and spotting tasks, transforming it into a robust …
model to enhance scene text detection and spotting tasks, transforming it into a robust …
Clipter: Looking at the bigger picture in scene text recognition
Reading text in real-world scenarios often requires understanding the context surrounding it,
especially when dealing with poor-quality text. However, current scene text recognizers are …
especially when dealing with poor-quality text. However, current scene text recognizers are …
Perceiving ambiguity and semantics without recognition: an efficient and effective ambiguous scene text detector
Ambiguous scene text detection is an extremely challenging task. Existing text detectors that
rely solely on visual cues often suffer from confusion due to being evenly distributed in …
rely solely on visual cues often suffer from confusion due to being evenly distributed in …
Leveraging Contrastive Language–Image Pre-Training and Bidirectional Cross-attention for Multimodal Keyword Spotting
In resource-limited keyword spotting scenarios, the scarcity of annotated corpora hinders
deep learning's ability to develop robust models for representing acoustic features. Recent …
deep learning's ability to develop robust models for representing acoustic features. Recent …
Less is more: Removing text-regions improves clip training efficiency and robustness
The CLIP (Contrastive Language-Image Pre-training) model and its variants are becoming
the de facto backbone in many applications. However, training a CLIP model from hundreds …
the de facto backbone in many applications. However, training a CLIP model from hundreds …
End-to-end semi-supervised approach with modulated object queries for table detection in documents
Table detection, a pivotal task in document analysis, aims to precisely recognize and locate
tables within document images. Although deep learning has shown remarkable progress in …
tables within document images. Although deep learning has shown remarkable progress in …
SCOB: Universal Text Understanding via Character-wise Supervised Contrastive Learning with Online Text Rendering for Bridging Domain Gap
Inspired by the great success of language model (LM)-based pre-training, recent studies in
visual document understanding have explored LM-based pre-training methods for modeling …
visual document understanding have explored LM-based pre-training methods for modeling …
Hierarchical visual-semantic interaction for scene text recognition
Proper interaction between visual and semantic features is crucial to obtain a powerful
feature representation for scene text recognition (STR). The existing interaction methods …
feature representation for scene text recognition (STR). The existing interaction methods …