Scaling Open-Vocabulary Object Detection

M Minderer, A Gritsenko… - Advances in Neural …, 2024 - proceedings.neurips.cc
Open-vocabulary object detection has benefited greatly from pretrained vision-language
models, but is still limited by the amount of available detection training data. While detection …

Which Tokens to Use? Investigating Token Reduction in Vision Transformers

JB Haurum, S Escalera, GW Taylor… - Proceedings of the …, 2023 - openaccess.thecvf.com
Since the introduction of the Vision Transformer (ViT), researchers have sought to make ViTs
more efficient by removing redundant information in the processed tokens. While different …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

Multi-resolution Time-Series Transformer for Long-term Forecasting

Y Zhang, L Ma, S Pal, Y Zhang… - … Conference on Artificial …, 2024 - proceedings.mlr.press
The performance of transformers for time-series forecasting has improved significantly.
Recent architectures learn complex temporal patterns by segmenting a time-series into …

Agglomerative Token Clustering

JB Haurum, S Escalera, GW Taylor… - European Conference on …, 2024 - Springer
We present Agglomerative Token Clustering (ATC), a novel token merging method
that consistently outperforms previous token merging and pruning methods across image …

Learning to Merge Tokens via Decoupled Embedding for Efficient Vision Transformers

DH Lee, S Hong - arXiv preprint arXiv:2412.10569, 2024 - arxiv.org
Recent token reduction methods for Vision Transformers (ViTs) incorporate token merging,
which measures the similarities between token embeddings and combines the most similar …

Multi-dimension Transformer with Attention-based Filtering for Medical Image Segmentation

W Wang, X **ao, M Liu, Q Lan, X Huang… - 2024 IEEE 36th …, 2024 - ieeexplore.ieee.org
The accurate segmentation of medical images is crucial for diagnosing and treating
diseases. Recent studies demonstrate that vision transformer-based methods have …

Accelerating Transformers with Spectrum-Preserving Token Merging

HC Tran, DMH Nguyen, DM Nguyen… - arXiv preprint arXiv …, 2024 - arxiv.org
Increasing the throughput of the Transformer architecture, a foundational component used in
numerous state-of-the-art models for vision and language tasks (e.g., GPT, LLaVA), is an …

Token Compensator: Altering Inference Cost of Vision Transformer Without Re-tuning

S Jie, Y Tang, J Guo, ZH Deng, K Han… - European Conference on …, 2024 - Springer
Token compression expedites the training and inference of Vision Transformers (ViTs) by
reducing the number of redundant tokens, e.g., pruning inattentive tokens or merging …

From Similarity to Superiority: Channel Clustering for Time Series Forecasting

J Chen, JE Lenssen, A Feng, W Hu, M Fey… - arXiv preprint arXiv …, 2024 - arxiv.org
Time series forecasting has attracted significant attention in recent decades. Previous
studies have demonstrated that the Channel-Independent (CI) strategy improves forecasting …