🎉 Our paper "Is Less More? Exploring Token Condensation as Training-free Test-time Adaptation" has been accepted at ICCV 2025!
This work explores efficient online test-time adaptation for real-world distribution shifts, focusing on robustness without retraining.
📍 Conference: ICCV 2025
📄 Paper: arXiv version
This repository contains the official implementation of TCA, our token condensation method for test-time adaptation of Vision Transformers. Our approach demonstrates that intelligent token pruning, combined with adaptive classification, can improve both efficiency and performance when adapting to distribution shifts at test time.
We explore whether reducing the number of visual tokens processed by Vision Transformers can improve efficiency without sacrificing performance. We implement and compare three token pruning strategies:
- EViT: Efficient Vision Transformer that drops tokens based on attention scores
- ToME: Token Merging that combines similar tokens
- Ours: Our novel token condensation approach using coreset averaging and hierarchical token selection
Our method is evaluated on CLIP (Contrastive Language-Image Pre-training) models across multiple datasets with test-time adaptation.
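For intuition, the EViT-style strategy above keeps only the patch tokens that receive the highest attention from the [CLS] token. Below is a minimal sketch of that idea (illustrative only; the tensor shapes and the `keep_ratio` argument are assumptions, not this repository's code):

```python
import torch

def prune_by_cls_attention(tokens: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the patch tokens most attended by [CLS].

    tokens:   (B, N, D) patch tokens, excluding the [CLS] token
    cls_attn: (B, N)    attention weights from [CLS] to each patch token
    """
    num_keep = max(1, int(tokens.shape[1] * keep_ratio))
    keep_idx = cls_attn.topk(num_keep, dim=1).indices                  # (B, k) most-attended tokens
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return torch.gather(tokens, dim=1, index=keep_idx)                 # (B, k, D)
```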
- 🚀 Efficient Token Condensation: Reduces computational cost by processing fewer tokens
- 🎯 Training-free Test-Time Adaptation: Adapts to test distributions using reservoir-based caching without retraining
- 📊 Comprehensive Evaluation: Tested on 15+ datasets including ImageNet variants
- 🔧 Flexible Framework: Easy to extend with new pruning methods
- 🌟 Real-world Robustness: Focuses on practical distribution shifts
- Python 3.9+
- CUDA-compatible GPU (recommended)
- Anaconda or Miniconda
- Clone the repository:
```bash
git clone https://github.com/Jo-wang/TCA.git
cd TCA
```
- Create conda environment:
```bash
conda env create -f environment.yaml
conda activate TTA
```

The framework supports the following datasets:
ImageNet Variants:
- ImageNet (I)
- ImageNet-A (A) - Natural adversarial examples
- ImageNet-V (V) - ImageNetV2 matched frequency
- ImageNet-R (R) - Rendition
- ImageNet-S (S) - Sketch
Fine-grained Classification:
- Caltech101
- DTD (Describable Textures Dataset)
- EuroSAT
- FGVC Aircraft (fine-grained aircraft classification)
- Food101
- Oxford Flowers
- Oxford Pets
- Stanford Cars
- SUN397
- UCF101
Based on the project's codebase, organize your datasets as follows:
```
data/
├── imagenet/
│   └── val/                      # ImageNet validation images
├── imagenet-a/                   # ImageNet-A (natural adversarial examples)
├── imagenet-r/                   # ImageNet-R (rendition)
├── imagenet-s/
│   └── sketch/                   # ImageNet-Sketch
├── caltech101/
│   └── 101_ObjectCategories/     # Caltech101 images
├── dtd/
│   └── images/                   # Describable Textures Dataset
├── eurosat/                      # EuroSAT satellite images
├── fgvc/
│   └── data/
│       └── images/               # FGVC Aircraft images
├── food-101/
│   └── images/                   # Food-101 images
├── oxford_flowers/               # Oxford Flowers images
├── oxford_pets/
│   ├── images/                   # Oxford Pets images
│   └── annotations/              # Oxford Pets annotations
├── sun397/
│   └── SUN397/                   # SUN397 images
└── ucf101/                       # UCF101 action recognition
```
For detailed dataset preparation instructions, please refer to CoOp's data preparation guide.
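Before running experiments, a small script along the following lines can confirm the expected layout (the folder names mirror the tree above; this helper is not part of the repository and only checks a few representative datasets):

```python
from pathlib import Path

# Representative (dataset, expected subfolder) pairs from the layout above
EXPECTED = {
    "imagenet": "imagenet/val",
    "imagenet-a": "imagenet-a",
    "imagenet-r": "imagenet-r",
    "imagenet-s": "imagenet-s/sketch",
    "caltech101": "caltech101/101_ObjectCategories",
    "dtd": "dtd/images",
    "food-101": "food-101/images",
    "oxford_pets": "oxford_pets/images",
}

def check_layout(data_root: str = "data") -> None:
    root = Path(data_root)
    for name, rel in EXPECTED.items():
        status = "ok" if (root / rel).is_dir() else "MISSING"
        print(f"{name:<12} {rel:<35} {status}")

if __name__ == "__main__":
    check_layout()
```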
Run the token pruning experiment with our method:
```bash
python runner.py
```
To keep the FLOPs cost comparable across EViT, ToME, and Ours, set the rate for Ours to 0.035 when EViT and ToME use 0.1 via `--token_pruning` (example commands are shown after the argument table below).
| Argument | Description | Default |
|---|---|---|
| `--config` | Path to configuration directory | `configs/` |
| `--datasets` | Datasets to process | `oxford_flowers` |
| `--data-root` | Path to datasets directory | `data/` |
| `--backbone` | CLIP model backbone | `ViT-B/16` |
| `--token_pruning` | Pruning method and rate | `Ours-0.035` |
| `--wandb-log` | Enable Weights & Biases logging | `False` |
| `--reservoir-sim` | Use cosine similarity for caching | `True` |
| `--div` | Use diverse samples for caching | `True` |
| `--token_sim` | Use token-level similarity | `True` |
| `--flag` | Fuse similarity with current sample | `True` |
Our method introduces several key innovations:
- Hierarchical Token Selection: Instead of binary keep/drop decisions, we use multiple levels of token importance
- Coreset Averaging: Groups similar tokens and represents them with fewer representative tokens (see the sketch after this list)
- Class Token Context: Leverages previously seen examples to guide token selection
- Information Preservation: Summarizes dropped tokens rather than discarding them completely
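As a rough illustration of the coreset-averaging and information-preservation points above, the sketch below merges each would-be-dropped token into its most similar retained token instead of discarding it (function and variable names are illustrative assumptions, not the repository's exact procedure):

```python
import torch
import torch.nn.functional as F

def condense_tokens(keep_tokens: torch.Tensor, drop_tokens: torch.Tensor) -> torch.Tensor:
    """Summarize low-importance tokens into their nearest kept token.

    keep_tokens: (K, D) tokens selected as important
    drop_tokens: (M, D) tokens that would otherwise be discarded
    """
    # Cosine similarity between every dropped token and every kept token
    sim = F.normalize(drop_tokens, dim=-1) @ F.normalize(keep_tokens, dim=-1).T   # (M, K)
    assign = sim.argmax(dim=1)                       # nearest kept token per dropped token

    # Accumulate each dropped token onto its assigned kept token, then average
    condensed = keep_tokens.clone()
    counts = torch.ones(keep_tokens.shape[0], device=keep_tokens.device)
    condensed.index_add_(0, assign, drop_tokens)
    counts.index_add_(0, assign, torch.ones(drop_tokens.shape[0], device=drop_tokens.device))
    return condensed / counts.unsqueeze(-1)
```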
The TCA framework (a simplified sketch follows this list):
- Maintains a reservoir of representative samples per class
- Uses feature similarity to guide sample selection
- Dynamically updates predictions based on test distribution
- Combines CLIP predictions with reservoir-based adaptation
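The following is a simplified sketch of reservoir-based adaptation in this spirit, fusing CLIP's zero-shot logits with similarity scores against cached high-confidence features (the class structure, capacity, and fusion weight `alpha` are illustrative assumptions, not the exact TCA implementation):

```python
import torch
import torch.nn.functional as F

class ClassReservoir:
    """Keeps a small set of high-confidence image features per pseudo-class."""

    def __init__(self, num_classes: int, capacity: int = 3):
        self.capacity = capacity
        self.feats = {c: [] for c in range(num_classes)}
        self.confs = {c: [] for c in range(num_classes)}

    def update(self, feat: torch.Tensor, pseudo_label: int, conf: float) -> None:
        feats, confs = self.feats[pseudo_label], self.confs[pseudo_label]
        feats.append(feat)
        confs.append(conf)
        if len(feats) > self.capacity:
            drop = min(range(len(confs)), key=confs.__getitem__)  # evict least confident sample
            feats.pop(drop)
            confs.pop(drop)

    def logits(self, feat: torch.Tensor, num_classes: int) -> torch.Tensor:
        out = torch.zeros(num_classes)
        for c, feats in self.feats.items():
            if feats:  # max cosine similarity to cached features of class c (features pre-normalized)
                out[c] = (torch.stack(feats) @ feat).max()
        return out

def adapt_prediction(clip_logits: torch.Tensor, image_feat: torch.Tensor,
                     reservoir: ClassReservoir, alpha: float = 0.3) -> torch.Tensor:
    """Fuse zero-shot CLIP logits with reservoir-based similarity scores."""
    conf, pseudo = F.softmax(clip_logits, dim=-1).max(dim=-1)
    reservoir.update(image_feat, int(pseudo), float(conf))
    cache_logits = reservoir.logits(image_feat, clip_logits.shape[-1])
    return clip_logits + alpha * cache_logits
```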
Our method achieves:
- Efficiency: Reduces token count by up to 90% in later transformer layers
- Performance: Maintains or improves accuracy compared to full token processing
- Adaptability: Better adaptation to test distributions through reservoir caching
If you use this code in your research, please cite:
```bibtex
@article{wang2024less,
  title={Is Less More? Exploring Token Condensation as Training-free Test-time Adaptation},
  author={Wang, Zixin and Gong, Dong and Wang, Sen and Huang, Zi and Luo, Yadan},
  journal={arXiv preprint arXiv:2410.14729},
  year={2024}
}
```

Acknowledgements:
- OpenAI CLIP for the base model
- EViT for efficient vision transformer implementation
- ToME for token merging techniques
- TDA for test-time adaptation on CLIP
For questions or issues, please open an issue on GitHub or contact [[email protected]].

