This repository collects research papers on adapting vision-language models (VLMs) without labeled data. It will be continuously updated to track the latest work in the community.
Keywords: unsupervised learning, test-time adaptation, vision-language models.
- [Aug 7, 2025] Just finished the manuscript.
- Data-Free Transfer
- Unsupervised Domain Transfer
- Episodic Test-Time Adaptation
- Online Test-Time Adaptation
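As background for the topics above, here is a minimal sketch of CLIP-style zero-shot classification and the prediction-entropy quantity that many of the test-time adaptation papers listed below minimize (e.g. TPT-style objectives). The feature arrays are illustrative placeholders, not real CLIP embeddings:

```python
import numpy as np

def zero_shot_probs(image_feat, text_feats, temperature=0.01):
    """CLIP-style zero-shot classification: cosine similarity between one
    image embedding and per-class text embeddings, turned into a softmax
    distribution over classes."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = txt @ img / temperature
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def entropy(p):
    """Prediction entropy; several test-time adaptation methods adapt
    prompts or features by minimizing this over augmented views."""
    return -np.sum(p * np.log(p + 1e-12))

# Toy example with placeholder 2-D "embeddings":
text_feats = np.array([[1.0, 0.0],   # class 0 text embedding
                       [0.0, 1.0]])  # class 1 text embedding
p = zero_shot_probs(np.array([0.9, 0.1]), text_feats)
print(p, entropy(p))
```

A confident prediction (image feature aligned with one class) yields low entropy, while an ambiguous one approaches the uniform-distribution maximum of `log(num_classes)`; test-time adaptation methods exploit this signal without any labels.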
Please see *Adapting Vision-Language Models Without Labels: A Comprehensive Survey* for details and comprehensive information. If you find our paper and this repository helpful, please consider citing:
```bibtex
@article{dong2025adapting,
  title={Adapting Vision-Language Models Without Labels: A Comprehensive Survey},
  author={Dong, Hao and Sheng, Lijun and Liang, Jian and He, Ran and Chatzi, Eleni and Fink, Olga},
  journal={arXiv preprint arXiv:2508.05547},
  year={2025}
}
```

- [ICCV-2025] FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation
- [ICCV-2025] Generate, Transduct, Adapt: Iterative Transduction with VLMs
- [ICCV-2025] BATCLIP: Bimodal Online Test-Time Adaptation for CLIP
- [ICCV-2025] Is Less More? Exploring Token Condensation as Training-free Test-time Adaptation
- [ICML-2025] From Local Details to Global Context: Advancing Vision-Language Models with Attention-based Selection
- [ICML-2025] GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models
- [CVPR-2025] CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP
- [CVPR-2025] TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models
- [CVPR-2025] SCAP: Transductive Test-Time Adaptation via Supportive Clique-based Attribute Prompting
- [CVPR-2025] O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models
- [CVPR-2025] R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning
- [CVPR-2025] Realistic Test-Time Adaptation of Vision-Language Models
- [CVPR-2025] SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models
- [CVPR-2025] Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM
- [CVPR-2025] Bayesian Test-Time Adaptation for Vision-Language Models
- [CVPR-2025] COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
- [CVPR-2025] Hierarchical Knowledge Prompt Tuning for Multi-task Test-Time Adaptation
- [CVPR-2025] On the Zero-shot Adversarial Robustness of Vision-Language Models: A Truly Zero-shot and Training-free Approach
- [ICLR-2025] RA-TTA: Retrieval-Augmented Test-Time Adaptation for Vision-Language Models
- [ICLR-2025] Noisy Test-Time Adaptation in Vision-Language Models
- [ICLR-2025] Efficient and Context-Aware Label Propagation for Zero-/Few-Shot Training-Free Adaptation of Vision-Language Model
- [ICLR-2025] DynaPrompt: Dynamic Test-Time Prompt Tuning
- [ICLR-2025] Test-time Adaptation for Cross-modal Retrieval with Query Shift
- [AAAI-2025] Learning to Prompt with Text Only Supervision for Vision-Language Models
- [AAAI-2025] Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model
- [WACV-2025] Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation
- [WACV-2025] Enhancing Visual Classification using Comparative Descriptors
- [WACV-2025] DPA: Dual Prototypes Alignment for Unsupervised Adaptation of Vision-Language Models
- [WACV-2025] LATTECLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts
- [WACV-2025] Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Vision-Language Models
- [WACV-2025] Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models
- [WACV-2025] CLIPArTT: Adaptation of CLIP to New Domains at Test Time
- [IJCV-2025] Diffusion-Enhanced Test-time Adaptation with Text and Image Augmentation
- [TIP-2025] Task-to-Instance Prompt Learning for Vision-Language Models at Test Time
- [PR-2025] A Closer Look at the Explainability of Contrastive Language-Image Pre-training
- [PR-2025] CTPT: Continual Test-time Prompt Tuning for Vision-Language Models
- [NeurIPS-2024] Boosting Vision-Language Models with Transduction
- [NeurIPS-2024] OTTER: Effortless Label Distribution Adaptation of Zero-shot Models
- [NeurIPS-2024] Frustratingly Easy Test-Time Adaptation of Vision-Language Models
- [NeurIPS-2024] AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation
- [NeurIPS-2024] WATT: Weight Average Test-Time Adaptation of CLIP
- [NeurIPS-2024] Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models
- [NeurIPS-2024] BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping
- [NeurIPS-2024] Historical Test-time Prompt Tuning for Vision Foundation Models
- [ACMMM-2024] WaveDN: A Wavelet-based Training-free Zero-shot Enhancement for Vision-Language Models
- [ACMMM-2024] Towards Robustness Prompt Tuning with Fully Test-Time Adaptation for CLIP’s Zero-Shot Generalization
- [ECCV-2024] Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
- [ECCV-2024] SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
- [ECCV-2024] ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
- [ECCV-2024] ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation
- [ECCV-2024] TAG: Text Prompt Augmentation for Zero-Shot Out-of-Distribution Detection
- [ECCV-2024] Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation
- [ECCV-2024] uCAP: An Unsupervised Prompting Method for Vision-Language Models
- [ECCV-2024] Robust Calibration of Large Vision-Language Adapters
- [ECCV-2024] Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation
- [ECCV-2024] In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation
- [ECCV-2024] Online Zero-Shot Classification with CLIP
- [IJCAI-2024] DTS-TPT: Dual Temporal-Sync Test-time Prompt Tuning for Zero-shot Activity Recognition
- [ICML-2024] Envisioning Outlier Exposure by Large Language Models for Out-of-Distribution Detection
- [ICML-2024] Realistic Unsupervised CLIP Fine-tuning with Universal Entropy Optimization
- [ICML-2024] Candidate Pseudolabel Learning: Enhancing Vision-Language Models by Prompt Tuning with Unlabeled Data
- [ICML-2024] Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
- [ICML-2024] Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models
- [CVPR-2024] Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
- [CVPR-2024] The Neglected Tails in Vision-Language Models
- [CVPR-2024] PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
- [CVPR-2024] Label Propagation for Zero-shot Classification with Vision-Language Models
- [CVPR-2024] Transductive Zero-Shot and Few-Shot CLIP
- [CVPR-2024] On the Test-Time Zero-Shot Generalization of Vision-Language Models: Do We Really Need Prompt Learning?
- [CVPR-2024] Test-Time Zero-Shot Temporal Action Localization
- [CVPR-2024] Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification
- [CVPR-2024] Efficient Test-Time Adaptation of Vision-Language Models
- [CVPR-2024] Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
- [CVPR-2024] Improved Self-Training for Test-Time Adaptation
- [CVPR-2024] Any-Shift Prompting for Generalization over Distributions
- [ICLR-2024] Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
- [ICLR-2024] C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion
- [ICLR-2024] PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts
- [ICLR-2024] Follow-Up Differential Descriptions: Language Models Resolve Ambiguities for Image Classification
- [AAAI-2024] Robust Test-Time Adaptation for Zero-Shot Prompt Tuning
- [AAAI-2024] DART: Dual-Modal Adaptive Online Prompting and Knowledge Retention for Test-Time Adaptation
- [WACV-2024] ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation
- [WACV-2024] CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free
- [WACV-2024] DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification
- [NeurIPS-2023] Neural Priming for Sample-Efficient Adaptation
- [NeurIPS-2023] ChatGPT-Powered Hierarchical Comparisons for Image Classification
- [NeurIPS-2023] LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections
- [NeurIPS-2023] Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning
- [NeurIPS-2023] Intra-Modal Proxy Learning for Zero-Shot Visual Categorization with CLIP
- [NeurIPS-2023] SwapPrompt: Test-Time Prompt Adaptation for Vision-Language Models
- [NeurIPS-2023] Diffusion-TTA: Test-time Adaptation of Discriminative Models via Generative Feedback
- [NeurIPS-2023] Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization
- [NeurIPS-2023] Test-Time Distribution Normalization for Contrastively Learned Vision-language Models
- [ACMMM-2023] VPA: Fully Test-Time Visual Prompt Adaptation
- [ICCV-2023] What Does a Platypus Look Like? Generating Customized Prompts for Zero-Shot Image Classification
- [ICCV-2023] SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
- [ICCV-2023] Waffling around for Performance: Visual Classification with Random Words and Broad Concepts
- [ICCV-2023] Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning
- [ICML-2023] CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets
- [ICML-2023] A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models
- [ICML-2023] POUF: Prompt-Oriented Unsupervised Fine-tuning for Large Pre-trained Models
- [CVPR-2023] Texts as Images in Prompt Tuning for Multi-Label Image Recognition
- [CVPR-2023] Improving Zero-shot Generalization and Robustness of Multi-modal Models
- [ICLR-2023] Visual Classification via Description from Large Language Models
- [ICLR-2023] Masked Unsupervised Self-training for Label-free Image Classification
- [AAAI-2023] CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention
- [NeurIPS-2022] ReCo: Retrieve and Co-segment for Zero-shot Transfer
- [NeurIPS-2022] Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models
- [ECCV-2022] Extract Free Dense Labels from CLIP
- [ICML-2021] Learning Transferable Visual Models From Natural Language Supervision
