BATCLIP: Bimodal Online Test-Time Adaptation for CLIP

1The University of Texas at Dallas, 2MIT-IBM Watson AI Lab
ICCV 2025, Honolulu, Hawai'i 🌴🏖️

TL;DR 🗒️

We show that zero-shot CLIP exhibits limited robustness to common image corruptions at test time and transfers poorly across domains, highlighting the need to adapt CLIP to unlabeled, corrupted images using test-time adaptation (TTA). To this end, we introduce BATCLIP, a bimodal, online (single gradient step) TTA method designed to improve CLIP's robustness to common corruptions. BATCLIP also generalizes beyond corruptions, improving performance on domain generalization benchmarks.

Our framework for online CLIP adaptation at test time 💡


BATCLIP framework: BATCLIP not only adapts the visual encoder for highly discriminative image features but also promotes strong alignment between image and text features by adapting the text encoder, leading to improved performance after test-time adaptation. Our losses comprise entropy minimization, a projection matching loss between each visual class prototype and its corresponding text feature, and a final loss that encourages the separation of prototypes to learn strong decision boundaries. We adapt only the LayerNorm parameters of the CLIP encoders, which account for ~0.044% of the total parameters.
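As a concrete illustration, below is a minimal PyTorch-style sketch of the LayerNorm-only parameter selection and the three loss terms described above. The function names (collect_layernorm_params, entropy_loss, projection_matching_loss, prototype_separation_loss) and the specific formulations are illustrative assumptions, not the official implementation.

import torch
import torch.nn.functional as F

def collect_layernorm_params(model):
    # Only the LayerNorm affine parameters of both encoders are adapted (~0.044% of CLIP).
    params = []
    for module in model.modules():
        if isinstance(module, torch.nn.LayerNorm):
            params.extend(p for p in module.parameters() if p.requires_grad)
    return params

def entropy_loss(logits):
    # Entropy minimization over the predicted class distribution.
    probs = logits.softmax(dim=-1)
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()

def projection_matching_loss(prototypes, text_feats):
    # Pull each visual class prototype toward its corresponding text feature
    # by maximizing their cosine similarity (one possible choice of matching loss).
    prototypes = F.normalize(prototypes, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return (1.0 - (prototypes * text_feats).sum(dim=-1)).mean()

def prototype_separation_loss(prototypes):
    # Push distinct class prototypes apart to sharpen decision boundaries.
    protos = F.normalize(prototypes, dim=-1)
    if protos.size(0) < 2:
        return protos.new_zeros(())
    sim = protos @ protos.t()
    mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return sim[mask].mean()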

Abstract

Although open-vocabulary classification models like Contrastive Language Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive experiments, we show that zero-shot CLIP lacks robustness to common image corruptions at test time, necessitating the adaptation of CLIP to unlabeled corrupted images using test-time adaptation (TTA). However, we find that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose BATCLIP, a bimodal online TTA method designed to improve CLIP's robustness to common image corruptions. The key insight of our approach is not only to adapt the visual encoders for improving image features but also to strengthen the alignment between image and text features by promoting a stronger association between the image class prototype, computed using pseudo-labels, and the corresponding text feature. We evaluate our approach on benchmark image corruption datasets and achieve state-of-the-art results in online TTA for CLIP. Furthermore, we evaluate our proposed TTA approach on various domain generalization datasets to demonstrate its generalization capabilities.
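To make the online, single-gradient-step adaptation concrete, here is a rough sketch of one adaptation step that reuses the loss helpers sketched above. It assumes a CLIP-like model exposing encode_image/encode_text, pre-tokenized class prompts, and an optimizer built over the LayerNorm parameters; the name adapt_step, the equal loss weights, and the CLIP-style logit scale of 100 are assumptions for illustration only.

def adapt_step(model, optimizer, images, text_tokens):
    # Bimodal forward pass: both encoders are adapted, so text features are recomputed each step.
    image_feats = F.normalize(model.encode_image(images), dim=-1)
    text_feats = F.normalize(model.encode_text(text_tokens), dim=-1)
    logits = 100.0 * image_feats @ text_feats.t()  # CLIP-style logit scaling

    # Hard pseudo-labels define per-class visual prototypes within the current test batch.
    pseudo_labels = logits.argmax(dim=-1)
    classes = pseudo_labels.unique()
    prototypes = torch.stack([image_feats[pseudo_labels == c].mean(dim=0) for c in classes])

    loss = (entropy_loss(logits)
            + projection_matching_loss(prototypes, text_feats[classes])
            + prototype_separation_loss(prototypes))

    # Online TTA: a single gradient step on the LayerNorm parameters per test batch.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return pseudo_labels

In this sketch the optimizer would be constructed once, e.g. over collect_layernorm_params(model), and adapt_step is then called on each incoming test batch as it arrives.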

Limited robustness of zero-shot CLIP across backbones 🧑‍💼

Main Results 📈

BibTeX citation 🔖

If our work is of interest to you, consider citing it.
@inproceedings{maharana2025batclip,
  title={BATCLIP: Bimodal Online Test-Time Adaptation for CLIP},
  author={Maharana, Sarthak Kumar and Zhang, Baoming and Karlinsky, Leonid and Feris, Rogerio and Guo, Yunhui},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2025}
}