The paper "Realistic Evaluation of Test-Time Adaptation Algorithms: Unsupervised Hyperparameter Selection" explores the challenge of robust machine learning model performance under distributional shifts—addressed by Test-Time Adaptation (TTA). TTA is designed to adapt a pre-trained model during inference in the absence of label data by adjusting hyperparameters, which significantly impact the adaptation process.
Key Contributions and Findings:
- Unsupervised Hyperparameter Selection: The paper critically examines the largely unexplored problem of unsupervised hyperparameter selection in TTA. Prior works often rely on oracle strategies that use test labels, which are unavailable in real-world deployments. This paper instead proposes surrogate-based methods for a more realistic evaluation.
- Evaluation of Surrogate Measures:
Varied surrogate measures are evaluated for model selection without test labels:
- Source Accuracy (s-acc): Uses accuracy on a held-out source-domain validation set as a proxy for test-domain performance.
- Cross-validation Accuracy (c-acc): Transfers hyperparameters tuned on a different benchmark (e.g., ImageNet-C).
- Entropy (ent): Selects the configuration with the lowest prediction entropy, in the spirit of TENT, though this can reward confidently wrong predictions.
- Consistency (con): Scores configurations by how well predictions agree across augmented views of test images; it aligns well with consistency-driven TTA objectives but transfers poorly to others.
- Soft Neighborhood Density (snd): A high score indicates that test features form dense neighborhoods, which favors methods that rely on well-clustered feature spaces. (A minimal sketch of these unsupervised surrogates follows this list.)
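As a concrete illustration, here is a minimal sketch of how the three unsupervised surrogates might be computed from a batch of model outputs. The function names and the SND temperature are illustrative assumptions, not taken from the paper's code; SND follows the formulation of Saito et al. (2021).

```python
import torch
import torch.nn.functional as F

def entropy_surrogate(logits: torch.Tensor) -> float:
    """Mean prediction entropy over the batch (lower is assumed better, as in TENT)."""
    probs = logits.softmax(dim=1)
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return ent.mean().item()

def consistency_surrogate(logits: torch.Tensor, logits_aug: torch.Tensor) -> float:
    """Fraction of samples whose predicted class survives augmentation
    (higher is assumed better)."""
    return (logits.argmax(dim=1) == logits_aug.argmax(dim=1)).float().mean().item()

def snd_surrogate(features: torch.Tensor, tau: float = 0.05) -> float:
    """Soft Neighborhood Density: mean entropy of each sample's softmax-normalized
    similarity to the other samples; higher scores indicate denser clusters."""
    f = F.normalize(features, dim=1)
    sim = f @ f.t() / tau
    sim.fill_diagonal_(float('-inf'))  # exclude self-similarity from the neighborhood
    p = sim.softmax(dim=1)
    ent = -(p * p.clamp_min(1e-12).log()).sum(dim=1)
    return ent.mean().item()
```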
- Challenges and Insights:
- TTA performance varies substantially with the chosen hyperparameters and selection metric. Methods such as AdaContrast perform strongly under ideal hyperparameter settings but degrade noticeably in more challenging scenarios.
- Stability degrades as adaptation sequences grow longer or as classes become temporally correlated; both conditions weaken the correlation between surrogate scores and target accuracy, making hyperparameter selection harder.
- Even access to a small number of labeled target samples substantially improves hyperparameter selection over purely unsupervised strategies (a sketch of such few-shot selection follows this list).
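The following sketch illustrates how a small labeled target subset could be used to pick among candidate configurations. The `run_adaptation` callable is a hypothetical stand-in for any TTA method; nothing here is the paper's exact protocol.

```python
import torch

def select_with_few_labels(configs, run_adaptation, x_labeled, y_labeled):
    """Score each candidate hyperparameter config on a small labeled target
    subset and keep the best. `run_adaptation(cfg)` is assumed to adapt the
    model on the unlabeled test stream and return the adapted model."""
    best_cfg, best_acc = None, -1.0
    for cfg in configs:
        model = run_adaptation(cfg)
        with torch.no_grad():
            preds = model(x_labeled).argmax(dim=1)
        acc = (preds == y_labeled).float().mean().item()
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc
```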
- Benchmark and Methodologies:
- Experiments are conducted across diverse datasets (e.g., CIFAR100-C, ImageNet-R) under conditions beyond the standard i.i.d. setting, such as temporally correlated class order and long adaptation sequences (see the stream-construction sketch after this list).
- Seven TTA methods are scrutinized, highlighting that methods with simpler objectives or episodic reset mechanisms (e.g., MEMO) tend to be more robust to hyperparameter choice.
- The paper emphasizes clear reporting of model selection strategies to support reproducibility and fair comparison of claims.
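One common way (assumed here, following prior non-i.i.d. TTA benchmarks, not necessarily the paper's exact procedure) to build a temporally correlated test stream is to draw per-slot class proportions from a Dirichlet distribution, where a small concentration parameter yields long runs of the same classes:

```python
import numpy as np

def correlated_stream(labels: np.ndarray, num_classes: int,
                      delta: float = 0.1, slot: int = 100,
                      rng=np.random.default_rng(0)) -> np.ndarray:
    """Reorder sample indices so that class identity correlates over time.
    Smaller `delta` concentrates each slot on fewer classes."""
    by_class = [list(rng.permutation(np.where(labels == c)[0]))
                for c in range(num_classes)]
    order = []
    while any(by_class):
        probs = rng.dirichlet(np.full(num_classes, delta))
        for _ in range(slot):
            avail = [c for c in range(num_classes) if by_class[c]]
            if not avail:
                break
            p = probs[avail] / probs[avail].sum()
            c = rng.choice(avail, p=p)
            order.append(by_class[c].pop())
    return np.array(order)
```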
Conclusions and Recommendations:
- For TTA Development: Emphasizes leveraging source data when available for stronger hyperparameter selection. Using an unsupervised selection strategy aligned with the method's objective (e.g., a consistency score for consistency-driven TTA methods) is advocated to avoid degenerate adaptation (see the selection sketch after these recommendations).
- For Research and Practice: Highlights the need for heterogeneous benchmarks that test adaptation reliability across scenarios beyond i.i.d. data.
- For Future Direction: Suggests expanding the exploration of selection strategies and developing parameter-free methods to simplify the deployment of TTA systems in real-world settings.
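To make the objective-aligned recommendation concrete, here is an illustrative sketch that pairs each TTA method with a matching surrogate and selects hyperparameters by that score. The method-to-surrogate mapping and function names are assumptions for illustration, not the paper's exact pairing.

```python
# Hypothetical pairing of each method with the surrogate that matches its
# training objective; fall back to source accuracy otherwise.
SURROGATE_FOR = {
    "tent": "entropy",            # entropy-minimizing method -> entropy score
    "adacontrast": "consistency", # consistency-driven method -> consistency score
}

def pick_hyperparameters(method: str, candidates, score_fn):
    """score_fn(method, cfg, surrogate_name) -> float, where higher is better
    (an entropy score would be negated upstream). Returns the best config."""
    surrogate = SURROGATE_FOR.get(method, "source_accuracy")
    return max(candidates, key=lambda cfg: score_fn(method, cfg, surrogate))
```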
The paper underscores the difficulty of deploying TTA effectively under real-world conditions: some techniques perform admirably in controlled settings, but consistent behavior across scenarios remains elusive without access to labeled data or oracle-selected hyperparameters.