Enabling Calibration In The Zero-Shot Inference of Large Vision-Language Models

Will LeVine, Benjamin Pikus, Pranav Raja & Fernando Amat Gil

denotes equal contribution

Abstract: Calibration of deep learning models is crucial to their trustworthiness and safe usage, and as such has been extensively studied in supervised classification models, with methods crafted to decrease miscalibration. However, there has yet to be a comprehensive study of the calibration of vision-language models that are used for zero-shot inference, like CLIP. We measure calibration across relevant variables like prompt, dataset, and architecture, and find that zero-shot inference with CLIP is miscalibrated. Furthermore, we propose a modified version of temperature scaling that is aligned with the common use cases of CLIP as a zero-shot inference model, and show that a single learned temperature generalizes for each specific CLIP model (defined by a chosen pre-training dataset and architecture) across inference dataset and prompt choice.
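For illustration only, the sketch below shows standard temperature scaling (Guo et al., 2017), the technique the paper builds on: a single scalar temperature T is fit by minimizing negative log-likelihood on held-out data, then used to rescale logits at inference. This is a minimal PyTorch sketch, not the authors' exact procedure; the `logits` and `labels` tensors are hypothetical placeholders standing in for CLIP's image-text similarity scores and ground-truth class indices.

    import torch
    import torch.nn as nn

    def learn_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
        """Fit a single scalar temperature T by minimizing NLL on held-out
        predictions (standard temperature scaling). logits has shape (N, C)."""
        log_t = nn.Parameter(torch.zeros(1))  # optimize log T so T stays positive
        optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=100)
        nll = nn.CrossEntropyLoss()

        def closure():
            optimizer.zero_grad()
            loss = nll(logits / log_t.exp(), labels)
            loss.backward()
            return loss

        optimizer.step(closure)
        return log_t.exp().item()

    # Hypothetical usage: in the zero-shot setting, `logits` would be the
    # similarities between image embeddings and class-prompt text embeddings.
    logits = torch.randn(512, 10)          # placeholder zero-shot logits
    labels = torch.randint(0, 10, (512,))  # placeholder ground-truth labels
    T = learn_temperature(logits, labels)
    print(f"learned temperature: {T:.3f}")  # at inference: softmax(logits / T)

Note that rescaling by T does not change the argmax prediction, only the confidence of the resulting probabilities, which is what makes it a pure calibration method.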

Accepted to the ICLR 2023 Workshop on Pitfalls of Limited Data and Computation for Trustworthy ML.

BibTeX Citation

@misc{levine2023enablingcalibrationzeroshotinference,
      title={Enabling Calibration In The Zero-Shot Inference of Large Vision-Language Models}, 
      author={Will LeVine and Benjamin Pikus and Pranav Raja and Fernando Amat Gil},
      year={2023},
      eprint={2303.12748},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2303.12748}, 
}