Calibration of OpenAI o3 and o4-mini on Humanity's Last Exam | Scale AI