Evaluating davinci-003: Performance, Improvements, and Gaps | Scale AI