To launch a model evaluation run, navigate to the Evaluation Runs tab and create a new evaluation run.
A run consists of three key components:
- Evaluation metrics - These are the metrics you want your model responses evaluated on by our human annotators (e.g., overall quality, helpfulness, factuality).
- Prompt set - A prompt set is the set of prompts that your models will generate responses to.
- Models - Models are the large language models you want to evaluate. You can either define an endpoint or upload responses you've already collected from the model (or both!).
Once you've determined which metrics, prompts, and models to use, you can go ahead and launch your run! Upon run completion, you'll have access to our model analytics, which break down each model's performance on each metric.
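To make the three components concrete, here is a minimal sketch of what a run configuration could look like if assembled programmatically. This is purely illustrative: the field names (`evaluation_metrics`, `prompt_set`, `models`, `endpoint`, `responses_file`) are assumptions for the example, not the platform's actual schema.

```python
# Hypothetical run configuration illustrating the three components above.
# All field names are assumptions, not the platform's actual schema.
run_config = {
    "evaluation_metrics": ["overall_quality", "helpfulness", "factuality"],
    "prompt_set": "customer_support_prompts_v1",
    "models": [
        # Evaluate via a live endpoint...
        {"name": "model-a", "endpoint": "https://example.com/v1/generate"},
        # ...or upload responses you've already collected (or both!).
        {"name": "model-b", "responses_file": "model_b_responses.jsonl"},
    ],
}

def validate_run_config(config):
    """Check that all three required components are present and non-empty."""
    required = ("evaluation_metrics", "prompt_set", "models")
    missing = [key for key in required if not config.get(key)]
    if missing:
        raise ValueError(f"Run config is missing components: {missing}")
    return True

validate_run_config(run_config)  # passes: all three components are set
```

Note that each entry under `models` supplies either an endpoint to query or a file of pre-collected responses, matching the two options described above.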
Return to: Model Evaluation