Tools for Evaluating Models on GCC
Vertex AI Model Eval and AutoSxS
Vertex AI provides two different tools for evaluating models:
- Model Eval evaluates a model against metrics such as AUC and log loss and reports the results
- AutoSxS compares the responses of two models side by side and determines which one gives the better answer
Note
There are no screenshots for this section, as these tools can currently only be invoked via code.
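Since the invocation is code-only, below is a minimal sketch of launching an AutoSxS comparison through the Vertex AI SDK for Python. The project, bucket, dataset path, and column names are placeholders, and the pipeline template path and parameter names follow the public AutoSxS documentation at the time of writing, so verify them against the current docs before use.

```python
from google.cloud import aiplatform

# Placeholders: project ID, region, and staging bucket must be replaced with your own.
aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-bucket",
)

# AutoSxS is packaged as a prebuilt Vertex AI pipeline template; the template path and
# parameter names below are assumptions based on the public docs and may change.
job = aiplatform.PipelineJob(
    display_name="autosxs-qa-eval",
    pipeline_root="gs://my-bucket/pipeline_root",
    template_path=(
        "https://us-kfp.pkg.dev/ml-pipeline/"
        "large-language-model-pipelines/autosxs-template/default"
    ),
    parameter_values={
        # JSONL dataset containing the prompts plus pre-generated responses from both models
        "evaluation_dataset": "gs://my-bucket/eval/autosxs_dataset.jsonl",
        "id_columns": ["question"],
        "task": "question_answering",
        "autorater_prompt_parameters": {
            "inference_instruction": {"column": "question"},
            "inference_context": {"column": "context"},
        },
        # Columns holding each model's response for the side-by-side comparison
        "response_column_a": "model_a_response",
        "response_column_b": "model_b_response",
    },
)
job.run()
```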
AWS Bedrock
Note
Currently available only in selected regions, such as N. Virginia (us-east-1)
AWS Bedrock provides a simple tool for evaluating models on metrics such as accuracy and toxicity. It can also leverage Amazon SageMaker to add a human-in-the-loop step for manual evaluation, with reviewers supplied either by you or by AWS through Amazon Mechanical Turk.
Setting up an evaluation task
Step 1: Select an evaluation task
Step 2: Configure the evaluation parameters (automatic evaluation in this example). Note the available evaluation task types.
Step 3: Create the evaluation job and wait for it to complete
Step 4: View results. The web console provides a summary view of the evaluation results for metrics such as accuracy and toxicity.
More detailed per-entry evaluation results can be found in the output file.
For human-based evaluation, your reviewers receive an annotation task and provide the evaluation results manually, instead of the results being system-generated and tabulated automatically.
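For teams that prefer to script the automatic flow rather than clicking through the console, the same evaluation job can be created through the Bedrock API. The sketch below uses boto3; the role ARN, bucket paths, model identifier, and the exact nested parameter names are placeholders or assumptions, so confirm them against the current boto3/Bedrock documentation.

```python
import boto3

# Bedrock control-plane client in a supported region (N. Virginia).
bedrock = boto3.client("bedrock", region_name="us-east-1")

# A minimal sketch of an automatic evaluation job; parameter shapes are assumptions to verify.
response = bedrock.create_evaluation_job(
    jobName="qa-accuracy-toxicity-eval",
    roleArn="arn:aws:iam::111122223333:role/BedrockEvalRole",  # placeholder IAM role
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "my-qa-dataset",
                        "datasetLocation": {"s3Uri": "s3://my-bucket/eval/prompts.jsonl"},
                    },
                    # Built-in metric names as documented at time of writing
                    "metricNames": ["Builtin.Accuracy", "Builtin.Toxicity"],
                }
            ]
        }
    },
    inferenceConfig={
        "models": [
            {
                "bedrockModel": {
                    # Placeholder model; use any model ID available in your account/region
                    "modelIdentifier": "anthropic.claude-3-sonnet-20240229-v1:0",
                }
            }
        ]
    },
    outputDataConfig={"s3Uri": "s3://my-bucket/eval/results/"},
)
print(response["jobArn"])  # per-entry results land in the configured S3 output location
```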
Dataset Requirements
See here for the latest information.
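As a rough illustration while the linked documentation remains authoritative: automatic evaluation jobs read a JSON Lines prompt dataset from S3. The field names below (prompt, referenceResponse, category) follow the custom-dataset format documented at the time of writing and may change.

```python
import json

# Illustrative only: each line of the dataset is one JSON object.
rows = [
    {
        "prompt": "What is the capital of France?",
        "referenceResponse": "Paris",  # ground truth used by metrics such as accuracy
        "category": "geography",       # optional key used to break down results
    },
]

with open("prompts.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```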
Azure AI Studio
Note that Azure AI Studio is different from Azure OpenAI Studio.
AI Studio provides an evaluation service to measure how well your model performs, primarily on Q&A use cases.
Setting up an evaluation task
Step 1: Select the type of evaluation task
Step 2: Select the metrics to measure
Step 3: Upload dataset and map columns
Step 4: Run the evaluation
Step 5: View results
For evaluation of Q&A without context, the tool provides an assessment of how well the model's answers score on relevance and groundedness against the source data.
For evaluation of Q&A with context, the tool also provides a summary of the metric scores obtained.
There is also the option of setting up a manual evaluation task, where users evaluate each entry by hand and give it a thumbs up or thumbs down.
Dataset Requirements
At a minimum, the dataset should contain question, answer, and ground truth columns. A context column is also required, depending on the evaluation task selected.
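For illustration, here is a minimal script that writes such a dataset. The column names (question, answer, context, ground_truth) are typical placeholders that you map to the evaluation fields in Step 3, and CSV is assumed to be an accepted upload format (JSON Lines is a common alternative); adjust the names and format to match your actual data.

```python
import csv

# Illustrative only: one row per test case, with the columns described above.
rows = [
    {
        "question": "What is the capital of France?",
        "answer": "The capital of France is Paris.",
        "context": "France's capital and most populous city is Paris.",  # needed only for tasks with context
        "ground_truth": "Paris",
    },
]

with open("qa_eval.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "answer", "context", "ground_truth"])
    writer.writeheader()
    writer.writerows(rows)
```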