
Evaluate code recommendations

Once you have generated alternative code recommendations for your selected code snippets, you can use Artemis to evaluate their quality. Artemis offers code scoring and code validation options, both of which are discussed below.

To evaluate recommendations, first make sure that you are in the Snippets section of your selected code project and that you have generated at least one recommendation for your codebase. See image below:

Snippets overview

To start scoring, click the "balance" icon, either next to a given code snippet or at the top of the Recommendation Scores column.

About icons on Artemis

Across the platform, the "balance" icon represents the scoring options. See image:

Scorer icon

Clicking the balance icon opens a list of LLMs and scoring criteria. See image for details:

Scoring options

Use the checkboxes under Scoring to select which LLMs you would like to use to generate the scores.

Be mindful of LLM costs

Using certain LLMs incurs a cost. Check the Estimated Cost item at the bottom of the Scoring panel so that you know how much the scoring task will cost to run.
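As a rough mental model only (the actual figure comes from the Estimated Cost item, and the prices and counts below are made up), the cost of a scoring run generally scales with how many models, snippets, and score prompts you select:

```python
# Rough, hypothetical cost model for illustration; Artemis computes the
# real Estimated Cost for you. Prices and counts below are invented.
price_per_call = {"model_a": 0.002, "model_b": 0.010}  # USD per scoring call

num_snippets = 12      # snippets selected for scoring
num_score_prompts = 6  # e.g. the six built-in scores

estimated_cost = sum(
    price * num_snippets * num_score_prompts
    for price in price_per_call.values()
)
print(f"Estimated cost: ${estimated_cost:.2f}")  # Estimated cost: $0.86
```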

Use the checkboxes under Built-in Score to select the criteria by which you would like to score your code. You can also define custom scoring metrics by clicking the Add button next to Custom Score.

Set up your own Custom Score by specifying the Task, Subtask, Prompt Name, and Prompt Text in the fields provided.

See the example below:

custom prompt
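As a purely hypothetical illustration of these fields (none of the values below are Artemis defaults), a custom score definition is essentially a named prompt attached to a task and subtask:

```python
# Hypothetical example of the information entered in the Custom Score form.
custom_score = {
    "task": "Code recommendation scoring",  # Task
    "subtask": "Style review",              # Subtask
    "prompt_name": "Idiomatic style",       # Prompt Name
    "prompt_text": (                        # Prompt Text
        "Rate from 1 to 10 how idiomatic the recommended code is compared "
        "with the original snippet, and briefly justify the rating."
    ),
}
```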

Once you click Save, the prompt will appear under Custom Scores.

Once you have selected your preferred models and scores, click Score under Scoring.

When the scoring task is complete, your scores will appear under the Recommendation Scores tab.

About icons on Artemis

Across your scoring tasks, if you define a custom prompt, it will appear with a "human figure outline" icon.

Interpreting the scores

Artemis provides six default scores, along with their average. Here is how you can interpret them:

Performance: Measures how well the code recommendation improves the predicted performance of the code. Runtime and memory usage are among the criteria considered, but they are not the only ones.
Quality: Measures how well the suggested code improves the quality, readability, and maintainability of the code.
Compatibility: Checks whether the recommended change introduces backward compatibility issues.
Security: Checks whether the code changes are secure and avoid introducing security issues.
Equivalence: Checks whether the changes preserve the behavior of the code, and whether the general logic of the suggested code is the same as that of the original.
Value: Measures how much value the suggested code snippet brings to the project.
Average Score: The average of the six metrics above. If you selected multiple LLMs for the scoring task, an Average Score is shown for each selected model.
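As a minimal sketch of the arithmetic (the metric values below are invented for illustration), the Average Score for a single LLM is simply the mean of its six metric scores:

```python
# Six built-in metric scores from one LLM for one recommendation
# (illustrative numbers only).
scores = {
    "Performance": 8.0,
    "Quality": 7.5,
    "Compatibility": 9.0,
    "Security": 8.5,
    "Equivalence": 9.0,
    "Value": 7.0,
}

average_score = sum(scores.values()) / len(scores)
print(f"Average Score: {average_score:.2f}")  # Average Score: 8.17
```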

Click the tile of each score in Artemis to see an explanation of why that score was assigned to the snippet.

High scores indicate that the recommended code may be better than your original code snippets.

Artemis provides three options to validate recommended code snippets: Compile checks whether the recommended code will compile, Test runs unit tests to check functionality, and Benchmark measures performance. To run these validations, click the "play" icon next to Validations, select the checks you want, and click Run to start the validation process. See image below:

Validate options
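Conceptually, the three validations correspond to checks most developers already run by hand. The Python sketch below is an analogy only, not how Artemis performs validation internally:

```python
import timeit

# A recommended snippet, as a string, for the sake of the example.
recommended_code = "def add(a, b):\n    return a + b\n"

# Compile: does the recommended code compile at all?
compiled = compile(recommended_code, "<recommendation>", "exec")

# Test: does it still behave as the unit tests expect?
namespace = {}
exec(compiled, namespace)
assert namespace["add"](2, 3) == 5

# Benchmark: how does it perform?
runtime = timeit.timeit(lambda: namespace["add"](2, 3), number=100_000)
print(f"Compile: ok | Test: ok | Benchmark: {runtime:.4f}s for 100k calls")
```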

Once the validation process is complete, the status of each selected test will be displayed under Validations. See image below:

Validation status

Based on the code scores and validation outcomes, you can decide which of the recommended snippets to incorporate into your codebase.