
feat: machine-readable eval report for LLM optimization agents #112

Open

StephaneWamba wants to merge 4 commits into CodSpeedHQ:master from StephaneWamba:feat/eval-report-output

Conversation

@StephaneWamba

This builds on #111 (output hash / correctness check) to close the loop for automated optimization agents.

The problem with #111 alone: the agent gets human-readable terminal output but still needs to parse unstructured text and know the Python API to make a binary accept/reject decision.

What this adds:

--codspeed-eval-report=eval.json -- writes a machine-readable JSON file after each comparison run:

{
  "aggregate_score": 0.33,
  "is_acceptable": true,
  "benchmarks": [
    {
      "name": "tests/test_sort.py::test_sort",
      "perf_gain": 0.33,
      "output_changed": false,
      "score": 0.33
    }
  ]
}

The agent reads this file and makes a decision. No Python API knowledge needed; it works from any language or shell script.
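The consumer side can be as small as this sketch. The file name and keys come from the example above; `read_verdict` is a hypothetical helper, not part of the plugin:

```python
import json


def read_verdict(path="eval.json"):
    """Read the report written by --codspeed-eval-report and return
    the binary accept/reject decision. Hypothetical consumer code,
    not part of the plugin itself."""
    with open(path) as f:
        report = json.load(f)
    return bool(report["is_acceptable"])
```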

Scoring logic (also in #111):

  • aggregate_score is the minimum score across all benchmarks -- conservative by design, one correctness failure vetoes the whole suggestion
  • score = perf_gain when output is correct, 0.0 when broken, null when unknown (no --codspeed-capture-output)
  • is_acceptable = aggregate_score > 0.0 and not nan
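The three rules above can be mirrored in a few lines. These helper names are illustrative, not the actual eval_harness.py API:

```python
import math


def benchmark_score(perf_gain, output_changed):
    if output_changed is None:   # --codspeed-capture-output not enabled
        return math.nan
    if output_changed:           # output broke: hard veto
        return 0.0
    return perf_gain             # correct output: score is the perf gain


def aggregate_score(scores):
    # Conservative minimum across benchmarks: any unknown score makes
    # the aggregate unknown; any broken benchmark drags it to 0.0.
    if any(math.isnan(s) for s in scores):
        return math.nan
    return min(scores)


def is_acceptable(agg):
    return not math.isnan(agg) and agg > 0.0
```

Note the explicit nan check before `min()`: Python's `min()` silently propagates or skips nan depending on argument order, so relying on it would make the "unknown vetoes the aggregate" rule order-dependent.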

Demo:

examples/optimize_loop.py shows the full loop: baseline run -> apply patch -> rerun with --codspeed-eval-report -> read JSON -> print verdict. Runs standalone with python examples/optimize_loop.py.
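The shape of that loop, sketched with injected callables (the real script may differ; only the flag and the report keys come from this PR):

```python
import json
from pathlib import Path


def optimize_once(run_bench, apply_patch, report_path="eval.json"):
    """One iteration of the demoed loop: baseline run -> apply patch ->
    rerun with the report flag -> read JSON -> verdict. run_bench and
    apply_patch are injected callables; in practice run_bench would
    shell out to `pytest --codspeed ...` via subprocess."""
    run_bench([])                                         # baseline run
    apply_patch()                                         # candidate optimization
    run_bench([f"--codspeed-eval-report={report_path}"])  # comparison run
    report = json.loads(Path(report_path).read_text())
    return "accept" if report["is_acceptable"] else "reject"
```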

Tests added:

  • test_eval_report_written_after_second_run -- file is created, has the right keys
  • test_eval_report_acceptable_when_output_stable -- output_changed: false for unchanged output
  • test_eval_report_not_acceptable_when_output_breaks -- aggregate_score: 0.0, is_acceptable: false
  • test_eval_report_not_written_on_first_run -- no baseline means no file
  • Unit tests for aggregate_score, is_acceptable, to_dict including nan->null serialization

Depends on #111.

The result files written by each run were only used for CI uploads.
This uses them locally too: on the second run, pytest-codspeed finds
the most recent prior .codspeed/results_*.json and prints a short
regression/improvement summary to the terminal.
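The lookup described above can be sketched roughly like this. Directory and glob pattern are taken from the text; the plugin's actual implementation may differ (in particular it would exclude the file the current run just wrote):

```python
from pathlib import Path


def most_recent_prior_result(results_dir=".codspeed"):
    """Return the newest results_*.json from a prior run, or None when
    there is no baseline yet (first run). A sketch of the lookup, not
    the plugin's exact code."""
    candidates = sorted(
        Path(results_dir).glob("results_*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None
```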

Skipped when --codspeed-profile-folder is set or in non-walltime modes.
Implements the TODO left in plugin.py.

Tests in test_comparison.py (unit) and test_comparison_integration.py
(pytester end-to-end).
Adds --codspeed-capture-output flag. When set, each walltime run hashes
the return value of the benchmarked function (pickle + sha256, repr
fallback) and stores it in the result JSON alongside mean_ns.

On the second run, the local comparison checks whether the hash changed.
If it did, the report flags the benchmark with "! output changed" and
counts correctness warnings in the footer.

This closes the gap in optimization loops where a suggestion improves
perf but silently alters the function's output. Score formula exposed
in eval_harness.py: score = perf_gain if output correct, 0 if broken,
nan if capture was not enabled.
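The hashing step as described (pickle + sha256, repr fallback) can be sketched as follows; this is illustrative, not the plugin's exact code:

```python
import hashlib
import pickle


def hash_output(value):
    """Hash a benchmarked function's return value: pickle + sha256,
    falling back to repr() for unpicklable objects. Note the repr
    fallback is only stable for types with deterministic reprs."""
    try:
        payload = pickle.dumps(value)
    except Exception:
        payload = repr(value).encode()
    return hashlib.sha256(payload).hexdigest()
```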
- Move _make_result/_bench helpers to conftest.py — single source of truth
- Add result.assert_outcomes(passed=1) to all 9 integration tests so a
  broken feature cannot hide behind a passing stdout check
- Fix test_no_comparison_with_profile_folder: CODSPEED_PROFILE_FOLDER is
  an env var, not a CLI flag; use monkeypatch.setenv instead
- Ruff TC003/I001 fixes (Path in TYPE_CHECKING block, import sort)
- EvalReport.aggregate_score: conservative min-score across all
  benchmarks; 0.0 if any correctness broken, nan if any unknown
- EvalReport.is_acceptable: single bool for binary accept/reject
- EvalReport.to_dict(): JSON-serializable dict (nan -> null)
- --codspeed-eval-report=PATH: write the eval report as JSON after
  each comparison run, so automated agents need no Python API knowledge
- examples/optimize_loop.py: runnable demo of the full baseline ->
  patch -> rerun -> read verdict loop
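The nan -> null detail matters because `json.dumps` emits the non-standard token `NaN` by default, which many JSON parsers reject. A minimal sketch of the serialization rule, assuming the to_dict behavior described above:

```python
import json
import math


def nan_to_null(score):
    """Serialize a per-benchmark score for the eval report: JSON has
    no NaN, so an unknown score becomes null (None)."""
    return None if isinstance(score, float) and math.isnan(score) else score


print(json.dumps({"score": nan_to_null(math.nan)}))  # {"score": null}
```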