
feat: machine-readable eval report for LLM optimization agents #112

Open

StephaneWamba wants to merge 4 commits into CodSpeedHQ:master from StephaneWamba:feat/eval-report-output

Conversation

@StephaneWamba

This builds on #111 (output hash / correctness check) to close the loop for automated optimization agents.

The problem with #111 alone: the agent gets human-readable terminal output but still needs to parse unstructured text and know the Python API to make a binary accept/reject decision.

What this adds:

--codspeed-eval-report=eval.json -- writes a machine-readable JSON file after each comparison run:

{
  "aggregate_score": 0.33,
  "is_acceptable": true,
  "benchmarks": [
    {
      "name": "tests/test_sort.py::test_sort",
      "perf_gain": 0.33,
      "output_changed": false,
      "score": 0.33
    }
  ]
}

The agent reads this file and makes a decision. No Python API knowledge needed; it works from any language or shell script.
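The consumer side can be as small as this sketch. The file name and keys come from the example above; `read_verdict` is a hypothetical helper, not part of the plugin:

```python
import json


def read_verdict(path="eval.json"):
    """Read the report written by --codspeed-eval-report and return
    the binary accept/reject decision. Hypothetical consumer code,
    not part of the plugin itself."""
    with open(path) as f:
        report = json.load(f)
    return bool(report["is_acceptable"])
```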

Scoring logic (also in #111):

  • aggregate_score is the minimum score across all benchmarks -- conservative by design, one correctness failure vetoes the whole suggestion
  • score = perf_gain when output is correct, 0.0 when broken, null when unknown (no --codspeed-capture-output)
  • is_acceptable = aggregate_score > 0.0 and not nan
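The three rules above can be mirrored in a few lines. These helper names are illustrative, not the actual eval_harness.py API:

```python
import math


def benchmark_score(perf_gain, output_changed):
    if output_changed is None:   # --codspeed-capture-output not enabled
        return math.nan
    if output_changed:           # output broke: hard veto
        return 0.0
    return perf_gain             # correct output: score is the perf gain


def aggregate_score(scores):
    # Conservative minimum across benchmarks: any unknown score makes
    # the aggregate unknown; any broken benchmark drags it to 0.0.
    if any(math.isnan(s) for s in scores):
        return math.nan
    return min(scores)


def is_acceptable(agg):
    return not math.isnan(agg) and agg > 0.0
```

Note the explicit nan check before `min()`: Python's `min()` silently propagates or skips nan depending on argument order, so relying on it would make the "unknown vetoes the aggregate" rule order-dependent.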

Demo:

examples/optimize_loop.py shows the full loop: baseline run -> apply patch -> rerun with --codspeed-eval-report -> read JSON -> print verdict. Runs standalone with python examples/optimize_loop.py.
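The shape of that loop, sketched with injected callables (the real script may differ; only the flag and the report keys come from this PR):

```python
import json
from pathlib import Path


def optimize_once(run_bench, apply_patch, report_path="eval.json"):
    """One iteration of the demoed loop: baseline run -> apply patch ->
    rerun with the report flag -> read JSON -> verdict. run_bench and
    apply_patch are injected callables; in practice run_bench would
    shell out to `pytest --codspeed ...` via subprocess."""
    run_bench([])                                         # baseline run
    apply_patch()                                         # candidate optimization
    run_bench([f"--codspeed-eval-report={report_path}"])  # comparison run
    report = json.loads(Path(report_path).read_text())
    return "accept" if report["is_acceptable"] else "reject"
```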

Tests added:

  • test_eval_report_written_after_second_run -- file is created, has the right keys
  • test_eval_report_acceptable_when_output_stable -- output_changed: false for unchanged output
  • test_eval_report_not_acceptable_when_output_breaks -- aggregate_score: 0.0, is_acceptable: false
  • test_eval_report_not_written_on_first_run -- no baseline means no file
  • Unit tests for aggregate_score, is_acceptable, to_dict including nan->null serialization

Depends on #111.

The result files written by each run were only used for CI uploads.
This uses them locally too: on the second run, pytest-codspeed finds
the most recent prior .codspeed/results_*.json and prints a short
regression/improvement summary to the terminal.
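The lookup described above can be sketched roughly like this. Directory and glob pattern are taken from the text; the plugin's actual implementation may differ (in particular it would exclude the file the current run just wrote):

```python
from pathlib import Path


def most_recent_prior_result(results_dir=".codspeed"):
    """Return the newest results_*.json from a prior run, or None when
    there is no baseline yet (first run). A sketch of the lookup, not
    the plugin's exact code."""
    candidates = sorted(
        Path(results_dir).glob("results_*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None
```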

Skipped when --codspeed-profile-folder is set or in non-walltime modes.
Implements the TODO left in plugin.py.

Tests in test_comparison.py (unit) and test_comparison_integration.py
(pytester end-to-end).
Adds --codspeed-capture-output flag. When set, each walltime run hashes
the return value of the benchmarked function (pickle + sha256, repr
fallback) and stores it in the result JSON alongside mean_ns.

On the second run, the local comparison checks whether the hash changed.
If it did, the report flags the benchmark with "! output changed" and
counts correctness warnings in the footer.

This closes the gap in optimization loops where a suggestion improves
perf but silently alters the function's output. Score formula exposed
in eval_harness.py: score = perf_gain if output correct, 0 if broken,
nan if capture was not enabled.
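The hashing step as described (pickle + sha256, repr fallback) can be sketched as follows; this is illustrative, not the plugin's exact code:

```python
import hashlib
import pickle


def hash_output(value):
    """Hash a benchmarked function's return value: pickle + sha256,
    falling back to repr() for unpicklable objects. Note the repr
    fallback is only stable for types with deterministic reprs."""
    try:
        payload = pickle.dumps(value)
    except Exception:
        payload = repr(value).encode()
    return hashlib.sha256(payload).hexdigest()
```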
- Move _make_result/_bench helpers to conftest.py — single source of truth
- Add result.assert_outcomes(passed=1) to all 9 integration tests so a
  broken feature cannot hide behind a passing stdout check
- Fix test_no_comparison_with_profile_folder: CODSPEED_PROFILE_FOLDER is
  an env var, not a CLI flag; use monkeypatch.setenv instead
- Ruff TC003/I001 fixes (Path in TYPE_CHECKING block, import sort)
- EvalReport.aggregate_score: conservative min-score across all
  benchmarks; 0.0 if any correctness broken, nan if any unknown
- EvalReport.is_acceptable: single bool for binary accept/reject
- EvalReport.to_dict(): JSON-serializable dict (nan -> null)
- --codspeed-eval-report=PATH: write the eval report as JSON after
  each comparison run, so automated agents need no Python API knowledge
- examples/optimize_loop.py: runnable demo of the full baseline ->
  patch -> rerun -> read verdict loop
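The nan -> null detail matters because `json.dumps` emits the non-standard token `NaN` by default, which many JSON parsers reject. A minimal sketch of the serialization rule, assuming the to_dict behavior described above:

```python
import json
import math


def nan_to_null(score):
    """Serialize a per-benchmark score for the eval report: JSON has
    no NaN, so an unknown score becomes null (None)."""
    return None if isinstance(score, float) and math.isnan(score) else score


print(json.dumps({"score": nan_to_null(math.nan)}))  # {"score": null}
```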