# Human Review Checkpoint
Human review sits between automated scoring and the next iteration. Automated evaluators catch regressions and enforce thresholds, but a human reviewer spots score-behavior mismatches, qualitative regressions, and cases where a grader is too strict or too lenient.
## When to review

Review after every eval run where you plan to iterate on the skill or agent. The workflow:
- Run evals — `agentv eval EVAL.yaml` or `agentv eval evals.json`
- Inspect results — open the HTML report or scan the results JSONL
- Write feedback — create `feedback.json` alongside the results
- Iterate — use the feedback to guide prompt changes, evaluator tuning, or test case additions
- Re-run — verify improvements in the next eval run
Skip the review step for routine CI gate runs where you only need pass/fail.
## What to look for

| Signal | Example |
|---|---|
| Score-behavior mismatch | A test scores 0.9 but the output is clearly wrong — the grader missed an error |
| False positive | A contains check passes on a coincidental substring match |
| False negative | An LLM grader penalizes a correct answer that uses different phrasing |
| Qualitative regression | Scores stay the same but tone, formatting, or helpfulness degrades |
| Evaluator miscalibration | A code grader is too strict on whitespace; a rubric is too lenient on accuracy |
| Flaky results | The same test produces wildly different scores across runs |
## How to review

### Inspect results

For workspace evaluations (`EVAL.yaml`), use the trace viewer:
```sh
# View traces from a specific run
agentv trace show results/2026-03-14T10-32-00_claude/traces.jsonl

# View the HTML report (if generated via #562)
open results/2026-03-14T10-32-00_claude/report.html
```

For simple skill evaluations (`evals.json`), scan the results JSONL:
```sh
# Show failing tests
cat results/output.jsonl | jq 'select(.score < 0.8)'

# Show all scores
cat results/output.jsonl | jq '{id: .testId, score: .score, verdict: .verdict}'
```

### Write feedback

Create a `feedback.json` file in the results directory, alongside `results.jsonl` or `output.jsonl`:
```
results/
  2026-03-14T10-32-00_claude/
    results.jsonl    # automated eval results
    traces.jsonl     # execution traces
    feedback.json    # ← your review annotations
```

## Feedback artifact schema

The `feedback.json` file is a structured annotation of a single eval run. It records the reviewer's qualitative assessment alongside the automated scores.
```json
{
  "run_id": "2026-03-14T10-32-00_claude",
  "reviewer": "engineer-name",
  "timestamp": "2026-03-14T12:00:00Z",
  "overall_notes": "Retrieval tests need more diverse queries. Code grader for format-check is too strict on trailing newlines.",
  "per_case": [
    {
      "test_id": "test-feature-alpha",
      "verdict": "acceptable",
      "notes": "Score is borderline (0.72) but behavior is correct — the grader penalized for different phrasing."
    },
    {
      "test_id": "test-retrieval-basic",
      "verdict": "needs_improvement",
      "notes": "Missing coverage of multi-document queries.",
      "evaluator_overrides": {
        "code-grader:format-check": "Too strict — penalized valid output with trailing newline",
        "llm-grader:quality": "Score 0.6 seems fair, answer was incomplete"
      },
      "workspace_notes": "Workspace had stale cached files from previous run — may have affected retrieval results."
    },
    {
      "test_id": "test-edge-case-empty",
      "verdict": "flaky",
      "notes": "Passed on 2 of 3 runs. Likely non-determinism in the agent's tool selection."
    }
  ]
}
```

## Field reference

| Field | Type | Required | Description |
|---|---|---|---|
| `run_id` | string | yes | Identifies the eval run (matches the results directory name or run identifier) |
| `reviewer` | string | yes | Who performed the review |
| `timestamp` | string (ISO 8601) | yes | When the review was completed |
| `overall_notes` | string | no | High-level observations about the run |
| `per_case` | array | no | Per-test-case annotations |
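Because `feedback.json` is written by hand, a quick `jq` check can catch a missing required field before the file is committed. This is a minimal sketch, not part of the `agentv` tooling; the sample file content and the `/tmp` path are hypothetical:

```sh
# Write a minimal sample file (hypothetical content and path)
cat > /tmp/feedback-check.json <<'EOF'
{ "run_id": "2026-03-14T10-32-00_claude", "reviewer": "engineer-name", "timestamp": "2026-03-14T12:00:00Z" }
EOF

# jq -e exits non-zero if the filter result is false, so this
# fails the pipeline when any required field is missing
jq -e 'has("run_id") and has("reviewer") and has("timestamp")' /tmp/feedback-check.json \
  && echo "required fields present"
```

The `-e` flag makes the check usable as a shell gate, e.g. in a pre-commit hook.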
### Per-case fields

| Field | Type | Required | Description |
|---|---|---|---|
| `test_id` | string | yes | Matches the test id from the eval file |
| `verdict` | enum | yes | One of: `acceptable`, `needs_improvement`, `incorrect`, `flaky` |
| `notes` | string | no | Free-form reviewer notes |
| `evaluator_overrides` | object | no | Keyed by evaluator name — reviewer annotations on specific evaluator results |
| `workspace_notes` | string | no | Notes about workspace state (relevant for workspace evaluations) |
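Since `verdict` is a closed enum, a jq filter can flag entries that use a value outside the four allowed ones. A minimal sketch with a hypothetical sample file (the second entry is deliberately invalid):

```sh
# Hypothetical sample: one valid verdict, one invalid
cat > /tmp/feedback-percase.json <<'EOF'
{ "per_case": [
  { "test_id": "test-feature-alpha", "verdict": "acceptable" },
  { "test_id": "test-retrieval-basic", "verdict": "maybe-ok" }
] }
EOF

# Print test_ids whose verdict is not one of the allowed values
jq -r '.per_case[]
       | select(.verdict | IN("acceptable","needs_improvement","incorrect","flaky") | not)
       | .test_id' /tmp/feedback-percase.json
# → test-retrieval-basic
```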
### Verdict values

| Verdict | Meaning |
|---|---|
| `acceptable` | Automated score and actual behavior are both satisfactory |
| `needs_improvement` | The output or coverage needs work — not a bug, but not good enough |
| `incorrect` | The output is wrong, regardless of what the automated score says |
| `flaky` | Results are inconsistent across runs — investigate non-determinism |
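A verdict tally gives a one-line summary of a review, which is handy when comparing iterations. A minimal jq sketch over a hypothetical sample file:

```sh
# Hypothetical sample feedback file
cat > /tmp/feedback-tally.json <<'EOF'
{ "per_case": [
  { "test_id": "a", "verdict": "acceptable" },
  { "test_id": "b", "verdict": "flaky" },
  { "test_id": "c", "verdict": "acceptable" }
] }
EOF

# Count per-case entries by verdict
jq -c '[.per_case[].verdict] | group_by(.) | map({(.[0]): length}) | add' /tmp/feedback-tally.json
# → {"acceptable":2,"flaky":1}
```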
### Evaluator overrides (workspace evaluations)

For workspace evaluations with multiple evaluators (code graders, LLM graders, tool trajectory checks), the `evaluator_overrides` field lets the reviewer annotate specific evaluator results:
```json
{
  "test_id": "test-refactor-api",
  "verdict": "needs_improvement",
  "evaluator_overrides": {
    "code-grader:test-pass": "Tests pass but the refactored code has a subtle race condition the tests don't cover",
    "llm-grader:quality": "Score 0.9 is too high — the agent left dead code behind",
    "tool-trajectory:efficiency": "Used 12 tool calls where 5 would suffice, but the result is correct"
  },
  "workspace_notes": "Agent cloned the repo correctly but didn't clean up temp files."
}
```

Keys use the format `evaluator-type:evaluator-name` to match the evaluators defined in `assert` blocks.
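When tuning evaluators, it helps to list which ones reviewers flagged. A minimal jq sketch over a hypothetical per-case entry (file path and contents are examples, not `agentv` output):

```sh
# Hypothetical per-case entry with two overrides
cat > /tmp/feedback-case.json <<'EOF'
{ "evaluator_overrides": {
  "llm-grader:quality": "Score 0.9 is too high",
  "code-grader:test-pass": "Tests miss a race condition"
} }
EOF

# jq's keys builtin sorts alphabetically, so the list is stable
jq -r '.evaluator_overrides | keys[]' /tmp/feedback-case.json
# → code-grader:test-pass
# → llm-grader:quality
```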
## Storing feedback across iterations

Keep feedback files alongside results to build a history of review decisions:
```
results/
  2026-03-12T09-00-00_claude/
    results.jsonl
    feedback.json    # first iteration review
  2026-03-14T10-32-00_claude/
    results.jsonl
    feedback.json    # second iteration review
  2026-03-15T16-00-00_claude/
    results.jsonl
    feedback.json    # third iteration review
```

This creates a traceable record of what changed between iterations and why. When debugging a regression, check previous `feedback.json` files to see if the issue was noted before.
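Once several iterations exist, a small shell loop can trace how the verdict for one test evolved across runs. A minimal sketch using a hypothetical two-run history (the `/tmp/results` layout mirrors the tree above):

```sh
# Hypothetical two-iteration history
mkdir -p /tmp/results/run-1 /tmp/results/run-2
echo '{ "per_case": [ { "test_id": "test-retrieval-basic", "verdict": "needs_improvement" } ] }' \
  > /tmp/results/run-1/feedback.json
echo '{ "per_case": [ { "test_id": "test-retrieval-basic", "verdict": "acceptable" } ] }' \
  > /tmp/results/run-2/feedback.json

# One line per run: the verdict recorded for a single test
for f in /tmp/results/*/feedback.json; do
  jq -r --arg id "test-retrieval-basic" \
    '.per_case[] | select(.test_id == $id) | "\(input_filename): \(.verdict)"' "$f"
done
# → /tmp/results/run-1/feedback.json: needs_improvement
# → /tmp/results/run-2/feedback.json: acceptable
```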
## Integration with eval workflow

The review checkpoint fits into the broader eval iteration loop:
```
Define tests (EVAL.yaml / evals.json)
        ↓
Run automated evals
        ↓
Review results          ← you are here
        ↓
Write feedback.json
        ↓
Tune prompts / evaluators / test cases
        ↓
Re-run evals
        ↓
Compare with previous run (agentv compare)
        ↓
Review again (if iterating)
```

Use `agentv compare` to quantify changes between runs, then review the diff to confirm that score improvements reflect genuine behavioral improvements.