# Human Review Checkpoint
Human review sits between automated scoring and the next iteration. Automated evaluators catch regressions and enforce thresholds, but a human reviewer spots score-behavior mismatches, qualitative regressions, and cases where a grader is too strict or too lenient.
## When to review

Review after every eval run where you plan to iterate on the skill or agent. The workflow:
- Run evals — `agentv eval EVAL.yaml` or `agentv eval evals.json`
- Inspect results — open the HTML report or scan the results JSONL
- Write feedback — create `feedback.json` alongside the results
- Iterate — use the feedback to guide prompt changes, evaluator tuning, or test case additions
- Re-run — verify improvements in the next eval run
Skip the review step for routine CI gate runs where you only need pass/fail.
## What to look for

| Signal | Example |
|---|---|
| Score-behavior mismatch | A test scores 0.9 but the output is clearly wrong — the grader missed an error |
| False positive | A contains check passes on a coincidental substring match |
| False negative | An LLM grader penalizes a correct answer that uses different phrasing |
| Qualitative regression | Scores stay the same but tone, formatting, or helpfulness degrades |
| Evaluator miscalibration | A code grader is too strict on whitespace; a rubric is too lenient on accuracy |
| Flaky results | The same test produces wildly different scores across runs |
## How to review

### Inspect results

For workspace evaluations (`EVAL.yaml`), use the trace viewer:
```sh
# View traces from a specific run
agentv trace show results/2026-03-14T10-32-00_claude/traces.jsonl

# View the HTML report (if generated via #562)
open results/2026-03-14T10-32-00_claude/report.html
```

For simple skill evaluations (`evals.json`), scan the results JSONL:
```sh
# Show failing tests
cat results/output.jsonl | jq 'select(.score < 0.8)'

# Show all scores
cat results/output.jsonl | jq '{id: .testId, score: .score, verdict: .verdict}'
```

### Write feedback

Create a `feedback.json` file in the results directory, alongside `results.jsonl` or `output.jsonl`:
```
results/
  2026-03-14T10-32-00_claude/
    results.jsonl    # automated eval results
    traces.jsonl     # execution traces
    feedback.json    # ← your review annotations
```

## Feedback artifact schema

The `feedback.json` file is a structured annotation of a single eval run. It records the reviewer's qualitative assessment alongside the automated scores.
```json
{
  "run_id": "2026-03-14T10-32-00_claude",
  "reviewer": "engineer-name",
  "timestamp": "2026-03-14T12:00:00Z",
  "overall_notes": "Retrieval tests need more diverse queries. Code grader for format-check is too strict on trailing newlines.",
  "per_case": [
    {
      "test_id": "test-feature-alpha",
      "verdict": "acceptable",
      "notes": "Score is borderline (0.72) but behavior is correct — the grader penalized for different phrasing."
    },
    {
      "test_id": "test-retrieval-basic",
      "verdict": "needs_improvement",
      "notes": "Missing coverage of multi-document queries.",
      "evaluator_overrides": {
        "code-grader:format-check": "Too strict — penalized valid output with trailing newline",
        "llm-grader:quality": "Score 0.6 seems fair, answer was incomplete"
      },
      "workspace_notes": "Workspace had stale cached files from previous run — may have affected retrieval results."
    },
    {
      "test_id": "test-edge-case-empty",
      "verdict": "flaky",
      "notes": "Passed on 2 of 3 runs. Likely non-determinism in the agent's tool selection."
    }
  ]
}
```

## Field reference

| Field | Type | Required | Description |
|---|---|---|---|
| `run_id` | string | yes | Identifies the eval run (matches the results directory name or run identifier) |
| `reviewer` | string | yes | Who performed the review |
| `timestamp` | string (ISO 8601) | yes | When the review was completed |
| `overall_notes` | string | no | High-level observations about the run |
| `per_case` | array | no | Per-test-case annotations |
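Because `feedback.json` is written by hand, a quick `jq` check can catch a missing required field before the file is committed. This is a minimal sketch, not part of the `agentv` tooling; the sample file content and the `/tmp` path are hypothetical:

```sh
# Write a minimal sample file (hypothetical content and path)
cat > /tmp/feedback-check.json <<'EOF'
{ "run_id": "2026-03-14T10-32-00_claude", "reviewer": "engineer-name", "timestamp": "2026-03-14T12:00:00Z" }
EOF

# jq -e exits non-zero if the filter result is false, so this
# fails the pipeline when any required field is missing
jq -e 'has("run_id") and has("reviewer") and has("timestamp")' /tmp/feedback-check.json \
  && echo "required fields present"
```

The `-e` flag makes the check usable as a shell gate, e.g. in a pre-commit hook.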
### Per-case fields

| Field | Type | Required | Description |
|---|---|---|---|
| `test_id` | string | yes | Matches the test id from the eval file |
| `verdict` | enum | yes | One of: `acceptable`, `needs_improvement`, `incorrect`, `flaky` |
| `notes` | string | no | Free-form reviewer notes |
| `evaluator_overrides` | object | no | Keyed by evaluator name — reviewer annotations on specific evaluator results |
| `workspace_notes` | string | no | Notes about workspace state (relevant for workspace evaluations) |
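Since `verdict` is a closed enum, a jq filter can flag entries that use a value outside the four allowed ones. A minimal sketch with a hypothetical sample file (the second entry is deliberately invalid):

```sh
# Hypothetical sample: one valid verdict, one invalid
cat > /tmp/feedback-percase.json <<'EOF'
{ "per_case": [
  { "test_id": "test-feature-alpha", "verdict": "acceptable" },
  { "test_id": "test-retrieval-basic", "verdict": "maybe-ok" }
] }
EOF

# Print test_ids whose verdict is not one of the allowed values
jq -r '.per_case[]
       | select(.verdict | IN("acceptable","needs_improvement","incorrect","flaky") | not)
       | .test_id' /tmp/feedback-percase.json
# → test-retrieval-basic
```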
### Verdict values

| Verdict | Meaning |
|---|---|
| `acceptable` | Automated score and actual behavior are both satisfactory |
| `needs_improvement` | The output or coverage needs work — not a bug, but not good enough |
| `incorrect` | The output is wrong, regardless of what the automated score says |
| `flaky` | Results are inconsistent across runs — investigate non-determinism |
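A verdict tally gives a one-line summary of a review, which is handy when comparing iterations. A minimal jq sketch over a hypothetical sample file:

```sh
# Hypothetical sample feedback file
cat > /tmp/feedback-tally.json <<'EOF'
{ "per_case": [
  { "test_id": "a", "verdict": "acceptable" },
  { "test_id": "b", "verdict": "flaky" },
  { "test_id": "c", "verdict": "acceptable" }
] }
EOF

# Count per-case entries by verdict
jq -c '[.per_case[].verdict] | group_by(.) | map({(.[0]): length}) | add' /tmp/feedback-tally.json
# → {"acceptable":2,"flaky":1}
```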
### Evaluator overrides (workspace evaluations)

For workspace evaluations with multiple evaluators (code graders, LLM graders, tool trajectory checks), the `evaluator_overrides` field lets the reviewer annotate specific evaluator results:
```json
{
  "test_id": "test-refactor-api",
  "verdict": "needs_improvement",
  "evaluator_overrides": {
    "code-grader:test-pass": "Tests pass but the refactored code has a subtle race condition the tests don't cover",
    "llm-grader:quality": "Score 0.9 is too high — the agent left dead code behind",
    "tool-trajectory:efficiency": "Used 12 tool calls where 5 would suffice, but the result is correct"
  },
  "workspace_notes": "Agent cloned the repo correctly but didn't clean up temp files."
}
```

Keys use the format `evaluator-type:evaluator-name` to match the evaluators defined in `assert` blocks.
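When tuning evaluators, it helps to list which ones reviewers flagged. A minimal jq sketch over a hypothetical per-case entry (file path and contents are examples, not `agentv` output):

```sh
# Hypothetical per-case entry with two overrides
cat > /tmp/feedback-case.json <<'EOF'
{ "evaluator_overrides": {
  "llm-grader:quality": "Score 0.9 is too high",
  "code-grader:test-pass": "Tests miss a race condition"
} }
EOF

# jq's keys builtin sorts alphabetically, so the list is stable
jq -r '.evaluator_overrides | keys[]' /tmp/feedback-case.json
# → code-grader:test-pass
# → llm-grader:quality
```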
## Storing feedback across iterations

Keep feedback files alongside results to build a history of review decisions:
```
results/
  2026-03-12T09-00-00_claude/
    results.jsonl
    feedback.json    # first iteration review
  2026-03-14T10-32-00_claude/
    results.jsonl
    feedback.json    # second iteration review
  2026-03-15T16-00-00_claude/
    results.jsonl
    feedback.json    # third iteration review
```

This creates a traceable record of what changed between iterations and why. When debugging a regression, check previous `feedback.json` files to see if the issue was noted before.
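Once several iterations exist, a small shell loop can trace how the verdict for one test evolved across runs. A minimal sketch using a hypothetical two-run history (the `/tmp/results` layout mirrors the tree above):

```sh
# Hypothetical two-iteration history
mkdir -p /tmp/results/run-1 /tmp/results/run-2
echo '{ "per_case": [ { "test_id": "test-retrieval-basic", "verdict": "needs_improvement" } ] }' \
  > /tmp/results/run-1/feedback.json
echo '{ "per_case": [ { "test_id": "test-retrieval-basic", "verdict": "acceptable" } ] }' \
  > /tmp/results/run-2/feedback.json

# One line per run: the verdict recorded for a single test
for f in /tmp/results/*/feedback.json; do
  jq -r --arg id "test-retrieval-basic" \
    '.per_case[] | select(.test_id == $id) | "\(input_filename): \(.verdict)"' "$f"
done
# → /tmp/results/run-1/feedback.json: needs_improvement
# → /tmp/results/run-2/feedback.json: acceptable
```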
## Integration with eval workflow

The review checkpoint fits into the broader eval iteration loop:
```
Define tests (EVAL.yaml / evals.json)
        ↓
Run automated evals
        ↓
Review results          ← you are here
        ↓
Write feedback.json
        ↓
Tune prompts / evaluators / test cases
        ↓
Re-run evals
        ↓
Compare with previous run (agentv compare)
        ↓
Review again (if iterating)
```

Use `agentv compare` to quantify changes between runs, then review the diff to confirm that score improvements reflect genuine behavioral improvements.