# Running Evaluations
## Run an Evaluation

```shell
agentv eval evals/my-eval.yaml
```

Results are written to `.agentv/results/eval_<timestamp>.jsonl`. Each line is a JSON object with one result per test case.
Each `scores[]` entry includes per-grader timing:

```json
{
  "scores": [
    {
      "name": "format_structure",
      "type": "llm-grader",
      "score": 0.9,
      "verdict": "pass",
      "hits": ["clear structure"],
      "misses": [],
      "duration_ms": 9103,
      "started_at": "2026-03-09T00:05:10.123Z",
      "ended_at": "2026-03-09T00:05:19.226Z",
      "token_usage": { "input": 2711, "output": 2535 }
    }
  ]
}
```

The `duration_ms`, `started_at`, and `ended_at` fields are present on every grader result (including code-grader), enabling per-grader bottleneck analysis.
## Common Options

### Override Target

Run against a different target than the one specified in the eval file:
```shell
agentv eval --target azure-base evals/**/*.yaml
```

### Run Specific Test

Run a single test by ID:
```shell
agentv eval --test-id case-123 evals/my-eval.yaml
```

### Dry Run

Test the harness flow with mock responses (does not call real providers):
```shell
agentv eval --dry-run evals/my-eval.yaml
```

### Output to Specific File

```shell
agentv eval evals/my-eval.yaml --out results/baseline.jsonl
```

## Trace Persistence

Export execution traces (tool calls, timing, spans) to files for debugging and analysis:
```shell
# Human-readable JSONL trace (one record per test case)
agentv eval evals/my-eval.yaml --trace-file traces/eval.jsonl

# OTLP JSON trace (importable by OTel backends like Jaeger, Grafana)
agentv eval evals/my-eval.yaml --otel-file traces/eval.otlp.json

# Both formats simultaneously
agentv eval evals/my-eval.yaml --trace-file traces/eval.jsonl --otel-file traces/eval.otlp.json
```

The `--trace-file` format writes JSONL records containing:

- `test_id` — the test identifier
- `target` / `score` — target and evaluation score
- `duration_ms` — total execution duration
- `spans` — array of tool invocations with timing
- `token_usage` / `cost_usd` — resource consumption
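An illustrative record with those fields (the values and the exact shape of each span entry are invented for this example):

```json
{
  "test_id": "case-123",
  "target": "azure-base",
  "score": 0.9,
  "duration_ms": 12345,
  "spans": [
    { "name": "execute_tool read_file", "duration_ms": 42 }
  ],
  "token_usage": { "input": 2711, "output": 2535 },
  "cost_usd": 0.012
}
```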
The `--otel-file` format writes standard OTLP JSON that can be imported into any OpenTelemetry-compatible backend.
## Live OTel Export

Stream traces directly to an observability backend during evaluation using `--export-otel`:

```shell
# Use a backend preset (braintrust, langfuse, confident)
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust

# Include message content and tool I/O in spans (disabled by default for privacy)
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-capture-content

# Group messages into turn spans for multi-turn evaluations
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-group-turns
```

### Braintrust
Set up your environment:

```shell
export BRAINTRUST_API_KEY=sk-...
export BRAINTRUST_PROJECT=my-project  # associates traces with a Braintrust project
```

Run an eval with traces sent to Braintrust:

```shell
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-capture-content
```

The following environment variables control project association (at least one is required):
| Variable | Format | Example |
|---|---|---|
| `BRAINTRUST_PROJECT` | Project name | `my-evals` |
| `BRAINTRUST_PROJECT_ID` | Project UUID | `proj_abc123` |
| `BRAINTRUST_PARENT` | Raw `x-bt-parent` header | `project_name:my-evals` |
Each eval test case produces a trace with:

- Root span (`agentv.eval`) — test ID, target, score, duration
- LLM call spans (`chat <model>`) — model name, token usage (input/output/cached)
- Tool call spans (`execute_tool <name>`) — tool name, arguments, results (with `--otel-capture-content`)
- Turn spans (`agentv.turn.N`) — groups messages by conversation turn (with `--otel-group-turns`)
- Evaluator events — per-grader scores attached to the root span
### Langfuse

```shell
export LANGFUSE_PUBLIC_KEY=pk-...
export LANGFUSE_SECRET_KEY=sk-...
# Optional: export LANGFUSE_HOST=https://cloud.langfuse.com
```

```shell
agentv eval evals/my-eval.yaml --export-otel --otel-backend langfuse --otel-capture-content
```

### Custom OTLP Endpoint
For backends not covered by presets, configure via environment variables:

```shell
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-backend/v1/traces
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer token"

agentv eval evals/my-eval.yaml --export-otel
```

## Workspace Modes and Finish Policy
Use a workspace mode and finish policies instead of multiple conflicting booleans:

```shell
# Mode: pooled | temp | static
agentv eval evals/my-eval.yaml --workspace-mode pooled

# Static mode path
agentv eval evals/my-eval.yaml --workspace-mode static --workspace-path /path/to/workspace

# Pooled reset policy override: standard | full (CLI override)
agentv eval evals/my-eval.yaml --workspace-clean full

# Finish policy overrides: keep | cleanup (CLI)
agentv eval evals/my-eval.yaml --retain-on-success cleanup --retain-on-failure keep
```

Equivalent eval YAML:

```yaml
workspace:
  mode: pooled      # pooled | temp | static
  path: null        # workspace path for mode=static; auto-materialised when empty/missing
  hooks:
    enabled: true   # set false to skip all hooks
  after_each:
    reset: fast     # none | fast | strict
```

Notes:
- Pooling is the default for shared workspaces with repos when `mode` is not specified.
- `mode: static` (or `--workspace-mode static`) uses `path`/`--workspace-path`. When the path is empty or missing, the workspace is auto-materialised (template copied + repos cloned). Populated directories are reused as-is.
- Static mode is incompatible with `isolation: per_test`.
- `hooks.enabled: false` skips all lifecycle hooks (setup, teardown, reset).
- Pool slots are managed separately (`agentv workspace list|clean`).
## Retry Execution Errors

Re-run only the tests that had infrastructure/execution errors in a previous output:

```shell
agentv eval evals/my-eval.yaml --retry-errors .agentv/results/eval_previous.jsonl
```

This reads the previous JSONL, filters for `executionStatus === 'execution_error'`, and re-runs only those test cases. Non-error results from the previous run are preserved and merged into the new output.
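A rough sketch of that partition step (not AgentV's actual code, just the behaviour described above):

```typescript
// Each line of the previous results JSONL parses to something like this.
interface PrevResult {
  test_id: string;
  executionStatus: string;
}

// Split a previous run into: error cases to re-run, and results to carry over.
function partitionForRetry(previous: PrevResult[]) {
  const retry = previous.filter((r) => r.executionStatus === 'execution_error');
  const preserved = previous.filter((r) => r.executionStatus !== 'execution_error');
  return { retry, preserved };
}

const prev: PrevResult[] = [
  { test_id: 'case-1', executionStatus: 'completed' },
  { test_id: 'case-2', executionStatus: 'execution_error' },
];
const { retry, preserved } = partitionForRetry(prev);
console.log(retry.map((r) => r.test_id));     // re-run: ['case-2']
console.log(preserved.map((r) => r.test_id)); // merged into the new output: ['case-1']
```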
## Execution Error Tolerance

Control whether the eval run halts on execution errors using `execution.fail_on_error` in the eval YAML:

```yaml
execution:
  fail_on_error: false  # never halt on errors (default)
  # fail_on_error: true # halt on first execution error
```

| Value | Behavior |
|---|---|
| `true` | Halt immediately on the first execution error |
| `false` | Continue despite errors (default) |
When halted, remaining tests are recorded with `failureReasonCode: 'error_threshold_exceeded'`. With concurrency > 1, a few additional tests may complete before the halt takes effect.
## Validate Before Running

Check eval files for schema errors without executing:

```shell
agentv validate evals/my-eval.yaml
```

## Run a Single Assertion

Run a code-grader assertion in isolation, without executing a full eval suite:

```shell
agentv eval assert <name> --agent-output <text> --agent-input <text>
```

The command discovers the assertion script by walking up directories looking for `.agentv/graders/<name>.{ts,js,mts,mjs}`, then passes the input via stdin and prints the result JSON to stdout.
```shell
# Run an assertion with inline arguments
agentv eval assert rouge-score \
  --agent-output "The fox jumps over the lazy dog" \
  --agent-input "Summarise the article"

# Or pass a JSON payload file
agentv eval assert rouge-score --file result.json
```

The `--file` option reads a JSON file with `{ "output": "...", "input": "..." }` fields.

Exit codes: `0` if score >= 0.5 (pass), `1` if score < 0.5 (fail).
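For orientation, a hypothetical grader script might look like the sketch below. The grader name (`keyword-overlap`), its scoring heuristic, and the exact result shape are invented for illustration; only the stdin/stdout contract and the 0.5 pass threshold come from the description above.

```typescript
// Hypothetical sketch of .agentv/graders/keyword-overlap.ts.
interface GraderPayload {
  output: string; // the agent's answer
  input: string;  // the original task text
}

// Toy heuristic: score = fraction of long input words that reappear in the output.
function grade(payload: GraderPayload) {
  const words = payload.input.toLowerCase().split(/\s+/).filter((w) => w.length > 3);
  const hits = words.filter((w) => payload.output.toLowerCase().includes(w));
  const score = words.length === 0 ? 1 : hits.length / words.length;
  return {
    score,
    verdict: score >= 0.5 ? 'pass' : 'fail',
    hits,
    misses: words.filter((w) => !hits.includes(w)),
  };
}

// A real grader would read the JSON payload from stdin, e.g.
//   const payload = JSON.parse(readFileSync(0, 'utf8'));
// Here we grade a sample payload inline for demonstration.
console.log(JSON.stringify(grade({
  output: 'The fox jumps over the lazy dog',
  input: 'Describe the lazy dog',
})));
```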
This is the same interface that agent-orchestrated evals use — the `EVAL.yaml` transpiler emits `assert` instructions for code graders so external grading agents can execute them directly.
## Agent-Orchestrated Evals

Run evaluations without API keys by letting an external agent (e.g., Claude Code, Copilot CLI) orchestrate the eval pipeline.

### Overview

```shell
agentv eval prompt eval --list evals/my-eval.yaml
```

Returns JSON listing the available `test_ids` for the eval file.
### Get Task Input

```shell
agentv eval prompt eval --input evals/my-eval.yaml --test-id case-123
```

Returns JSON with:

- `input` — `[{role, content}]` array. File references use absolute paths (`{type: "file", path: "/abs/path"}`) that the agent can read directly from the filesystem.
- `guideline_paths` — files containing additional instructions to prepend to the system message.
- `criteria` — grading criteria for the orchestrator's reference (do not pass to the candidate).
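An illustrative response (all values here are invented; the key names follow the list above):

```json
{
  "input": [
    { "role": "system", "content": "You are a helpful summariser." },
    { "role": "user", "content": [{ "type": "file", "path": "/abs/path/article.md" }] }
  ],
  "guideline_paths": ["/abs/path/.agentv/guidelines.md"],
  "criteria": "Summary captures the key points in one sentence."
}
```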
Get Grading Context
Section titled “Get Grading Context”agentv eval prompt eval --expected-output evals/my-eval.yaml --test-id case-123Returns JSON with the data an external grader needs:
expected_output— reference assistant messagesreference_answer— flattened reference text when availablecriteria— high-level success criteriaassertions— evaluator configs for the test
### Get Grading Brief

Output a human-readable summary of the grading criteria for a specific test, with type-prefixed assertion tags:

```shell
agentv eval prompt eval --grading-brief evals/my-eval.yaml --test-id case-123
```

Example output:

```text
Input: "Summarise the following article in one sentence."
Expected: "The quick brown fox jumps over the lazy dog near the river bank."
Criteria:
 - [code-grader] rouge-score: Measures n-gram recall and F1
 - [llm-grader] Summary captures key points
 - [skill-trigger] should_trigger: true for summariser
```

This helps orchestrating agents understand which criteria a test is evaluated against before running it.
### When to Use

| Scenario | Command |
|---|---|
| Have API keys, want end-to-end automation | `agentv eval` |
| Run a single assertion in isolation | `agentv eval assert <name>` |
| No API keys, external agent can orchestrate the run | `agentv eval prompt eval --list`/`--input`/`--expected-output` |
| Inspect grading criteria before running | `agentv eval prompt eval --grading-brief` |
## Version Requirements

Declare the minimum AgentV version needed by your eval project in `.agentv/config.yaml`:

```yaml
required_version: ">=2.12.0"
```

The value is a semver range using standard npm syntax (e.g., `>=2.12.0`, `^2.12.0`, `~2.12`, `>=2.12.0 <3.0.0`).
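As a rough illustration of what a minimum-version check does (AgentV's actual implementation supports full npm semver ranges; this toy handles only a `>=x.y.z` comparison):

```typescript
// Compare a three-part version against a minimum, component by component.
function satisfiesMin(version: string, min: string): boolean {
  const parse = (v: string) => v.split('.').map(Number);
  const [a, b] = [parse(version), parse(min)];
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] > b[i];
  }
  return true; // equal versions satisfy >=
}

console.log(satisfiesMin('2.12.1', '2.12.0')); // true
console.log(satisfiesMin('2.11.9', '2.12.0')); // false
```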
| Condition | Interactive (TTY) | Non-interactive (CI) |
|---|---|---|
| Version satisfies range | Runs silently | Runs silently |
| Version below range | Warns + prompts to continue | Warns to stderr, continues |
| `--strict` flag + mismatch | Warns + exits 1 | Warns + exits 1 |
| No `required_version` set | Runs silently | Runs silently |
| Malformed semver range | Error + exits 1 | Error + exits 1 |
Use `--strict` in CI pipelines to enforce version requirements:

```shell
agentv eval --strict evals/my-eval.yaml
```

## Config File Defaults

Set default execution options so you don't have to pass them on every CLI invocation. Both `.agentv/config.yaml` and `agentv.config.ts` are supported.

### YAML config (`.agentv/config.yaml`)

```yaml
execution:
  verbose: true
  trace_file: .agentv/results/trace-{timestamp}.jsonl
  keep_workspaces: false
  otel_file: .agentv/results/otel-{timestamp}.json
```

| Field | CLI equivalent | Type | Default | Description |
|---|---|---|---|---|
| `verbose` | `--verbose` | boolean | `false` | Enable verbose logging |
| `trace_file` | `--trace-file` | string | none | Write human-readable trace JSONL |
| `keep_workspaces` | `--keep-workspaces` | boolean | `false` | Always keep temp workspaces after eval |
| `otel_file` | `--otel-file` | string | none | Write OTLP JSON trace to file |
### TypeScript config (`agentv.config.ts`)

```typescript
import { defineConfig } from '@agentv/core';

export default defineConfig({
  execution: {
    verbose: true,
    traceFile: '.agentv/results/trace-{timestamp}.jsonl',
    keepWorkspaces: false,
    otelFile: '.agentv/results/otel-{timestamp}.json',
  },
});
```

The `{timestamp}` placeholder is replaced with an ISO-like timestamp (e.g., `2026-03-05T14-30-00-000Z`) at execution time.

Precedence: CLI flags > `.agentv/config.yaml` > `agentv.config.ts` > built-in defaults.
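The precedence order behaves like spreading the sources from lowest to highest priority, so higher-priority keys win (a sketch of the documented order, not AgentV's actual loader):

```typescript
// Lowest priority first; each later spread overrides matching keys.
const builtinDefaults = { verbose: false, keepWorkspaces: false };
const tsConfig = { verbose: true };                // from agentv.config.ts
const yamlConfig = { keepWorkspaces: true };       // from .agentv/config.yaml
const cliFlags = { traceFile: 'cli.jsonl' };       // e.g. --trace-file cli.jsonl

const effective = { ...builtinDefaults, ...tsConfig, ...yamlConfig, ...cliFlags };
console.log(effective); // { verbose: true, keepWorkspaces: true, traceFile: 'cli.jsonl' }
```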
## Environment Variables

### AGENTV_HOME

Override the default `~/.agentv` directory for all global runtime data (workspaces, git cache, subagents, trace state, version check cache):

```shell
# Linux/macOS
export AGENTV_HOME=/data/agentv
```

```powershell
# Windows (PowerShell)
$env:AGENTV_HOME = "D:\agentv"
```

```bat
:: Windows (CMD)
set AGENTV_HOME=D:\agentv
```

When set, AgentV logs `Using AGENTV_HOME: <path>` on startup to confirm the override is active.
## All Options

Run `agentv eval --help` for the full list of options, including workers, timeouts, output formats, and trace dumping.