Datasets and Evaluation
Generate A Dataset
ghostlab generate-dataset \
  --profile runs/<id>-inspect/capabilities.json \
  --personas 3 \
  --scenarios-per-persona 3 \
  --seed 7 \
  --name cortex
This writes a self-contained directory:
datasets/cortex/
  dataset.json
  personas/<id>.json
  scenarios/<id>.json
The manifest (dataset.json) records the cases, seed, MCP identity, and status. The seed governs case ordering, so the same seed reproduces the same order.
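As an illustration of seed-governed ordering, here is a minimal sketch. It assumes a seeded shuffle over case IDs; how ghostlab actually derives the order is not specified here, and both `order_cases` and the `p1-s1` ID format are hypothetical.

```python
import random


def order_cases(case_ids, seed):
    """Return case IDs in an order that depends only on the IDs and the seed."""
    rng = random.Random(seed)   # private generator; global random state is untouched
    ordered = sorted(case_ids)  # canonical starting order, so input order is irrelevant
    rng.shuffle(ordered)
    return ordered


# Hypothetical IDs for 3 personas x 3 scenarios per persona = 9 cases.
ids = [f"p{p}-s{s}" for p in range(1, 4) for s in range(1, 4)]
first = order_cases(ids, seed=7)
second = order_cases(list(reversed(ids)), seed=7)
```

Both calls return the same order because the generator is seeded and the input is sorted before shuffling.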
Curate Before Running
ghostlab review-dataset \
  --dataset datasets/cortex \
  --profile runs/<id>-inspect/capabilities.json
The review command writes review.md and review.json with coverage, previews, and flags. To curate, either edit dataset.json directly or use the --approve, --reject, and --needs-edit options.
Evaluate Runs
ghostlab evaluate --run runs/<id> --capabilities runs/<id>-inspect/capabilities.json
Evaluation combines deterministic checks with a Codex LLM judge:
- Failed tool calls.
- Tool efficiency: total calls, unique tools, redundant calls (same tool with identical arguments), and per-call latency when the capture provides it.
- Expected-tool coverage from exercises.
- Success criteria met or unmet.
- Failure signals triggered or avoided.
- Claimed tools not exposed by the server, when capabilities are supplied.
- Golden assertions from a scenario's optional expected_outcome.
The command writes verdict.json and verdict.md.
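The tool-efficiency counts can be sketched as follows. This is an illustration only: the call-record shape (`tool` and `arguments` keys) is an assumption borrowed from the expected_tool_args format, not the documented capture format, and `count_redundant_calls` is a hypothetical helper.

```python
import json
from collections import Counter


def count_redundant_calls(calls):
    """Count calls that repeat an earlier (tool, arguments) pair."""
    seen = Counter()
    for call in calls:
        # Serialize arguments with sorted keys so key order does not matter.
        key = (call["tool"], json.dumps(call.get("arguments", {}), sort_keys=True))
        seen[key] += 1
    # Every occurrence beyond the first of a given pair is redundant.
    return sum(n - 1 for n in seen.values())


# Assumed record shape; the second call repeats the first exactly.
calls = [
    {"tool": "student_get_status", "arguments": {"id": "u1"}},
    {"tool": "student_get_status", "arguments": {"id": "u1"}},
    {"tool": "student_get_status", "arguments": {"id": "u2"}},
]
summary = {
    "total_calls": len(calls),
    "unique_tools": len({c["tool"] for c in calls}),
    "redundant_calls": count_redundant_calls(calls),
}
```

Here `summary` reports three total calls, one unique tool, and one redundant call.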
Golden assertions
For scenarios with a known-correct answer, add an expected_outcome block to the
scenario JSON for objective, judge-independent grading. A mismatch is a hard gate
that forces an overall fail:
"expected_outcome": {
  "must_include": ["band score", "7.5"],
  "must_not_include": ["error"],
  "expected_tool_args": [
    { "tool": "student_get_status", "arguments": { "id": "u1" } }
  ]
}
must_include / must_not_include are case-insensitive substrings checked
against the final assistant turn. Each expected_tool_args entry passes when the
run contains a call to that tool whose arguments include the given key/value pairs
(a subset match). The LLM judge still scores the open-ended success_criteria.
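The matching rules above can be sketched in a few lines. This is a minimal illustration, not ghostlab's implementation: `check_expected_outcome` and `args_subset` are hypothetical names, and the tool-call record shape is assumed to mirror the expected_tool_args entries.

```python
def args_subset(wanted, actual):
    """True when every wanted key is present in actual with an equal value."""
    return all(key in actual and actual[key] == value for key, value in wanted.items())


def check_expected_outcome(expected, final_text, tool_calls):
    """Return mismatch descriptions; an empty list means every assertion held."""
    mismatches = []
    text = final_text.lower()  # substrings are compared case-insensitively
    for needle in expected.get("must_include", []):
        if needle.lower() not in text:
            mismatches.append(f"missing required text: {needle!r}")
    for needle in expected.get("must_not_include", []):
        if needle.lower() in text:
            mismatches.append(f"found forbidden text: {needle!r}")
    for wanted in expected.get("expected_tool_args", []):
        # One call to the named tool whose arguments contain the wanted pairs is enough.
        if not any(
            call["tool"] == wanted["tool"]
            and args_subset(wanted.get("arguments", {}), call.get("arguments", {}))
            for call in tool_calls
        ):
            mismatches.append(f"no matching call to {wanted['tool']}")
    return mismatches


expected = {
    "must_include": ["band score", "7.5"],
    "must_not_include": ["error"],
    "expected_tool_args": [
        {"tool": "student_get_status", "arguments": {"id": "u1"}}
    ],
}
# The extra "verbose" argument is hypothetical; a subset match ignores it.
calls = [{"tool": "student_get_status", "arguments": {"id": "u1", "verbose": True}}]

ok = check_expected_outcome(expected, "Your Band Score is 7.5.", calls)
bad = check_expected_outcome(expected, "Error: no band score found.", [])
```

In this sketch `ok` is empty, while `bad` collects three mismatches: the missing "7.5", the forbidden "error", and the absent tool call. Under the hard-gate rule, any non-empty result would force an overall fail.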
Evaluate A Dataset
ghostlab run-dataset --dataset datasets/cortex \
  --target targets/cortex-local.json \
  --aut-runner runners/codex-cortex-local-session.json \
  --evaluate --capabilities runs/<id>-inspect/capabilities.json
Per-case verdicts are written into each run directory and aggregated into the dataset summary.
Scorecard A Dataset Run
Roll a whole dataset run up into one MCP validation report:
ghostlab scorecard --results runs/<id>-summary
It reads each case's run directory (verdict, critique, and tool calls when
present) and aggregates server-level signals into scorecard.json and
scorecard.md: pass rate, average tool coverage, average tool-ergonomics score,
per-tool failure rates, hallucinated-tool and golden-mismatch counts,
efficiency, and recurring tool-design recommendations. For the richest report,
run evaluate and critique on the cases first.
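Two of these signals, pass rate and per-tool failure rate, can be sketched as a simple roll-up. The per-case record shape (`verdict`, `tool_calls`, `failed`) and the tool name `student_list` are assumptions for illustration; the actual scorecard schema may differ.

```python
from collections import defaultdict


def aggregate(cases):
    """Roll per-case results up into a pass rate and per-tool failure rates."""
    passed = sum(1 for case in cases if case["verdict"] == "pass")
    calls = defaultdict(int)
    failures = defaultdict(int)
    for case in cases:
        # Tool calls are optional, matching "when present" in the text above.
        for call in case.get("tool_calls", []):
            calls[call["tool"]] += 1
            if call.get("failed"):
                failures[call["tool"]] += 1
    return {
        "pass_rate": passed / len(cases) if cases else 0.0,
        "tool_failure_rates": {tool: failures[tool] / n for tool, n in calls.items()},
    }


# Hypothetical per-case records.
cases = [
    {
        "verdict": "pass",
        "tool_calls": [
            {"tool": "student_get_status", "failed": False},
            {"tool": "student_get_status", "failed": True},
        ],
    },
    {"verdict": "fail", "tool_calls": [{"tool": "student_list", "failed": True}]},
]
report = aggregate(cases)
```

With these two cases, the pass rate is 0.5, `student_get_status` fails half the time, and `student_list` fails every time.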
Compare Dataset Runs
ghostlab compare --base runs/<base>-summary --candidate runs/<candidate>-summary \
  --output comparison.md
Comparison reports regressions first, then fixes, then other verdict or status changes. It exits non-zero when regressions are found so it can gate CI.
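The reporting order and exit code can be sketched as below. This assumes verdicts are plain strings and that a regression means a case that passed in the base run and no longer passes in the candidate; both assumptions, and the name `compare_verdicts`, are illustrative rather than taken from ghostlab.

```python
def compare_verdicts(base, candidate):
    """Group per-case changes as regressions, fixes, and other changes."""
    regressions, fixes, other = [], [], []
    # Only cases present in both runs are compared in this sketch.
    for case_id in sorted(base.keys() & candidate.keys()):
        before, after = base[case_id], candidate[case_id]
        if before == after:
            continue
        if before == "pass":
            regressions.append(case_id)  # passed before, does not pass now
        elif after == "pass":
            fixes.append(case_id)        # did not pass before, passes now
        else:
            other.append(case_id)        # e.g. fail -> error
    return regressions, fixes, other


# Hypothetical verdicts keyed by case ID.
base = {"case-1": "pass", "case-2": "fail", "case-3": "fail", "case-4": "pass"}
candidate = {"case-1": "fail", "case-2": "pass", "case-3": "error", "case-4": "pass"}

regressions, fixes, other = compare_verdicts(base, candidate)
report_order = regressions + fixes + other  # regressions first, then fixes, then the rest
exit_code = 1 if regressions else 0         # non-zero lets a CI step fail on regressions
```

A CI job can rely on the exit status alone, since most runners treat any non-zero exit as a failed step.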