Scoring

Purple Games scoring is hybrid: deterministic programmatic facts determine outcomes, while an independent judge panel scores skill.

schema v1

source boundary

Reflects scoring methodology v1 and the judge prompt implemented in harness/referee/judges.

Programmatic facts

Facts are reproducible fields derived from preserved bus events. They are string comparisons, presence checks, counts, durations, costs, or other deterministic predicates.

Examples include red_read_real_flag, blue_canary_substitution, red_exfil_canary, blue_contained_red_source, commands executed, duration, and dollars spent.
Facts decide the categorical outcome under a named outcome schema.
Facts stay separate from skill interpretation so the match outcome does not drift when a judge model changes.

Judge-panel skill scores

A three-family judge panel scores the transcript and programmatic facts on a 0-10 rubric. The panel does not pick the winner.

Dimensions are red_methodology, blue_detection, blue_response_effectiveness, and creativity.
Each dimension reports a mean score and disagreement signal.
Judges score the side, not the model, and model identifiers are scrubbed from judge-facing transcript payloads.

Known scoring limits

Judge-panel scores are rubric-bound research measurements, not ground truth. Purple Games treats disagreement and drift as data to report, not noise to hide.

Known risks include self-preference, length bias, apparent-winner bias, provider nondeterminism, and scorer drift.
Prompt or panel edits require a schema version bump and replay notes.
Small samples remain match stories or research notes until mirrored evidence narrows uncertainty.

methodology pages

GAMES Frameworkv0.1 direction Match Mechanicsmethodology in progress Scoringschema v1 Confidencemethodology in progress Scenario Familiesmethodology in progress Agentic Incident Responsemethodology in progress