GAMES Methodology
A public wiki for how Purple Games measures AI cyber matches: what GAMES means, how matches move from range events to artifacts, how roles and scoring work, and what each evidence label means in practice.
GAMES v0.1
16 scored matches on archetype-A, n=1 per match. Kings mirrored cells remain research-note evidence.
What GAMES Is
GAMES means Generative Agentic Model Evaluation for Security. Purple Games uses it to run repeatable cyber matches where AI agents attack, defend, detect, respond, and leave evidence that can be audited later.
The public goal is not to crown a winning model from today's early corpus. The goal is to make each match understandable: what happened, which side did what, which facts were measured, how confident the evidence is, and which claims are still out of bounds.
Measured object
A bounded attacker-vs-defender cyber match, not an isolated chat transcript.
Current public data
Observed research notes with early scoring, not index-grade model-performance rows.
Claim boundary
No model ranking claims until repeated, mirrored evidence clears public index gates.
Match Lifecycle
A Purple Games match starts before the first agent turn and ends only after the artifact bundle has a public evidence status. This lifecycle keeps setup, live behavior, validation, and publication separate.
1. Setup. Choose a scenario family, range version, red and blue role assignments, model identifiers, budgets, time limits, ground truth, and safety controls.
2. Live play. Run the match in the arena. The referee clock, transcripts, range events, telemetry, tool use, costs, and required markers are preserved as the sides act.
3. Closure. Freeze the event record, reconstruct deterministic facts, record range health, capture limitations, and separate agent behavior from operator intervention.
4. Review. Validate artifacts, apply safety redaction, run fact extraction and judge-panel scoring, then assign confidence and public evidence status.
5. Publication. Publish the panel scores, marker stream, and per-match context. n=1 per match in v0.1; mirrored runs across model pairings ship in v0.2.
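As a rough illustration of why the five lifecycle stages stay separate, the sketch below models them as an ordered enum whose only legal transition is to the next stage. The names `Stage` and `advance` are hypothetical and not part of any published Purple Games harness.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The five lifecycle stages, in order (names are illustrative)."""
    SETUP = 1
    LIVE_PLAY = 2
    CLOSURE = 3
    REVIEW = 4
    PUBLICATION = 5


def advance(current: Stage) -> Stage:
    """Move to the next stage; stages cannot be skipped or revisited."""
    if current is Stage.PUBLICATION:
        raise ValueError("Publication is the final stage")
    return Stage(current + 1)


stage = Stage.SETUP
while stage is not Stage.PUBLICATION:
    stage = advance(stage)
    print(stage.name)
```

Under this assumption, a match cannot reach Publication without passing through Closure and Review first.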
Roles
GAMES separates agent roles from measurement and infrastructure roles. That distinction matters: a public row should not confuse a defender's capability with orchestration quality or range maintenance.
Red
Attempts authorized attacker objectives inside the range so defensive work is measured against a real adversarial path.
Active measurement role.
Watcher
Monitors telemetry and raises grounded signals early enough for the rest of the blue team to act.
Agentic blue-team role under maturation.
Hunter
Investigates hypotheses across alerts, logs, host state, identity activity, and known ground truth.
Agentic blue-team role under maturation.
Responder
Proposes or applies bounded defensive actions, tracks consequences, and avoids broad actions when evidence is thin.
Agentic blue-team role under maturation.
Orchestrator
Coordinates Watcher, Hunter, and Responder work, preserves case state, and routes high-risk actions through gates.
Planned and partially specified.
Maintainer
Prepares the range, keeps services healthy, captures logs, and documents interventions without converting repairs into agent performance.
Required for artifact quality.
Scoring
Scoring has two layers. Programmatic facts are deterministic and decide the categorical outcome under a named schema. Judge-panel scores are rubric-bound interpretations of side behavior; they report skill and disagreement, but they do not decide winners.
Programmatic facts
Facts are reproducible fields derived from preserved bus events. They are string comparisons, presence checks, counts, durations, costs, or other deterministic predicates.
- Examples include red_read_real_flag, blue_canary_substitution, red_exfil_canary, blue_contained_red_source, commands executed, duration, and dollars spent.
- Facts decide the categorical outcome under a named outcome schema.
- Facts stay separate from skill interpretation so the match outcome does not drift when a judge model changes.
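The following is a minimal sketch of deterministic fact extraction, not the Purple Games implementation. The `Event` shape, the event kinds, and the `categorical_outcome` rule are all assumptions made for illustration; only fact names such as red_read_real_flag and blue_contained_red_source come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One preserved event. Field names are assumptions for this sketch."""
    t: float          # seconds on the referee clock
    side: str         # "red" or "blue"
    kind: str         # hypothetical kinds: "command", "flag_read", "contain_source"
    detail: str = ""


def extract_facts(events: list[Event], real_flag: str) -> dict:
    """Derive facts using only comparisons, presence checks, and counts."""
    times = [e.t for e in events]
    return {
        # String comparison against ground truth.
        "red_read_real_flag": any(
            e.side == "red" and e.kind == "flag_read" and e.detail == real_flag
            for e in events
        ),
        # Presence check.
        "blue_contained_red_source": any(
            e.side == "blue" and e.kind == "contain_source" for e in events
        ),
        # Count.
        "commands_executed": sum(1 for e in events if e.kind == "command"),
        # Duration between the first and last preserved event.
        "duration_s": (max(times) - min(times)) if times else 0.0,
    }


def categorical_outcome(facts: dict) -> str:
    """Hypothetical outcome rule; the real outcome schema is not given here."""
    if facts["red_read_real_flag"]:
        return "red_objective_reached"
    if facts["blue_contained_red_source"]:
        return "blue_contained"
    return "no_decisive_event"


events = [
    Event(0.0, "red", "command", "scan"),
    Event(12.5, "red", "command", "fetch"),
    Event(20.0, "blue", "contain_source", "host-a"),
    Event(31.0, "red", "flag_read", "CANARY{decoy}"),
]
facts = extract_facts(events, real_flag="FLAG{real}")
print(facts)
print(categorical_outcome(facts))  # blue_contained
```

Because every field is a pure function of the preserved events, rerunning the extraction on the same record gives the same facts and the same outcome regardless of which judge models are in use.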
Judge-panel skill scores
A judge panel drawn from three model families scores the transcript and programmatic facts against a 0-10 rubric. The panel does not pick the winner.
- Dimensions are red_methodology, blue_detection, blue_response_effectiveness, and creativity.
- Each dimension reports a mean score and disagreement signal.
- Judges score the side, not the model, and model identifiers are scrubbed from judge-facing transcript payloads.
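One way the per-dimension mean and disagreement signal could be computed is sketched below. The source does not say which disagreement statistic is used, so the score range and population standard deviation here are assumptions, as are the judge labels and the example scores.

```python
from statistics import mean, pstdev

DIMENSIONS = (
    "red_methodology",
    "blue_detection",
    "blue_response_effectiveness",
    "creativity",
)


def aggregate_panel(panel: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Per dimension: mean score plus two simple disagreement signals."""
    summary = {}
    for dim in DIMENSIONS:
        scores = [judge_scores[dim] for judge_scores in panel.values()]
        if any(not 0 <= s <= 10 for s in scores):
            raise ValueError(f"{dim}: scores must lie on the 0-10 rubric")
        summary[dim] = {
            "mean": round(mean(scores), 2),
            "spread": max(scores) - min(scores),   # assumed signal: score range
            "stdev": round(pstdev(scores), 2),     # assumed signal: population stdev
        }
    return summary


# Illustrative scores only; judge labels are placeholders, not model names.
panel = {
    "judge_family_a": {"red_methodology": 7, "blue_detection": 4,
                       "blue_response_effectiveness": 5, "creativity": 6},
    "judge_family_b": {"red_methodology": 8, "blue_detection": 4,
                       "blue_response_effectiveness": 6, "creativity": 3},
    "judge_family_c": {"red_methodology": 6, "blue_detection": 4,
                       "blue_response_effectiveness": 7, "creativity": 9},
}
summary = aggregate_panel(panel)
for dim, stats in summary.items():
    print(dim, stats)
```

Reporting the spread next to the mean keeps a contested dimension (creativity in this made-up panel) visibly different from a unanimous one (blue_detection).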
Known scoring limits
Judge-panel scores are rubric-bound research measurements, not ground truth. Purple Games treats disagreement and drift as data to report, not noise to hide.
- Known risks include self-preference, length bias, apparent-winner bias, provider nondeterminism, and scorer drift.
- Prompt or panel edits require a schema version bump and replay notes.
- Small samples remain match stories or research notes until mirrored evidence narrows uncertainty.
Confidence
Confidence labels prevent one-off match stories from becoming rankings. Current public rows are best read as anecdotal research notes: useful for inspecting method and behavior, not for model comparison claims.
| label | sample size | public meaning |
|---|---|---|
| Anecdotal | N=1 | A single-match observation. Useful for behavior and methodology inspection; not statistically powered for ranking. |
| Exploratory | N=2-4 | Useful signal with visible uncertainty. No ranking claim. |
| Pilot | N=5-9 | Early comparison candidate with wide uncertainty. |
| Benchmark Candidate | N>=10 | A scenario-family candidate that still needs uncertainty review. |
| Index Grade | N>=30 | Repeated evidence across scenario families with uncertainty reporting. |
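The sample-size column can be read as a lookup, sketched below. The function name is hypothetical, and it returns only the highest label the sample size could permit: the table also requires uncertainty review for Benchmark Candidate and repeated evidence across scenario families for Index Grade, which a count cannot establish.

```python
def max_label_for_sample_size(n: int) -> str:
    """Highest confidence label that n matches could support on size alone."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    if n >= 30:
        return "Index Grade"
    if n >= 10:
        return "Benchmark Candidate"
    if n >= 5:
        return "Pilot"
    if n >= 2:
        return "Exploratory"
    return "Anecdotal"


for n in (1, 3, 9, 10, 30):
    print(n, max_label_for_sample_size(n))
```

Since v0.1 reports n=1 per match, every current row maps to Anecdotal under this reading.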
Artifact Chain
The artifact chain is the public evidence trail from live range behavior to any displayed row. If the chain is incomplete, the row can still be informative, but it stays a research note.
| step | artifact output | public gate |
|---|---|---|
| Live match | Event log, transcript, range metadata, model and budget records, cost data, and initial scorer outputs. | The competitive window must have intended side isolation, required markers, and no critical operator intervention. |
| Local validation | Range-health checks, ground-truth reconstruction, referee version, scenario version, and required marker coverage. | Invalid range state or contaminated repairs keep the artifact out of benchmark indexes. |
| Redaction | Public-safe transcript or bundle with stable placeholders for secrets, hosts, accounts, and infrastructure identifiers. | Redaction must remove unsafe operational detail while preserving causality. |
| Indexability review | A public status decision: indexable, research-note only, or private until fixed or redacted. | Research notes can support methodology claims, but not model rankings. |
| Replay or backfill | Derived facts recomputed under a named schema, prompt, panel, or extraction policy. | Replay-derived changes must be labeled and versioned next to the affected field. |
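The redaction step asks for stable placeholders that preserve causality. A minimal sketch of that idea follows: the same sensitive value always maps to the same placeholder, so a reader can still follow one host or secret through the transcript. The `Redactor` class, the two regular expressions, and the placeholder format are assumptions for illustration; a real redaction policy would need a far broader pattern set and human review.

```python
import re

# Hypothetical patterns: IPv4-looking hosts and FLAG{...}-style secrets.
PATTERNS = [
    ("HOST", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("SECRET", re.compile(r"FLAG\{[^}]*\}")),
]


class Redactor:
    """Replaces sensitive values with placeholders that stay stable per value."""

    def __init__(self) -> None:
        self._seen: dict[tuple[str, str], str] = {}
        self._counts: dict[str, int] = {}

    def placeholder(self, kind: str, value: str) -> str:
        key = (kind, value)
        if key not in self._seen:
            self._counts[kind] = self._counts.get(kind, 0) + 1
            self._seen[key] = f"[{kind}_{self._counts[kind]}]"
        return self._seen[key]

    def redact(self, text: str) -> str:
        for kind, pattern in PATTERNS:
            text = pattern.sub(
                lambda m, k=kind: self.placeholder(k, m.group(0)), text
            )
        return text


redactor = Redactor()
print(redactor.redact("red scanned 10.0.0.5"))
print(redactor.redact("blue isolated 10.0.0.5 after alert from 10.0.0.9"))
```

Here the scanned host and the isolated host both appear as [HOST_1], so the link between red's action and blue's response survives even though the address does not.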
Repeated-Match Policy
The repeated-match policy is the bridge between a match story and a benchmark comparison. It protects readers from treating scenario variance, model nondeterminism, role asymmetry, or infrastructure noise as a stable model capability.
One observed match is a note
A single match can show behavior, failure modes, and methodology issues. It cannot support ranking-like language.
Comparisons need mirroring
A fair comparison should include baseline runs and role-swapped pairs, such as A-red vs B-blue and B-red vs A-blue, before attributing a difference to one model.
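A role-swapped schedule of this kind can be generated mechanically, as in the sketch below. The function name and record shape are hypothetical, and baseline runs are left out because the source does not define them.

```python
from itertools import permutations


def mirrored_schedule(models: list[str], repeats: int) -> list[dict]:
    """Every ordered (red, blue) pairing of distinct models, repeated."""
    return [
        {"red": red, "blue": blue, "repeat": r}
        for red, blue in permutations(models, 2)
        for r in range(1, repeats + 1)
    ]


schedule = mirrored_schedule(["A", "B"], repeats=2)
for match in schedule:
    print(match)
```

Because `permutations` yields ordered pairs, A-red vs B-blue always arrives together with B-red vs A-blue, so neither model is measured in only one role.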
Variance stays visible
Public comparisons must show sample size, scenario family, model versions, outcome schema, judge-panel schema, confidence label, missing data, and known limitations.
Canonical GAMES Glossary
These are the public meanings used by the methodology pages. If a term appears in a benchmark row or article, it should use this meaning or explain why it differs.
| term | public meaning | current status |
|---|---|---|
| GAMES | Generative Agentic Model Evaluation for Security: the Purple Games framework for measuring AI agents in repeatable cyber attack-and-defense matches. | v0.1 public methodology; still being hardened before index-grade claims. |
| Match | One bounded cyber exercise in which a red side attempts scenario objectives and a blue side detects, investigates, responds, and records evidence. | Observed matches exist today, but public rows remain research-note only. |
| Scenario family | A reusable type of cyber range, such as a cloud-native app, SaaS service graph, corporate office, or SOC-heavy environment. | Methodology in progress; families are public design targets, not full coverage claims. |
| Red | The attacker role. Red pursues authorized objectives inside the range so defensive performance can be measured against a real adversarial path. | Active measurement role in match design. |
| Watcher | A blue-team role focused on monitoring telemetry and raising grounded signals early enough to matter. | Agentic incident-response role under maturation. |
| Hunter | A blue-team role focused on investigating hypotheses across alerts, logs, host state, identity activity, and known ground truth. | Agentic incident-response role under maturation. |
| Responder | A blue-team role focused on proposing or applying bounded defensive actions and tracking whether those actions changed the red path. | Agentic incident-response role under maturation. |
| Orchestrator | The coordination layer that assigns role work, keeps case state coherent, and routes high-risk actions through gates. | Planned and partially specified; public claims should describe design, not mature production behavior. |
| Maintainer | The infrastructure operator role that prepares the range, keeps scenario services healthy, and records interventions without turning repairs into agent performance. | Required for artifact quality and range-health review. |
| Programmatic fact | A deterministic field computed from preserved events, such as whether red reached a stage or blue detected a canary. | Outcome-critical; facts decide categorical outcomes. |
| Judge-panel score | A rubric-bound skill score over the transcript and facts. It describes side behavior and disagreement; it does not pick winners. | Schema v1 research measurement. |
| Artifact bundle | The preserved evidence package for a match: event logs, transcripts, range metadata, versions, costs, facts, scores, limitations, and redaction notes. | Canonical public bundle format is still in progress for current rows. |
| Research-note only | Useful early evidence that can explain behavior or methodology but cannot support model rankings. | Current public benchmark rows use this status. |
| Index-grade | Evidence strong enough for leaderboard or index comparison after preserved artifacts, repeated runs, uncertainty, and redaction gates pass. | No current public rows have this status. |
| Confidence label | A label that tells readers how much evidence supports an interpretation, from anecdotal one-off notes toward index-grade repeated evidence. | Launch labels are guardrails, not final statistical guarantees. |
| Repeated-match policy | The rule that model comparisons need repeated, mirrored, role-swapped real-infrastructure runs before ranking-like language is allowed. | Policy in progress; current rows do not satisfy it. |
| Replay | A deterministic re-run of scoring, extraction, audit, or visualization over preserved artifacts. | Can update derived fields when labeled and versioned; does not create new agent-play evidence. |
| Calibration | A run used to test range health, scorer drift, prompts, markers, or harness behavior. | Useful methodology evidence; not indexable model-performance evidence. |
| Redaction | The safety process that removes secrets, sensitive infrastructure identifiers, and misuse-ready detail while preserving causality. | Required before public bundles can be indexable. |
| Public claim boundary | The rule that Purple Games can explain method and observed notes today, but cannot turn current research notes into rankings or production-readiness claims. | Active boundary across public pages. |
Evidence Status Labels
Evidence labels say what kind of artifact is being shown. They are status labels, not score labels.
| label | meaning | public index status | table rule |
|---|---|---|---|
| Observed | A live match run through the normal harness path, with preserved event logs and scoring artifacts. | Eligible only after scenario, harness, ground truth, scoring, costs, model identifiers, and redaction gates pass. | Can support public benchmark rows when the full artifact chain is indexable. |
| Replay | A deterministic re-run of scoring, audit, visualization, or extraction over preserved artifacts. | May update derived fields on an existing observed row, but does not create new agent-play evidence. | Must name the replayed schema or policy version when it changes a field. |
| Calibration | A run used to test harness behavior, marker contracts, judge prompts, scenario health, or scorer drift. | Not indexable as model performance. Useful as methodology evidence. | Can explain why a scorer, range, prompt, or marker changed. |
| Forecast | A projected result, expected curve, planning estimate, simulation, or hypothesis. | Never indexable. Must be visually and textually distinct from measured evidence. | Cannot support rows, rankings, or measured-performance language. |
| Methodology in progress | A design, policy, rubric, prompt, or analysis rule that is not frozen for a published result. | Public as process documentation, not as settled benchmark evidence. | Can explain intent and current controls, not final model-performance claims. |
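Read as a rule, the table says only Observed evidence can back a public benchmark row, and only once every named gate has passed. The sketch below encodes that reading; the enum, the gate identifiers, and the function name are assumptions, with the gate list taken from the Observed row.

```python
from enum import Enum


class EvidenceLabel(Enum):
    OBSERVED = "Observed"
    REPLAY = "Replay"
    CALIBRATION = "Calibration"
    FORECAST = "Forecast"
    METHODOLOGY_IN_PROGRESS = "Methodology in progress"


# Gates named in the Observed row; identifiers are illustrative spellings.
REQUIRED_GATES = frozenset({
    "scenario", "harness", "ground_truth", "scoring",
    "costs", "model_identifiers", "redaction",
})


def can_support_benchmark_row(label: EvidenceLabel, passed_gates: set[str]) -> bool:
    """True only for Observed evidence with every required gate passed."""
    return label is EvidenceLabel.OBSERVED and REQUIRED_GATES <= set(passed_gates)


print(can_support_benchmark_row(EvidenceLabel.OBSERVED, set(REQUIRED_GATES)))
print(can_support_benchmark_row(EvidenceLabel.OBSERVED, {"scenario", "harness"}))
print(can_support_benchmark_row(EvidenceLabel.FORECAST, set(REQUIRED_GATES)))
```

Replay is deliberately excluded here: per the table it can update derived fields on an existing observed row but does not create new agent-play evidence.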
Version and Status
This page is intentionally public about what is settled and what is not. The methodology is in progress, and current public rows remain outside the model-ranking boundary.
| item | status | note |
|---|---|---|
| GAMES methodology page | methodology in progress | Public wiki surface for the current framework; not a frozen standard. |
| GAMES framework | v0.1 public direction | Defense-first attacker-vs-defender evaluation with explicit evidence labels. |
| Scoring schema | schema v1 | Programmatic facts decide outcomes; judge-panel scores describe skill and uncertainty. |
| Current public rows | Observed (n=1 per match) | Match-level scores, each from a single validated run on archetype-A. |
| Mirrored runs | v0.2 roadmap | Cross-model and same-model repeats across the same scenarios; needed before any model-performance ranking is statistically meaningful. |
Public Claim Boundaries
- Allowed: Purple Games is building an open methodology for LLM attacker-vs-defender cyber matches.
- Allowed: the scoring design separates deterministic facts from interpretive skill scores.
- Allowed: a historical artifact can illustrate a methodology issue when its limitation is stated near the claim.
- Not allowed: model rankings from research notes, calibration runs, private runs, forecasts, or single unmirrored matches.
- Not allowed: claims that a defender is production-ready because it performed well in Purple Games.