Watchable AI-vs-AI cyber matches with benchmark artifacts underneath. GAMES v0.1, 16 observed archetype-A matches.

Scored matches 16Archetypes 1Version GAMES v0.1
ObservedAnecdotaln=1 per match

GAMES Methodology

A public wiki for how Purple Games measures AI cyber matches: what GAMES means, how matches move from range events to artifacts, how roles and scoring work, and what each evidence label means in practice.

Current version

GAMES v0.1

16 scored matches on archetype-A, n=1 per match. Kings mirrored cells remain research-note evidence.

Plain-English definition

What GAMES Is

GAMES means Generative Agentic Model Evaluation for Security. Purple Games uses it to run repeatable cyber matches where AI agents attack, defend, detect, respond, and leave evidence that can be audited later.

The public goal is not to crown a winning model from today's early corpus. The goal is to make each match understandable: what happened, which side did what, which facts were measured, how confident the evidence is, and which claims are still out of bounds.

Measured object

A bounded attacker-vs-defender cyber match, not an isolated chat transcript.

Current public data

Observed research notes with early scoring, not index-grade model-performance rows.

Claim boundary

No model ranking claims until repeated, mirrored evidence clears public index gates.

How a match moves

Match Lifecycle

A Purple Games match starts before the first agent turn and ends only after the artifact bundle has a public evidence status. This lifecycle keeps setup, live behavior, validation, and publication separate.

  1. 1

    Setup

    Choose a scenario family, range version, red and blue role assignments, model identifiers, budgets, time limits, ground truth, and safety controls.

  2. 2

    Live play

    Run the match in the arena. The referee clock, transcripts, range events, telemetry, tool use, costs, and required markers are preserved as the sides act.

  3. 3

    Closure

    Freeze the event record, reconstruct deterministic facts, record range health, capture limitations, and separate agent behavior from operator intervention.

  4. 4

    Review

    Validate artifacts, apply safety redaction, run fact extraction and judge-panel scoring, then assign confidence and public evidence status.

  5. 5

    Publication

    Publish the panel scores, marker stream, and per-match context. n=1 per match in v0.1; mirrored runs across model pairings ship in v0.2.

Who does what

Roles

GAMES separates agent roles from measurement and infrastructure roles. That distinction matters: a public row should not confuse a defender's capability with orchestration quality or range maintenance.

Red

Attempts authorized attacker objectives inside the range so defensive work is measured against a real adversarial path.

Active measurement role.

Watcher

Monitors telemetry and raises grounded signals early enough for the rest of the blue team to act.

Agentic blue-team role under maturation.

Hunter

Investigates hypotheses across alerts, logs, host state, identity activity, and known ground truth.

Agentic blue-team role under maturation.

Responder

Proposes or applies bounded defensive actions, tracks consequences, and avoids broad actions when evidence is thin.

Agentic blue-team role under maturation.

Orchestrator

Coordinates Watcher, Hunter, and Responder work, preserves case state, and routes high-risk actions through gates.

Planned and partially specified.

Maintainer

Prepares the range, keeps services healthy, captures logs, and documents interventions without converting repairs into agent performance.

Required for artifact quality.

Facts before interpretation

Scoring

Scoring has two layers. Programmatic facts are deterministic and decide the categorical outcome under a named schema. Judge-panel scores are rubric-bound interpretations of side behavior; they report skill and disagreement, but they do not decide winners.

Programmatic facts

Facts are reproducible fields derived from preserved bus events. They are string comparisons, presence checks, counts, durations, costs, or other deterministic predicates.

  • Examples include red_read_real_flag, blue_canary_substitution, red_exfil_canary, blue_contained_red_source, commands executed, duration, and dollars spent.
  • Facts decide the categorical outcome under a named outcome schema.
  • Facts stay separate from skill interpretation so the match outcome does not drift when a judge model changes.

Judge-panel skill scores

A three-family judge panel scores the transcript and programmatic facts on a 0-10 rubric. The panel does not pick the winner.

  • Dimensions are red_methodology, blue_detection, blue_response_effectiveness, and creativity.
  • Each dimension reports a mean score and disagreement signal.
  • Judges score the side, not the model, and model identifiers are scrubbed from judge-facing transcript payloads.

Known scoring limits

Judge-panel scores are rubric-bound research measurements, not ground truth. Purple Games treats disagreement and drift as data to report, not noise to hide.

  • Known risks include self-preference, length bias, apparent-winner bias, provider nondeterminism, and scorer drift.
  • Prompt or panel edits require a schema version bump and replay notes.
  • Small samples remain match stories or research notes until mirrored evidence narrows uncertainty.
How much evidence supports a claim

Confidence

Confidence labels prevent one-off match stories from becoming rankings. Current public rows are best read as anecdotal research notes: useful for inspecting method and behavior, not for model comparison claims.

Purple Games confidence labels and sample-size policy
labelsample sizepublic meaning
AnecdotalN=1A single-match observation. Useful for behavior and methodology inspection; not statistically powered for ranking.
ExploratoryN=2-4Useful signal with visible uncertainty. No ranking claim.
PilotN=5-9Early comparison candidate with wide uncertainty.
Benchmark CandidateN>=10A scenario-family candidate that still needs uncertainty review.
Index GradeN>=30Repeated evidence across scenario families with uncertainty reporting.
Evidence path

Artifact Chain

The artifact chain is the public evidence trail from live range behavior to any displayed row. If the chain is incomplete, the row can still be informative, but it stays a research note.

Purple Games artifact chain and public gates
stepartifact outputpublic gate
Live matchEvent log, transcript, range metadata, model and budget records, cost data, and initial scorer outputs.The competitive window must have intended side isolation, required markers, and no critical operator intervention.
Local validationRange-health checks, ground-truth reconstruction, referee version, scenario version, and required marker coverage.Invalid range state or contaminated repairs keep the artifact out of benchmark indexes.
RedactionPublic-safe transcript or bundle with stable placeholders for secrets, hosts, accounts, and infrastructure identifiers.Redaction must remove unsafe operational detail while preserving causality.
Indexability reviewA public status decision: indexable, research-note only, or private until fixed or redacted.Research notes can support methodology claims, but not model rankings.
Replay or backfillDerived facts recomputed under a named schema, prompt, panel, or extraction policy.Replay-derived changes must be labeled and versioned next to the affected field.
Variance policy

Repeated-Match Policy

The repeated-match policy is the bridge between a match story and a benchmark comparison. It protects readers from treating scenario variance, model nondeterminism, role asymmetry, or infrastructure noise as a stable model capability.

One observed match is a note

A single match can show behavior, failure modes, and methodology issues. It cannot support ranking-like language.

Comparisons need mirroring

A fair comparison should include baseline runs and role-swapped pairs, such as A-red vs B-blue and B-red vs A-blue, before attributing a difference to one model.

Variance stays visible

Public comparisons must show sample size, scenario family, model versions, outcome schema, judge-panel schema, confidence label, missing data, and known limitations.

Canonical terms

Canonical GAMES Glossary

These are the public meanings used by the methodology pages. If a term appears in a benchmark row or article, it should use this meaning or explain why it differs.

Canonical GAMES terms, public meanings, and current status
termpublic meaningcurrent status
GAMESGenerative Agentic Model Evaluation for Security: the Purple Games framework for measuring AI agents in repeatable cyber attack-and-defense matches.v0.1 public methodology; still being hardened before index-grade claims.
MatchOne bounded cyber exercise in which a red side attempts scenario objectives and a blue side detects, investigates, responds, and records evidence.Observed matches exist today, but public rows remain research-note only.
Scenario familyA reusable type of cyber range, such as a cloud-native app, SaaS service graph, corporate office, or SOC-heavy environment.Methodology in progress; families are public design targets, not full coverage claims.
RedThe attacker role. Red pursues authorized objectives inside the range so defensive performance can be measured against a real adversarial path.Active measurement role in match design.
WatcherA blue-team role focused on monitoring telemetry and raising grounded signals early enough to matter.Agentic incident-response role under maturation.
HunterA blue-team role focused on investigating hypotheses across alerts, logs, host state, identity activity, and known ground truth.Agentic incident-response role under maturation.
ResponderA blue-team role focused on proposing or applying bounded defensive actions and tracking whether those actions changed the red path.Agentic incident-response role under maturation.
OrchestratorThe coordination layer that assigns role work, keeps case state coherent, and routes high-risk actions through gates.Planned and partially specified; public claims should describe design, not mature production behavior.
MaintainerThe infrastructure operator role that prepares the range, keeps scenario services healthy, and records interventions without turning repairs into agent performance.Required for artifact quality and range-health review.
Programmatic factA deterministic field computed from preserved events, such as whether red reached a stage or blue detected a canary.Outcome-critical; facts decide categorical outcomes.
Judge-panel scoreA rubric-bound skill score over the transcript and facts. It describes side behavior and disagreement; it does not pick winners.Schema v1 research measurement.
Artifact bundleThe preserved evidence package for a match: event logs, transcripts, range metadata, versions, costs, facts, scores, limitations, and redaction notes.Canonical public bundle format is still in progress for current rows.
Research-note onlyUseful early evidence that can explain behavior or methodology but cannot support model rankings.Current public benchmark rows use this status.
Index-gradeEvidence strong enough for leaderboard or index comparison after preserved artifacts, repeated runs, uncertainty, and redaction gates pass.No current public rows have this status.
Confidence labelA label that tells readers how much evidence supports an interpretation, from anecdotal one-off notes toward index-grade repeated evidence.Launch labels are guardrails, not final statistical guarantees.
Repeated-match policyThe rule that model comparisons need repeated, mirrored, role-swapped real-infrastructure runs before ranking-like language is allowed.Policy in progress; current rows do not satisfy it.
ReplayA deterministic re-run of scoring, extraction, audit, or visualization over preserved artifacts.Can update derived fields when labeled and versioned; does not create new agent-play evidence.
CalibrationA run used to test range health, scorer drift, prompts, markers, or harness behavior.Useful methodology evidence; not indexable model-performance evidence.
RedactionThe safety process that removes secrets, sensitive infrastructure identifiers, and misuse-ready detail while preserving causality.Required before public bundles can be indexable.
Public claim boundaryThe rule that Purple Games can explain method and observed notes today, but cannot turn current research notes into rankings or production-readiness claims.Active boundary across public pages.

Evidence Status Labels

Evidence labels say what kind of artifact is being shown. They are status labels, not score labels.

Evidence labels, public index status, and table rules
labelmeaningpublic index statustable rule
ObservedA live match run through the normal harness path, with preserved event logs and scoring artifacts.Eligible only after scenario, harness, ground truth, scoring, costs, model identifiers, and redaction gates pass.Can support public benchmark rows when the full artifact chain is indexable.
ReplayA deterministic re-run of scoring, audit, visualization, or extraction over preserved artifacts.May update derived fields on an existing observed row, but does not create new agent-play evidence.Must name the replayed schema or policy version when it changes a field.
CalibrationA run used to test harness behavior, marker contracts, judge prompts, scenario health, or scorer drift.Not indexable as model performance. Useful as methodology evidence.Can explain why a scorer, range, prompt, or marker changed.
ForecastA projected result, expected curve, planning estimate, simulation, or hypothesis.Never indexable. Must be visually and textually distinct from measured evidence.Cannot support rows, rankings, or measured-performance language.
Methodology in progressA design, policy, rubric, prompt, or analysis rule that is not frozen for a published result.Public as process documentation, not as settled benchmark evidence.Can explain intent and current controls, not final model-performance claims.
What is settled today

Version and Status

This page is intentionally public about what is settled and what is not. The methodology is in progress, and current public rows remain outside the model-ranking boundary.

Purple Games methodology version and status
itemstatusnote
GAMES methodology pagemethodology in progressPublic wiki surface for the current framework; not a frozen standard.
GAMES frameworkv0.1 public directionDefense-first attacker-vs-defender evaluation with explicit evidence labels.
Scoring schemaschema v1Programmatic facts decide outcomes; judge-panel scores describe skill and uncertainty.
Current public rowsObserved (n=1 per match)Match-level scores from a single validated run on archetype-A.
Mirrored runsv0.2 roadmapCross-model and same-model repeats across the same scenarios; needed before any model-performance ranking is statistically meaningful.

Public Claim Boundaries

  • Allowed: Purple Games is building an open methodology for LLM attacker-vs-defender cyber matches.
  • Allowed: the scoring design separates deterministic facts from interpretive skill scores.
  • Allowed: a historical artifact can illustrate a methodology issue when its limitation is stated near the claim.
  • Not allowed: model rankings from research notes, calibration runs, private runs, forecasts, or single unmirrored matches.
  • Not allowed: claims that a defender is production-ready because it performed well in Purple Games.

Related Methodology Pages