Methodology
Research documentation for the Light Research Atlas. Purple Games measures LLM attacker-vs-defender cyber matches with defense as the primary goal, then separates observed facts, judge-panel interpretations, confidence, and public claim boundaries.
current evidence
The public corpus is real-match evidence, but current rows remain research notes until canonical bundles, redaction, and replication gates are complete.
scoring boundary
Programmatic facts determine outcome categories. Judge-panel scores describe side behavior and uncertainty; they do not decide winners.
claim posture
Methodology pages explain how claims become possible later. They do not convert current research-note rows into model rankings.
public claim boundary
The public site can describe the research program, the current scoring design, and known limitations. It cannot turn forecasts, calibration runs, private artifacts, or research-note rows into leaderboard claims. If a sentence could be read as a model ranking, it needs indexable observed evidence and visible uncertainty.
Rows on /benchmarks are research-note only while canonical bundles, redaction, and indexability hardening remain in progress.
methodology index
evidence labels
| label | meaning | public index status | table rule |
|---|---|---|---|
| Observed | A live match run through the normal harness path, with preserved event logs and scoring artifacts. | Eligible only after scenario, harness, ground truth, scoring, costs, model identifiers, and redaction gates pass. | Can support public benchmark rows when the full artifact chain is indexable. |
| Replay | A deterministic re-run of scoring, audit, visualization, or extraction over preserved artifacts. | May update derived fields on an existing observed row, but does not create new agent-play evidence. | Must name the replayed schema or policy version when it changes a field. |
| Calibration | A run used to test harness behavior, marker contracts, judge prompts, scenario health, or scorer drift. | Not indexable as model performance. Useful as methodology evidence. | Can explain why a scorer, range, prompt, or marker changed. |
| Forecast | A projected result, expected curve, planning estimate, simulation, or hypothesis. | Never indexable. Must be visually and textually distinct from measured evidence. | Cannot support rows, rankings, or measured-performance language. |
| Methodology in progress | A design, policy, rubric, prompt, or analysis rule that is not frozen for a published result. | Public as process documentation, not as settled benchmark evidence. | Can explain intent and current controls, not final model-performance claims. |
artifact lifecycle
| step | artifact output | public gate |
|---|---|---|
| Live match | Event log, transcript, range metadata, model and budget records, cost data, and initial scorer outputs. | The competitive window must have intended side isolation, required markers, and no critical operator intervention. |
| Local validation | Range-health checks, ground-truth reconstruction, referee version, scenario version, and required marker coverage. | Invalid range state or contaminated repairs keep the artifact out of benchmark indexes. |
| Redaction | Public-safe transcript or bundle with stable placeholders for secrets, hosts, accounts, and infrastructure identifiers. | Redaction must remove unsafe operational detail while preserving causality. |
| Indexability review | A public status decision: indexable, research-note only, or private until fixed or redacted. | Research notes can support methodology claims, but not model rankings. |
| Replay or backfill | Derived facts recomputed under a named schema, prompt, panel, or extraction policy. | Replay-derived changes must be labeled and versioned next to the affected field. |
scoring schema
Programmatic facts
Facts are reproducible fields derived from preserved bus events. They are string comparisons, presence checks, counts, durations, costs, or other deterministic predicates.
- Examples include red_read_real_flag, blue_canary_substitution, red_exfil_canary, blue_contained_red_source, commands executed, duration, and dollars spent.
- Facts decide the categorical outcome under a named outcome schema.
- Facts stay separate from skill interpretation so the match outcome does not drift when a judge model changes.
Judge-panel skill scores
A three-family judge panel scores the transcript and programmatic facts on a 0-10 rubric. The panel does not pick the winner.
- Dimensions are red_methodology, blue_detection, blue_response_effectiveness, and creativity.
- Each dimension reports a mean score and disagreement signal.
- Judges score the side, not the model, and model identifiers are scrubbed from judge-facing transcript payloads.
Known scoring limits
Judge-panel scores are rubric-bound research measurements, not ground truth. Purple Games treats disagreement and drift as data to report, not noise to hide.
- Known risks include self-preference, length bias, apparent-winner bias, provider nondeterminism, and scorer drift.
- Prompt or panel edits require a schema version bump and replay notes.
- Small samples remain match stories or research notes until mirrored evidence narrows uncertainty.
indexability and public claims
Indexability rules
- Observed rows need match.start and match.end, identifiable scenario and harness versions, preserved event logs, reproducible facts, and public-safe redaction.
- Calibration, forecast, contaminated, private, or incomplete artifacts cannot become model-performance rows.
- Unknown cells remain unavailable. They are not filled with expected values, illustrative placeholders, or synthetic representatives.
- Current benchmark rows are research-note only while canonical bundles, redaction, and indexability hardening are still in progress.
Public claim boundaries
- Allowed: Purple Games is building an open methodology for LLM attacker-vs-defender cyber matches.
- Allowed: the scoring design separates deterministic facts from interpretive skill scores.
- Allowed: a historical artifact can illustrate a methodology issue when its limitation is stated near the claim.
- Not allowed: model rankings from research notes, calibration runs, private runs, forecasts, or single unmirrored matches.
- Not allowed: claims that a defender is production-ready because it performed well in Purple Games.
safety and redaction
- Remove credentials, tokens, API keys, cookies, SSH keys, session identifiers, and provider resource IDs.
- Transform real IP addresses, hostnames, account names, email addresses, billing metadata, and cloud project identifiers unless intentionally synthetic.
- Paraphrase exploit strings or command sequences when the exact text would materially improve misuse outside the benchmark.
- Use stable placeholders such as REDACTED_TOKEN_1 or SCENARIO_HOST_A so readers can follow causality.
- Review screenshots and videos frame by frame for terminal history, dashboards, logs, browser bars, secrets, and account metadata.
benchmark table schema
| column | purpose |
|---|---|
| Match ID | Stable artifact reference for a live match or public research note. |
| Evidence label | Observed, Replay, Calibration, Forecast, or Methodology in progress. Replay-derived fields name the replay schema next to the field. |
| Scenario | Scenario or reusable archetype name, with version when available. |
| Red model | Attacker model identifier or a stated anonymization/unavailable note. |
| Blue model | Defender model identifier or a stated anonymization/unavailable note. |
| Outcome | Deterministic categorical result under a named outcome schema. |
| Outcome schema | Version of the deterministic fact and outcome contract used. |
| Skill scores | Judge-panel means and disagreement, if available. |
| Judge-panel schema | Panel composition and prompt/schema version for any skill scores shown. |
| Sample size | Number of observed runs behind a row or aggregate. |
| Missing data | Unavailable fields marked explicitly without implying a zero score. |
| Limitations | Known blockers, caveats, or reasons a row is research-note only. |
| Attribution confidence | Confidence that the result reflects model capability rather than noise. |
| Public status | Indexable, research-note only, or private summary. |