Public snapshot: 9 real matches
Publication status: 8 scored research notes
Index-grade rows: 0

Methodology

Research documentation for the Light Research Atlas. Purple Games runs and scores LLM attacker-vs-defender cyber matches with defense as the primary goal, then keeps observed facts, judge-panel interpretations, confidence, and public-claim boundaries separate.

Public status: methodology in progress

current evidence

The public corpus is real-match evidence, but current rows remain research notes until canonical bundles, redaction, and replication gates are complete.

scoring boundary

Programmatic facts determine outcome categories. Judge-panel scores describe side behavior and uncertainty; they do not decide winners.

claim posture

Methodology pages explain how claims become possible later. They do not convert current research-note rows into model rankings.

public claim boundary

The public site can describe the research program, the current scoring design, and known limitations. It cannot turn forecasts, calibration runs, private artifacts, or research-note rows into leaderboard claims. If a sentence could be read as a model ranking, it needs indexable observed evidence and visible uncertainty.

current rows

Rows on /benchmarks are research-note only while canonical bundles, redaction, and indexability hardening remain in progress.

methodology index

evidence labels

Observed
  Meaning: A live match run through the normal harness path, with preserved event logs and scoring artifacts.
  Public index status: Eligible only after scenario, harness, ground truth, scoring, costs, model identifiers, and redaction gates pass.
  Table rule: Can support public benchmark rows when the full artifact chain is indexable.

Replay
  Meaning: A deterministic re-run of scoring, audit, visualization, or extraction over preserved artifacts.
  Public index status: May update derived fields on an existing observed row, but does not create new agent-play evidence.
  Table rule: Must name the replayed schema or policy version when it changes a field.

Calibration
  Meaning: A run used to test harness behavior, marker contracts, judge prompts, scenario health, or scorer drift.
  Public index status: Not indexable as model performance. Useful as methodology evidence.
  Table rule: Can explain why a scorer, range, prompt, or marker changed.

Forecast
  Meaning: A projected result, expected curve, planning estimate, simulation, or hypothesis.
  Public index status: Never indexable. Must be visually and textually distinct from measured evidence.
  Table rule: Cannot support rows, rankings, or measured-performance language.

Methodology in progress
  Meaning: A design, policy, rubric, prompt, or analysis rule that is not frozen for a published result.
  Public index status: Public as process documentation, not as settled benchmark evidence.
  Table rule: Can explain intent and current controls, not final model-performance claims.
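The label rules above reduce to a small gate: only observed evidence with a fully indexable artifact chain can back a public benchmark row. A minimal sketch, with hypothetical identifiers (the real corpus may name labels differently):

```python
from enum import Enum


class EvidenceLabel(Enum):
    # Illustrative names; not the production identifiers.
    OBSERVED = "observed"
    REPLAY = "replay"
    CALIBRATION = "calibration"
    FORECAST = "forecast"
    METHODOLOGY_IN_PROGRESS = "methodology_in_progress"


def can_support_benchmark_row(label: EvidenceLabel,
                              artifact_chain_indexable: bool) -> bool:
    """Only observed evidence with a fully indexable artifact chain
    can back a public benchmark row; every other label is ineligible."""
    return label is EvidenceLabel.OBSERVED and artifact_chain_indexable
```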

artifact lifecycle

Live match
  Artifact output: Event log, transcript, range metadata, model and budget records, cost data, and initial scorer outputs.
  Public gate: The competitive window must have intended side isolation, required markers, and no critical operator intervention.

Local validation
  Artifact output: Range-health checks, ground-truth reconstruction, referee version, scenario version, and required marker coverage.
  Public gate: Invalid range state or contaminated repairs keep the artifact out of benchmark indexes.

Redaction
  Artifact output: Public-safe transcript or bundle with stable placeholders for secrets, hosts, accounts, and infrastructure identifiers.
  Public gate: Redaction must remove unsafe operational detail while preserving causality.

Indexability review
  Artifact output: A public status decision: indexable, research-note only, or private until fixed or redacted.
  Public gate: Research notes can support methodology claims, but not model rankings.

Replay or backfill
  Artifact output: Derived facts recomputed under a named schema, prompt, panel, or extraction policy.
  Public gate: Replay-derived changes must be labeled and versioned next to the affected field.
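The lifecycle gates can be sketched as an ordered walk where the first failure decides the public status. This is a simplified model under assumed field names and status strings, not the production contract:

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    """Minimal stand-in for a match artifact moving through the pipeline.
    Field names are illustrative, not the real bundle schema."""
    side_isolation_ok: bool      # intended side isolation held
    markers_present: bool        # required markers observed
    operator_intervened: bool    # critical operator intervention occurred
    range_state_valid: bool      # local validation passed
    redaction_complete: bool     # public-safe bundle exists


def public_status(a: Artifact) -> str:
    """Walk the gates in lifecycle order; the first failing gate
    decides the status an artifact can hold."""
    if not (a.side_isolation_ok and a.markers_present) or a.operator_intervened:
        return "private"                 # live-match gate failed
    if not a.range_state_valid:
        return "research-note only"      # local validation failed
    if not a.redaction_complete:
        return "private until redacted"  # redaction gate pending
    return "indexable"
```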

scoring schema

Programmatic facts

Facts are reproducible fields derived from preserved bus events. They are string comparisons, presence checks, counts, durations, costs, or other deterministic predicates.

  • Examples include red_read_real_flag, blue_canary_substitution, red_exfil_canary, blue_contained_red_source, commands executed, duration, and dollars spent.
  • Facts decide the categorical outcome under a named outcome schema.
  • Facts stay separate from skill interpretation so the match outcome does not drift when a judge model changes.
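As a rough illustration, deterministic facts could be derived from preserved events like this. The event shapes (`kind`, `t`, `cost_usd`) and kind strings are assumptions, not the real bus schema:

```python
def derive_facts(events: list[dict]) -> dict:
    """Reduce preserved bus events to reproducible facts.
    Every field is a presence check, count, duration, or cost sum,
    so the same events always yield the same fact values."""
    kinds = [e["kind"] for e in events]
    return {
        "red_read_real_flag": "flag_read" in kinds,
        "red_exfil_canary": "canary_exfil" in kinds,
        "commands_executed": kinds.count("command"),
        "duration_s": events[-1]["t"] - events[0]["t"] if events else 0.0,
        "dollars_spent": sum(e.get("cost_usd", 0.0) for e in events),
    }
```

Because every fact is a deterministic predicate over the same preserved events, re-running this function never changes an outcome, which is what keeps outcomes stable when judge models change.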

Judge-panel skill scores

A three-family judge panel scores the transcript and programmatic facts on a 0-10 rubric. The panel does not pick the winner.

  • Dimensions are red_methodology, blue_detection, blue_response_effectiveness, and creativity.
  • Each dimension reports a mean score and disagreement signal.
  • Judges score the side, not the model, and model identifiers are scrubbed from judge-facing transcript payloads.
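Aggregating one rubric dimension across the three judge families might look like the following sketch, using population standard deviation as the disagreement signal (the real panel may use a different measure):

```python
from statistics import mean, pstdev


def aggregate_dimension(scores: dict[str, float]) -> dict:
    """Collapse per-judge-family 0-10 scores for one dimension into a
    mean and a disagreement signal. The panel never emits a winner;
    it only describes side behavior and uncertainty."""
    vals = list(scores.values())
    return {"mean": mean(vals), "disagreement": pstdev(vals)}
```

For example, `aggregate_dimension({"family_a": 7.0, "family_b": 8.0, "family_c": 6.0})` yields a mean of 7.0 with disagreement of about 0.82; reporting both keeps judge drift visible instead of hidden in an average.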

Known scoring limits

Judge-panel scores are rubric-bound research measurements, not ground truth. Purple Games treats disagreement and drift as data to report, not noise to hide.

  • Known risks include self-preference, length bias, apparent-winner bias, provider nondeterminism, and scorer drift.
  • Prompt or panel edits require a schema version bump and replay notes.
  • Small samples remain match stories or research notes until mirrored evidence narrows uncertainty.

indexability and public claims

Indexability rules

  • Observed rows need match.start and match.end, identifiable scenario and harness versions, preserved event logs, reproducible facts, and public-safe redaction.
  • Calibration, forecast, contaminated, private, or incomplete artifacts cannot become model-performance rows.
  • Unknown cells remain unavailable. They are not filled with expected values, illustrative placeholders, or synthetic representatives.
  • Current benchmark rows are research-note only while canonical bundles, redaction, and indexability hardening are still in progress.
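A hypothetical indexability check over the rules above; the required-field names beyond `match.start` and `match.end` are illustrative stand-ins for the real artifact chain:

```python
# Fields a row must carry with non-null values to be indexable.
# Only match.start / match.end come from this page; the rest are assumed names.
REQUIRED_FIELDS = {
    "match.start", "match.end", "scenario_version",
    "harness_version", "event_log", "facts", "redaction",
}


def is_indexable(row: dict) -> bool:
    """A row is indexable only if it is observed evidence and every
    required field is present. Unknown cells stay unavailable; they
    are never filled with expected or synthetic values."""
    if row.get("label") != "observed":
        return False  # calibration, forecast, private, etc. never qualify
    present = {k for k, v in row.items() if v is not None}
    return REQUIRED_FIELDS <= present
```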

Public claim boundaries

  • Allowed: Purple Games is building an open methodology for LLM attacker-vs-defender cyber matches.
  • Allowed: the scoring design separates deterministic facts from interpretive skill scores.
  • Allowed: a historical artifact can illustrate a methodology issue when its limitation is stated near the claim.
  • Not allowed: model rankings from research notes, calibration runs, private runs, forecasts, or single unmirrored matches.
  • Not allowed: claims that a defender is production-ready because it performed well in Purple Games.

safety and redaction

  • Remove credentials, tokens, API keys, cookies, SSH keys, session identifiers, and provider resource IDs.
  • Transform real IP addresses, hostnames, account names, email addresses, billing metadata, and cloud project identifiers unless intentionally synthetic.
  • Paraphrase exploit strings or command sequences when the exact text would materially aid misuse outside the benchmark.
  • Use stable placeholders such as REDACTED_TOKEN_1 or SCENARIO_HOST_A so readers can follow causality.
  • Review screenshots and videos frame by frame for terminal history, dashboards, logs, browser bars, secrets, and account metadata.
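Stable placeholders can be applied with a simple substitution pass. This sketch assumes the secrets are already enumerated; real redaction also needs detection and review, not just replacement:

```python
def redact(text: str, secrets: list[str]) -> tuple[str, dict]:
    """Replace each known secret with a stable placeholder such as
    REDACTED_TOKEN_1, so the same secret always maps to the same
    placeholder and readers can still follow causality."""
    mapping: dict[str, str] = {}
    for i, secret in enumerate(secrets, start=1):
        placeholder = f"REDACTED_TOKEN_{i}"
        mapping[secret] = placeholder
        text = text.replace(secret, placeholder)
    return text, mapping
```

Keeping the mapping stable within a transcript is the point: a reader can trace `REDACTED_TOKEN_1` through the whole attack chain without ever seeing the underlying value.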

benchmark table schema

Match ID: Stable artifact reference for a live match or public research note.
Evidence label: Observed, Replay, Calibration, Forecast, or Methodology in progress. Replay-derived fields name the replay schema next to the field.
Scenario: Scenario or reusable archetype name, with version when available.
Red model: Attacker model identifier, or an explicit anonymized/unavailable note.
Blue model: Defender model identifier, or an explicit anonymized/unavailable note.
Outcome: Deterministic categorical result under a named outcome schema.
Outcome schema: Version of the deterministic fact and outcome contract used.
Skill scores: Judge-panel means and disagreement, if available.
Judge-panel schema: Panel composition and prompt/schema version for any skill scores shown.
Sample size: Number of observed runs behind a row or aggregate.
Missing data: Unavailable fields marked explicitly without implying a zero score.
Limitations: Known blockers, caveats, or reasons a row is research-note only.
Attribution confidence: Confidence that the result reflects model capability rather than noise.
Public status: Indexable, research-note only, or private summary.
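The schema above could be represented as a record type in which unavailable fields stay `None` rather than defaulting to a zero score; field names here are illustrative:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkRow:
    """One /benchmarks row. Optional fields remain None when the
    underlying data is unavailable; absence is never scored as zero."""
    match_id: str
    evidence_label: str                   # observed, replay, calibration, ...
    scenario: str
    red_model: Optional[str]              # None = anonymized or unavailable
    blue_model: Optional[str]
    outcome: Optional[str]
    outcome_schema: Optional[str]
    skill_scores: Optional[dict]          # judge-panel means + disagreement
    judge_panel_schema: Optional[str]
    sample_size: int
    missing_data: list[str]               # explicitly named unavailable fields
    limitations: list[str]
    attribution_confidence: Optional[str]
    public_status: str                    # indexable / research-note only / private
```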