Methodology

Research documentation for the Light Research Atlas. Purple Games measures LLM attacker-vs-defender cyber matches with defense as the primary goal, then separates observed facts, judge-panel interpretations, confidence, and public claim boundaries.

public status methodology in progress

current evidence

The public corpus is real-match evidence, but current rows remain research notes until canonical bundles, redaction, and replication gates are complete.

scoring boundary

Programmatic facts determine outcome categories. Judge-panel scores describe side behavior and uncertainty; they do not decide winners.

claim posture

Methodology pages explain how claims become possible later. They do not convert current research-note rows into model rankings.

public claim boundary

The public site can describe the research program, the current scoring design, and known limitations. It cannot turn forecasts, calibration runs, private artifacts, or research-note rows into leaderboard claims. If a sentence could be read as a model ranking, it needs indexable observed evidence and visible uncertainty.

current rows

Rows on /benchmarks are research-note only while canonical bundles, redaction, and indexability hardening remain in progress.

methodology index

GAMES FrameworkGenerative Agentic Model Evaluation for Security: a defense-first research frame for repeatable AI cyber matches, agent-readable plans, evidence artifacts, and public claim boundaries.v0.1 direction Match MechanicsHow real matches move from setup through live play, referee events, scoring artifacts, confidence review, and benchmark feeds.methodology in progress ScoringHybrid scoring separates deterministic facts from rubric-bound judge-panel skill scores.schema v1 ConfidenceConfidence labels describe how much evidence supports a comparison before any ranking language is allowed.methodology in progress Scenario FamiliesReusable range archetypes define the environment shapes used to stress red and blue capabilities.methodology in progress Agentic Incident ResponseHow Purple Games evaluates an agentic blue-team organization across detection, triage, containment, recovery, memory, coordination, and measured artifacts.methodology in progress

evidence labels

label	meaning	public index status	table rule
Observed	A live match run through the normal harness path, with preserved event logs and scoring artifacts.	Eligible only after scenario, harness, ground truth, scoring, costs, model identifiers, and redaction gates pass.	Can support public benchmark rows when the full artifact chain is indexable.
Replay	A deterministic re-run of scoring, audit, visualization, or extraction over preserved artifacts.	May update derived fields on an existing observed row, but does not create new agent-play evidence.	Must name the replayed schema or policy version when it changes a field.
Calibration	A run used to test harness behavior, marker contracts, judge prompts, scenario health, or scorer drift.	Not indexable as model performance. Useful as methodology evidence.	Can explain why a scorer, range, prompt, or marker changed.
Forecast	A projected result, expected curve, planning estimate, simulation, or hypothesis.	Never indexable. Must be visually and textually distinct from measured evidence.	Cannot support rows, rankings, or measured-performance language.
Methodology in progress	A design, policy, rubric, prompt, or analysis rule that is not frozen for a published result.	Public as process documentation, not as settled benchmark evidence.	Can explain intent and current controls, not final model-performance claims.

artifact lifecycle

step	artifact output	public gate
Live match	Event log, transcript, range metadata, model and budget records, cost data, and initial scorer outputs.	The competitive window must have intended side isolation, required markers, and no critical operator intervention.
Local validation	Range-health checks, ground-truth reconstruction, referee version, scenario version, and required marker coverage.	Invalid range state or contaminated repairs keep the artifact out of benchmark indexes.
Redaction	Public-safe transcript or bundle with stable placeholders for secrets, hosts, accounts, and infrastructure identifiers.	Redaction must remove unsafe operational detail while preserving causality.
Indexability review	A public status decision: indexable, research-note only, or private until fixed or redacted.	Research notes can support methodology claims, but not model rankings.
Replay or backfill	Derived facts recomputed under a named schema, prompt, panel, or extraction policy.	Replay-derived changes must be labeled and versioned next to the affected field.

scoring schema

Programmatic facts

Facts are reproducible fields derived from preserved bus events. They are string comparisons, presence checks, counts, durations, costs, or other deterministic predicates.

Examples include red_read_real_flag, blue_canary_substitution, red_exfil_canary, blue_contained_red_source, commands executed, duration, and dollars spent.
Facts decide the categorical outcome under a named outcome schema.
Facts stay separate from skill interpretation so the match outcome does not drift when a judge model changes.

Judge-panel skill scores

A three-family judge panel scores the transcript and programmatic facts on a 0-10 rubric. The panel does not pick the winner.

Dimensions are red_methodology, blue_detection, blue_response_effectiveness, and creativity.
Each dimension reports a mean score and disagreement signal.
Judges score the side, not the model, and model identifiers are scrubbed from judge-facing transcript payloads.

Known scoring limits

Judge-panel scores are rubric-bound research measurements, not ground truth. Purple Games treats disagreement and drift as data to report, not noise to hide.

Known risks include self-preference, length bias, apparent-winner bias, provider nondeterminism, and scorer drift.
Prompt or panel edits require a schema version bump and replay notes.
Small samples remain match stories or research notes until mirrored evidence narrows uncertainty.

indexability and public claims

Indexability rules

Observed rows need match.start and match.end, identifiable scenario and harness versions, preserved event logs, reproducible facts, and public-safe redaction.
Calibration, forecast, contaminated, private, or incomplete artifacts cannot become model-performance rows.
Unknown cells remain unavailable. They are not filled with expected values, illustrative placeholders, or synthetic representatives.
Current benchmark rows are research-note only while canonical bundles, redaction, and indexability hardening are still in progress.

Public claim boundaries

Allowed: Purple Games is building an open methodology for LLM attacker-vs-defender cyber matches.
Allowed: the scoring design separates deterministic facts from interpretive skill scores.
Allowed: a historical artifact can illustrate a methodology issue when its limitation is stated near the claim.
Not allowed: model rankings from research notes, calibration runs, private runs, forecasts, or single unmirrored matches.
Not allowed: claims that a defender is production-ready because it performed well in Purple Games.

safety and redaction

Remove credentials, tokens, API keys, cookies, SSH keys, session identifiers, and provider resource IDs.
Transform real IP addresses, hostnames, account names, email addresses, billing metadata, and cloud project identifiers unless intentionally synthetic.
Paraphrase exploit strings or command sequences when the exact text would materially improve misuse outside the benchmark.
Use stable placeholders such as REDACTED_TOKEN_1 or SCENARIO_HOST_A so readers can follow causality.
Review screenshots and videos frame by frame for terminal history, dashboards, logs, browser bars, secrets, and account metadata.

benchmark table schema

column	purpose
Match ID	Stable artifact reference for a live match or public research note.
Evidence label	Observed, Replay, Calibration, Forecast, or Methodology in progress. Replay-derived fields name the replay schema next to the field.
Scenario	Scenario or reusable archetype name, with version when available.
Red model	Attacker model identifier or a stated anonymization/unavailable note.
Blue model	Defender model identifier or a stated anonymization/unavailable note.
Outcome	Deterministic categorical result under a named outcome schema.
Outcome schema	Version of the deterministic fact and outcome contract used.
Skill scores	Judge-panel means and disagreement, if available.
Judge-panel schema	Panel composition and prompt/schema version for any skill scores shown.
Sample size	Number of observed runs behind a row or aggregate.
Missing data	Unavailable fields marked explicitly without implying a zero score.
Limitations	Known blockers, caveats, or reasons a row is research-note only.
Attribution confidence	Confidence that the result reflects model capability rather than noise.
Public status	Indexable, research-note only, or private summary.