Public snapshot: 9 real matches. Publication status: 8 scored research notes. Index-grade rows: 0.
Methodology

GAMES Framework

GAMES stands for Generative Agentic Model Evaluation for Security. It is the Purple Games research framework for measuring how AI agents attack, defend, detect, and recover in repeatable cyber scenarios.

v0.1 public methodology

primary question

What does a cyber match prove, and what does it merely illustrate?

research posture

GAMES keeps offense visible so defense can be measured with context, not as isolated alert text.

public limit

The framework supports transparent research notes today and stricter benchmark claims only after indexability review.

source boundary

Reflects the approved Light Research Atlas launch direction. Current evidence labels remain conservative, and current rows stay research-note only.

What the framework measures

Purple Games runs attacker-vs-defender matches in sealed cyber ranges. Defense is the primary goal, and the framework treats blue as an organization of cooperating roles rather than a single operator. Offensive progress remains visible so defensive results have context. The three measurement axes are listed below, with a sketch of a per-match record after the list.

  • Red-side capability: recon quality, exploit pathfinding, cloud or identity awareness, tool use, and progress through friction.
  • Blue-side capability: detection quality, response effectiveness, containment, cost control, reliability, coordination, and defensive creativity.
  • Safety behavior: whether agents stay inside range boundaries and whether public artifacts can be redacted without losing scientific meaning.
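
To make the three axes concrete, the sketch below shows one way a per-match scorecard could be recorded. It is a minimal Python illustration; every class and field name is an assumption for this page, not the framework's actual schema.

    from dataclasses import dataclass

    @dataclass
    class RedScore:                    # offensive capability per match
        recon_quality: float
        exploit_pathfinding: float
        cloud_identity_awareness: float
        tool_use: float
        progress_through_friction: float

    @dataclass
    class BlueScore:                   # defensive capability per match
        detection_quality: float
        response_effectiveness: float
        containment: float
        cost_control: float
        reliability: float
        coordination: float
        defensive_creativity: float

    @dataclass
    class SafetyScore:                 # boundary and publishability checks
        stayed_inside_range: bool
        redactable_without_losing_meaning: bool

    @dataclass
    class MatchScorecard:              # hypothetical per-match record
        match_id: str
        red: RedScore
        blue: BlueScore
        safety: SafetyScore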

Agentic blue-team architecture

The blue-team framework is organized as an agent-readable incident response loop. A match begins with an IRPlan; the framework binds that plan to available defensive skills, assigns fast-path roles, shares case state on a blackboard, emits markers, and finally turns the preserved record into GAMES metrics and indexability decisions. A sketch of the loop follows the list below.

  • IRPlan -> Skill Catalog -> Role Coordinator -> Blackboard/Case Memory -> Marker Stream -> GAMES Metrics -> Scoring/Indexability.
  • Watcher, Hunter, and Responder are fast-path roles for monitoring, investigation, and bounded response actions.
  • Commander or Incident Case Manager logic is deterministic and advisory by default, adding review for high blast-radius actions without slowing routine detection and triage paths.
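
A minimal, hypothetical sketch of that loop in Python follows. Every name here (IRPlan, CaseMemory, run_match, the role mapping) is an illustrative assumption, not the framework's published API.

    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class IRPlan:
        version: str
        steps: list[str]               # ordered, named response steps

    @dataclass
    class CaseMemory:                  # the shared blackboard
        facts: dict = field(default_factory=dict)
        markers: list = field(default_factory=list)

    Skill = Callable[[CaseMemory, dict], dict]

    def run_match(plan: IRPlan,
                  catalog: dict[str, Skill],
                  role_of: dict[str, str],     # step -> watcher|hunter|responder
                  events: list[dict]) -> dict:
        memory = CaseMemory()
        # Skill Catalog stage: bind the IRPlan to available defensive skills.
        bound = [(s, catalog[s]) for s in plan.steps if s in catalog]
        for event in events:
            for step, skill in bound:
                marker = skill(memory, event)            # a fast-path role acts
                marker.update(plan_version=plan.version,
                              role=role_of.get(step, "watcher"))
                memory.markers.append(marker)            # Marker Stream stage
        # Scoring stage: reduce the preserved record into GAMES metrics.
        detected = sum(m.get("type") == "evidence.detected" for m in memory.markers)
        applied = sum(m.get("type") == "defense.action_applied" for m in memory.markers)
        return {"detections": detected, "actions_applied": applied,
                "indexable": False}    # research-note only until review passes

    def demo_skill(memory: CaseMemory, event: dict) -> dict:
        # Toy Watcher skill: flag every raw event as detected evidence.
        return {"type": "evidence.detected", "event_ref": event["seq"]}

    plan = IRPlan(version="0.1", steps=["triage"])
    result = run_match(plan, {"triage": demo_skill}, {"triage": "watcher"},
                       [{"seq": 1}, {"seq": 2}])
    # result == {"detections": 2, "actions_applied": 0, "indexable": False}

The point of the sketch is the ordering, not the implementation: planning and binding happen before any event is processed, and metrics are computed only from the preserved marker record, never from live state.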

Evidence artifacts

A match is more than a transcript. Public research evidence needs structured data that can be queried, replayed, and audited without re-reading every line. Each match should leave behind measured artifacts, not only an after-action narrative.

  • Core artifacts include event logs, transcripts, range metadata, IRPlan versions, role assignments, memory snapshots, budget records, programmatic facts, outcomes, metrics, and panel verdicts.
  • Marker vocabulary includes range.health, range.failure, evidence.detected, defense.action_proposed, defense.action_applied, defense.action_effect, attack.stage_reached, scoring.fact, and benchmark.metric.
  • Markers should be typed, schema-versioned, timestamped, side-aware where relevant, and linked back to raw events or transcript sequence numbers, as in the sketch below.
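
As a minimal sketch, a marker meeting those requirements could be carried as a small typed record. Only the marker types come from the vocabulary above; the field names here (schema_version, event_ref, payload, and so on) are assumptions for illustration.

    from dataclasses import dataclass, asdict
    from datetime import datetime, timezone

    @dataclass
    class Marker:
        type: str             # e.g. "evidence.detected", "attack.stage_reached"
        schema_version: str   # markers are schema-versioned
        ts: str               # ISO-8601 timestamp
        side: str | None      # "red", "blue", or None where side is irrelevant
        event_ref: str        # link to a raw event or transcript sequence number
        payload: dict         # type-specific fields

    m = Marker(
        type="defense.action_applied",
        schema_version="0.1",
        ts=datetime.now(timezone.utc).isoformat(),
        side="blue",
        event_ref="transcript:seq-1042",
        payload={"action": "isolate_host", "target": "range-host-07"},
    )
    print(asdict(m))   # the queryable record, ready to store or replay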

Claim boundary

The framework is designed to make benchmark claims possible later without overstating what current artifacts prove now. A toy publication gate after the list below makes the boundary concrete.

  • Current public rows are research-note only unless and until they pass indexability review.
  • Simulation can support scenario design, balance testing, replay, visualization, and forecasting, but not leaderboard rows.
  • Model-vs-model comparisons need mirrored real-infrastructure evidence and visible uncertainty.
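
The toy gate below makes the boundary explicit. The predicate names (simulated, passed_indexability_review, and the rest) are assumptions, not the real review checklist; the tiers, though, mirror the rules in the list above.

    def publication_tier(row: dict) -> str:
        """Strongest public claim a result row can support."""
        if row.get("simulated"):
            return "design/replay only"   # simulation never yields leaderboard rows
        if row.get("passed_indexability_review"):
            return "index-grade"
        return "research-note"            # the default for all current rows

    def can_compare_models(row_a: dict, row_b: dict) -> bool:
        # Model-vs-model claims need mirrored real-infrastructure evidence
        # and visible uncertainty on both sides.
        return (not row_a.get("simulated") and not row_b.get("simulated")
                and row_a.get("range_id") is not None
                and row_a.get("range_id") == row_b.get("range_id")
                and "uncertainty" in row_a and "uncertainty" in row_b)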
