Public snapshot: 9 real matches · Publication status: 8 scored research notes · Index-grade rows: 0

Purple Games

A defense-first benchmark for agentic cyber matches. The public surface starts with real AI-vs-AI match artifacts, labels every row by evidence strength, and keeps current results framed as research notes until index-grade methodology and replication are ready.

observed matches
9

8 scored public research-note rows; match 004 aborted/excluded

index-grade rows
0

No rows have cleared canonical bundle, redaction, mirrored-run, and uncertainty gates yet.
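The promotion rule above can be sketched in code: a row stays a research-note row unless every named gate (canonical bundle, redaction, mirrored run, uncertainty) has cleared. This is a minimal illustrative sketch, assuming a simple boolean-gate model; the class and field names are hypothetical, not the project's real schema.

```python
from dataclasses import dataclass

@dataclass
class MatchRow:
    # Hypothetical row model: each field mirrors one gate named in the text.
    match_id: str
    canonical_bundle: bool   # full artifact bundle archived
    redaction: bool          # sensitive details scrubbed
    mirrored_run: bool       # result replicated on a mirrored run
    uncertainty: bool        # uncertainty bounds reported

    def evidence_label(self) -> str:
        # A row is index-grade only when all four gates pass.
        gates = (self.canonical_bundle, self.redaction,
                 self.mirrored_run, self.uncertainty)
        return "index-grade" if all(gates) else "research-note"

# One failed gate (mirrored run) keeps the row at research-note strength.
row = MatchRow("match-001", True, True, False, True)
print(row.evidence_label())  # research-note
```

Under this model, the current corpus has zero index-grade rows because no row has cleared all four gates.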

judge panel
3

claude-opus-4.7 + gpt-5.4 + gemini-2.5-pro

evidence boundary

Purple Games can show observed match notes, scoring design, and methodology status today. It does not present model rankings, production-readiness claims, or index-grade rows from the current corpus.

The launch snapshot uses 8 scored public research-note rows from 9 real matches; match 004 aborted/excluded.

Cyber Defense

Detection, containment, cost control, and incident response measured against observed match artifacts.

Cyber Offense

Recon, exploit pathfinding, cloud awareness, tool use, and attacker progress kept visible as context for defense.

GAMES Framework

Generative Agentic Model Evaluation for Security: evidence labels, scoring boundaries, confidence, and match mechanics.