Certification Vision
GAMES Certified Defender v1.0 is the destination — a formal certification program for AI-driven cyber defenders, analogous to MITRE ATT&CK Evaluations but for agentic defense under matched conditions. This page sketches direction. Today we are at v0.1.
Phase D is six to twelve months from Phase C close. Tier thresholds are intentionally not yet specified — they need empirical calibration from Phase B. See ROADMAP.md for the full phasing and the canonical framework spec at docs/specs/games-framework-v1.md.
The gap
Public AI cyber evaluation today measures attackers, not defenders under matched conditions. Defender capability has no public certification framework. That asymmetry is the project's reason to exist.
- AISI Mythos eval, METR, and Apollo publish offensive-side capability evaluations.
- AIUC-1 certifies organizations that use agentic AI; the question is governance, not capability.
- NIST CSF and NIST AI RMF are program-side and risk-management frameworks; they do not test capability.
- MITRE ATT&CK Evaluations test EDR products against scripted human-led adversary emulation — capability-side, but not AI-defender-versus-AI-attacker.
- GAMES fills the remaining gap: the defensive complement to AISI on the AI-vs-AI axis.
The four-phase ladder
From research note to certification program in four phases. Each phase ends with a deliverable that stands on its own; later phases require earlier ones to land first.
- Phase A — Pilot foundation (in progress): validated methodology, first publishable corpus, canonical framework spec.
- Phase B — Pilot: multi-archetype, multi-model, calibrated panel. The first credible reference.
- Phase C — Benchmark Grade: public reproducibility kit, pinned reference panel, community submission.
- Phase D — Certification: formal program, audit log, tier scheme, public results registry.
Tier preview
Tier definitions defer to the future certification spec at docs/specs/games-certification-v1.md. The thresholds below are the design intent; they require Phase B empirical calibration before they become binding. A sketch of how the ladder might be encoded follows the list.
- Bronze: detect-only baseline on archetype A.
- Silver: detect and respond on archetype A.
- Gold: detect, respond, and evict across at least three archetypes (A, B, F) with prevention rate above an empirically calibrated floor.
- Platinum: ad-hoc multi-archetype, cross-model panel, reserved for the actual North-Star benchmark candidates.
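One way to make the design intent legible is to encode the ladder as data. The sketch below is hypothetical Python; the TierSpec name and its fields are inventions for illustration, and the empirical floor is deliberately left unset pending Phase B calibration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TierSpec:
    """Hypothetical encoding of one tier; all names are illustrative."""
    name: str
    capabilities: tuple[str, ...]   # detect / respond / evict
    archetypes: tuple[str, ...]     # scenario archetypes the tier must cover
    prevention_rate_floor: Optional[float] = None  # unset until Phase B calibrates it

TIERS = (
    TierSpec("bronze",   ("detect",),                    ("A",)),
    TierSpec("silver",   ("detect", "respond"),          ("A",)),
    TierSpec("gold",     ("detect", "respond", "evict"), ("A", "B", "F")),
    # Platinum is panel-defined rather than threshold-defined; "*" stands in
    # for the ad-hoc multi-archetype selection.
    TierSpec("platinum", ("detect", "respond", "evict"), ("*",)),
)
```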
Auditable and AI-accessible
A defender certification only matters if the artifacts it produces can be verified independently. Four design constraints flow from that; illustrative sketches of several follow the list.
- Machine-parseable certification artifact: a certification_record.json schema extends the existing public_ingestion_record.json. AI agents can consume cert results the same way humans do.
- Deterministic verification: programmatic facts (Outcome enum) provide a deterministic backbone. Panel scores are reproducible via pinned model versions, frozen prompts, and temperature zero.
- Public test battery: scenario library with versioned manifests. Anyone can run the battery; the canonical scoring runs against canonical inputs.
- Open reproducibility: Packer images, Terraform modules, and run scripts. Reproduce the matches in your own GCP project and get the same numbers within an empirically calibrated tolerance.
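To make the first constraint concrete, here is a hedged sketch of what a certification_record.json might contain. That it extends public_ingestion_record.json comes from the design above; every field name is an illustrative assumption, not the real schema. The scenario_manifest block also shows how the versioned test-battery manifest could be referenced.

```python
import json

# Hypothetical shape of a certification_record.json. The extension
# relationship to public_ingestion_record.json is from the design above;
# the field names are illustrative assumptions, not the real schema.
record = {
    "schema": "certification_record.json",
    "extends": "public_ingestion_record.json",
    "defender": {"model": "example-defender", "version": "2025-01-01"},
    "tier": "bronze",
    "scenario_manifest": {          # versioned manifest from the test battery
        "id": "archetype-A",
        "version": "0.1.0",
        "sha256": "<content hash>",
    },
    "outcomes": ["DETECTED"],       # programmatic Outcome enum values
    "panel": {
        "models": ["pinned-judge@2025-01-01"],
        "prompt_sha256": "<hash of frozen prompt>",
        "temperature": 0.0,
    },
}
print(json.dumps(record, indent=2))
```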
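For the deterministic backbone, a minimal sketch under the stated design: the document names an Outcome enum and pinned, temperature-zero panel scoring, but the enum members and config fields below are assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class Outcome(Enum):
    """Programmatic facts about a match; member names are illustrative."""
    MISSED = "missed"
    DETECTED = "detected"
    RESPONDED = "responded"
    EVICTED = "evicted"

@dataclass(frozen=True)
class PanelConfig:
    """Everything a panel score depends on, frozen before any run."""
    model: str             # pinned model version, e.g. "judge@2025-01-01"
    prompt_sha256: str     # hash of the frozen judge prompt
    temperature: float = 0.0  # temperature zero, per the design above

def outcomes_match(canonical: Outcome, reproduced: Outcome) -> bool:
    # Outcomes are deterministic facts, so verification is exact equality.
    return canonical is reproduced
```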
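And for the reproducibility constraint, a sketch of the acceptance check a rerun might apply: exact equality on deterministic outcomes, a tolerance band on panel scores. The function name is hypothetical, and the tolerance value is left as a parameter because the document reserves it for empirical calibration.

```python
def panel_scores_match(canonical: list[float], reproduced: list[float],
                       tolerance: float) -> bool:
    """Accept a rerun when every panel score lands within tolerance.

    `tolerance` deliberately has no default: the document leaves its
    value to empirical calibration.
    """
    return (len(canonical) == len(reproduced)
            and all(abs(a - b) <= tolerance
                    for a, b in zip(canonical, reproduced)))
```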
Honest caveat
Today the corpus sits at Pilot at best, and more honestly at Exploratory. There is no certification program yet. There may not be one for a year.
- The path from v0.1 (today) to v1.0 (certification ready) is the project, not a side ambition.
- Each phase, A through D, ships a standalone deliverable that earns the next.
- If the project halts at any phase, what shipped before still has value as an open methodology contribution.