To be honest with you, I spent part of this week learning that my scoring system was too dumb for the thing I was asking it to judge.
That sounds bad. It is bad. It is also exactly why Purple Games exists. If we already knew how to measure agentic cyber defense cleanly, I would not need to build this weird little lab where AI attackers and AI defenders keep embarrassing my assumptions in public.
The first version of the benchmark was predicate-heavy. Did red reach a stage? Did blue run a containment command? Did the match end with this exact event? It felt responsible because predicates are deterministic. They do not get moody. They do not wake up one morning and decide a mediocre defense was secretly brilliant.
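For a sense of what that looked like, here is a rough sketch of that layer. The event shapes and predicate names below are invented for this post, not the real match schema, but the spirit is accurate: deterministic checks over an event stream, and nothing else.

```python
# Toy events in a made-up shape; the real schema is richer than this.
sample_events = [
    {"type": "stage_reached", "actor": "red", "stage": "foothold"},
    {"type": "command", "actor": "blue", "tags": ["containment"], "cmd": "isolate-host"},
]

def red_reached_stage(events, stage):
    # Deterministic fact: did red log this stage transition?
    return any(e["type"] == "stage_reached" and e.get("stage") == stage for e in events)

def blue_ran_containment(events):
    # Deterministic fact: did blue run anything tagged as containment?
    return any(
        e["type"] == "command" and e["actor"] == "blue" and "containment" in e.get("tags", [])
        for e in events
    )

# Reproducible and never moody, but blind to any defensive move
# nobody wrote a predicate for ahead of time.
print(red_reached_stage(sample_events, "foothold"))  # True
print(blue_ran_containment(sample_events))           # True
```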
But cyber defense is not just a checklist. A good defender does not always look like a good defender if your scoring system only understands the moves you predicted ahead of time.
The Match That Broke The Rubric
One of the early matches had a blue agent do something genuinely interesting. Instead of just blocking traffic or yelling into the void, it treated the environment like an incident. It preserved the real asset, substituted a canary value, and checked whether the attacker would take the bait.
That is defense. That is not theater. That is the kind of thing I want security teams to study because it looks a lot closer to how a real incident response plan might evolve when agents can reason, test, and adapt inside the environment.
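If you want the shape of the move, here is a toy version. Nothing below is the blue agent's actual code, and every name and value is invented, but the idea is the same: stash the real asset, plant a canary in its place, and check the evidence for whoever touches the fake.

```python
import secrets

def plant_canary(asset_store, asset_key):
    # Preserve the real value out of band, then substitute a canary.
    real_value = asset_store[asset_key]
    canary = f"canary-{secrets.token_hex(8)}"
    asset_store[asset_key] = canary
    return real_value, canary

def attacker_took_bait(events, canary):
    # Evidence-grounded check: did the canary show up in anything red touched?
    return any(e["actor"] == "red" and canary in e.get("data", "") for e in events)

store = {"db_password": "real-secret"}
real_value, canary = plant_canary(store, "db_password")
events = [{"actor": "red", "type": "read", "data": canary}]
print(attacker_took_bait(events, canary))  # True: the bait was taken
```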
And the old rubric basically shrugged. It did not know what to do with the move because I had not taught it that this particular move existed. The defender did the smart thing and the rules were sitting there with their pants on backwards.
That was the moment. Predicates can catch facts. Predicates can catch events. Predicates can catch whether a known thing happened. But predicates cannot grade strategy by themselves.
Why The Judge Panel Exists
So Purple Games now uses a hybrid model. The deterministic layer still matters. It matters a lot. If red actually reads the protected value, that should be grounded in the event stream. If blue actually contains the attacker, that should be grounded in evidence. I do not want a model inventing a heroic defender story because the transcript had good vibes.
But above that floor, we need judges. Independent judges. Different model families. A fixed rubric. Red methodology. Blue detection. Blue response effectiveness. Creativity. Each judge scores the match without seeing the other judges' cards, then the benchmark records the mean and the disagreement.
It is basically boxing-style judging for AI cyber matches. Not perfect. Not magic. But a lot more honest than pretending a few hardcoded rules can understand every defensive idea an agent might discover.
The disagreement matters too. If the judges split, that is not just noise. That is research data. Maybe the rubric is vague. Maybe the play was genuinely hard to evaluate. Maybe the environment failed in a way that made the result weird. Good. Surface that. Do not hide it behind a fake clean number.
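Here is a minimal sketch of how that aggregation can work. The judges, rubric dimensions, and scores below are invented for illustration; the point is only that the panel records both a mean and a disagreement number per dimension.

```python
from statistics import mean, pstdev

# Each judge scores the same match on the same fixed rubric,
# without seeing the other judges' scorecards.
scorecards = {
    "judge_a": {"red_methodology": 7, "blue_detection": 8, "blue_response": 9, "creativity": 8},
    "judge_b": {"red_methodology": 6, "blue_detection": 8, "blue_response": 6, "creativity": 9},
    "judge_c": {"red_methodology": 7, "blue_detection": 7, "blue_response": 8, "creativity": 7},
}

def aggregate(cards):
    # Record both the mean and the disagreement (spread) per rubric dimension.
    dims = next(iter(cards.values())).keys()
    return {
        dim: {
            "mean": mean(card[dim] for card in cards.values()),
            "disagreement": pstdev(card[dim] for card in cards.values()),
        }
        for dim in dims
    }

for dim, stats in aggregate(scorecards).items():
    # High disagreement is a flag to go read the match, not noise to hide.
    print(dim, stats)
```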
Why This Is Defense-First
Everybody wants to watch the attacker. I get it. Offense is cinematic. Offense has the cool verb energy. Scan. Exploit. Pivot. Exfiltrate. Defense gets stuck with the boring verbs. Watch. Triage. Contain. Recover. Write the report. Drink the bad coffee.
But the world does not need another leaderboard that says which model can act like a pentester in a toy environment. The world needs to know whether AI defenders can notice trouble, communicate, use an incident response plan, avoid making things worse, and buy humans time.
That is the real question for me. Can an agentic blue team behave like a defense organization instead of one lonely chatbot with a terminal? Can it have a watcher, a hunter, a responder, and eventually a commander? Can it use memory without hallucinating a fake incident history? Can it follow a playbook without becoming slow as hell?
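Just to make that question concrete, here is a purely illustrative sketch of the role split. None of this exists as real code yet; the names and responsibilities are my assumptions, not a spec.

```python
from dataclasses import dataclass, field

@dataclass
class BlueRole:
    name: str
    responsibility: str
    playbook_steps: list = field(default_factory=list)

# Hypothetical roles; the real agents may end up split very differently.
blue_team = [
    BlueRole("watcher", "monitor telemetry and raise alerts"),
    BlueRole("hunter", "chase alerts and form hypotheses about red"),
    BlueRole("responder", "contain, preserve evidence, recover",
             playbook_steps=["isolate host", "rotate credentials", "snapshot disk"]),
    BlueRole("commander", "coordinate, keep the incident record, decide when to escalate"),
]

for role in blue_team:
    print(role.name, "->", role.responsibility)
```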
I do not know yet. That is the honest answer. We have research notes. We have real match artifacts. We have bugs. We have a scoring method that is getting less embarrassing every week. We do not have index-grade claims yet.
What Purple Games Will Publish
The public site is going to separate stories from rankings. A single match can be a story. It can be a lesson. It can show how a defender failed or surprised us. It cannot prove one model is better than another. Not by itself.
That distinction matters because AI benchmarking gets weird fast. You can make a chart look authoritative before the evidence deserves it. You can publish a leaderboard before you understand the variance. You can mistake one dramatic run for a property of the model. I am trying very hard not to do that. I will probably still mess something up. Then we will fix it in public.
For now, the first public layer is simple: explain the methodology, publish research-note articles, show what evidence exists, label what is not ready, and build toward repeatable matches with confidence scores that do not insult everyone's intelligence.
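To be concrete about what a confidence score could even mean here, a toy version looks like this. The numbers are invented; the idea is just to repeat the same matchup and report a mean with an interval instead of one dramatic run.

```python
from math import sqrt
from statistics import mean, stdev

# Same matchup, repeated runs. These scores are made up for the example.
blue_scores = [6.8, 7.4, 5.9, 7.1, 6.5, 7.0, 6.2, 6.9]

m = mean(blue_scores)
se = stdev(blue_scores) / sqrt(len(blue_scores))   # standard error of the mean
low, high = m - 1.96 * se, m + 1.96 * se           # rough 95% interval

print(f"blue response effectiveness: {m:.2f} "
      f"(95% CI {low:.2f} to {high:.2f}, n={len(blue_scores)})")
```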
Predicates still matter. Judges matter. Real infrastructure matters. Red team matters. But the point is blue. The point is defense catching up.
That is Purple Games. A little messy. A little too ambitious. Probably exactly the right kind of problem to be messy about.