Did the multi-blue team outperform the single defender?

Not on response effectiveness. In the four clean cells, blue detection improved in three cells versus the Kings baseline, but blue response effectiveness improved only in aa and fell in bb, ab, and ba.

Is this a benchmark ranking?

No. This is a research note from four clean live cells and three dirty attempts. It is evidence about the measurement problem, not a model leaderboard.

The next work is to improve action gating, command policy, and responder handoff before spending on an 8- or 12-match mini-corpus.

The First Multi-Blue Pulse: Better Detection, Still Not Enough Defense // Purple Games

The first multi-blue pulse did not give me the clean victory story I wanted. It gave me a better research result: the AI defenders saw the attack, named the attack, produced a lot of evidence, and still mostly failed to defend.

That matters because Purple Games is not trying to prove that adding more agents magically makes defense better. The actual question is narrower and more useful: can an AI blue team listen, investigate, coordinate, and act under pressure better than one blue agent?

What we ran

The pulse compared a role-specialized blue topology against the Kings single-agent baseline on archetype-A, the cloud-native startup range. Blue was split into Watcher, Hunter, and Responder roles. The model grid mirrored the Kings cells: Opus 4.7 self-play, GPT-5.5 self-play, Opus red against GPT blue, and GPT red against Opus blue.

Four cells completed cleanly: MB101, MB102, MB104, and the later ba fill-in MB301. The dirty attempts stay in the record too: MB103 hit a provider context limit before the context guard existed, and MB105 and MB106 hit repeated Anthropic Haiku overloads while trying to fill ba. They are part of what live evaluation costs.

The result

The clean result is stark, but not simple. Across the four clean cells, multi-blue detection improved in three: +3.67 in aa, +2.33 in bb, and +4.50 in ab. The ba fill-in moved -0.33 on detection, so this is not a blanket detection win. It is evidence that role-specialized defense often raised signal.

But response did not follow detection. Blue response effectiveness moved only +0.17 in aa, then dropped -2.33 in bb, -1.50 in ab, and -4.67 in ba. MB301 is the sharpest failure: the judges scored blue response at 0.0 while red reached the protected objective.

Tracked match-loop API spend across the multi-blue pulse attempts was about $32.01, with the most expensive clean match at $12.74. That is low enough to keep iterating, but high enough that the next spend should buy a better action loop, not just more rows.

What this says

The early hypothesis is not that role-specialized defense failed. It is that role-specialized defense exposed the real bottleneck. Watchers can create signal. Hunters can classify signal. Responders can propose action. The system still needs to turn that chain into timely, safe containment before the attacker reaches the protected asset.

That is a better fellowship story than a premature win. It says the project has moved past toy scoring into a harder measurement problem: defensive agency is not one skill. It is a pipeline from observation to decision to action, and every handoff can fail.

What we are not claiming

This is not a claim that multi-agent defenders are worse than single defenders. It is not a claim that one frontier model is better than another. It is not production-readiness evidence. It is a four-clean-cell research note on one scenario family with provider and infrastructure failures disclosed.

The recommendation from the analysis gate is publish_pulse_failure_note. I agree with it, with one condition: the next data should come after improving action gating, command policy false positives, context overflow, and provider retry behavior. More matches without those fixes would mostly measure the same bottleneck again.

Why this still feels like progress

A defense benchmark should be able to publish an uncomfortable result. This one can. The agents did not win, but the artifact tells us where they broke: not at perception, but at coordinated response. That is exactly the kind of failure a real AI cyber-defense benchmark should make visible.

The First Multi-Blue Pulse: Better Detection, Still Not Enough Defense

claim boundary

methodology sources