Agentic Incident Response
Agentic Incident Response evaluates a blue-team organization, not a lone operator: coordinated defender roles follow agent-readable plans, share case memory, take bounded action, and leave measured artifacts inside a repeatable range.
Source boundary
This page is a public methodology surface only. It does not claim current incident-response rankings.
Organizational model
Purple Games treats defense as a small operating system for incident response. Roles can specialize, coordinate, and preserve context while the benchmark records whether that organization actually improved detection and response.
- Watcher monitors telemetry and raises grounded signals early enough to matter.
- Hunter investigates hypotheses across alerts, logs, host state, identity activity, and range ground truth.
- Responder proposes or applies bounded controls, tracks consequences, and avoids broad actions when evidence is thin.
- Commander (or Incident Case Manager) logic remains deterministic and advisory unless a high-blast-radius action needs an explicit gate.
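The split between advisory review and an explicit gate can be pictured as a small deterministic rule. The sketch below is illustrative only: the `ProposedAction` fields, the `Decision` values, and both thresholds are assumptions made for this example and are not defined by Purple Games.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ADVISE = "advise"        # routine action: the commander only records advice
    GATE_REQUIRED = "gate"   # broad or weakly supported action: explicit approval needed


@dataclass(frozen=True)
class ProposedAction:
    name: str
    targets: int             # hypothetical: number of hosts or accounts affected
    evidence_count: int      # hypothetical: number of supporting evidence pointers


# Hypothetical thresholds; the methodology does not publish numeric limits.
MAX_UNGATED_TARGETS = 3
MIN_EVIDENCE = 1


def commander_review(action: ProposedAction) -> Decision:
    """Deterministic rule: the same input always yields the same decision."""
    too_broad = action.targets > MAX_UNGATED_TARGETS
    thin_evidence = action.evidence_count < MIN_EVIDENCE
    if too_broad or thin_evidence:
        return Decision.GATE_REQUIRED
    return Decision.ADVISE
```

Under these assumed thresholds, isolating one host with two evidence pointers stays advisory, while disabling fifty accounts, or acting with no evidence at all, is routed to a gate.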
Agent-readable incident plans
An IRPlan is the public-facing contract for what a defender organization is trying to do. It turns scenario objectives, authority limits, escalation gates, and evidence requirements into a plan that agents and scorers can both read.
- IRPlan defines phases such as watch, investigate, contain, recover, communicate, and review.
- Skill Catalog entries describe available defensive procedures, required inputs, expected outputs, safety gates, and marker obligations.
- Role Coordinator assigns work to Watcher, Hunter, and Responder paths while keeping routine work fast and reserving deterministic review for high-risk actions.
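One way to picture how an IRPlan, a Skill Catalog entry, and the Role Coordinator fit together is the minimal sketch below. The phase names come from the list above; every class, field, and method name is a hypothetical chosen for illustration, not a published schema.

```python
from dataclasses import dataclass, field

# Phase names taken from the IRPlan description above.
PHASES = ("watch", "investigate", "contain", "recover", "communicate", "review")


@dataclass(frozen=True)
class Skill:
    """Hypothetical Skill Catalog entry; field names are illustrative."""
    name: str
    required_inputs: tuple[str, ...]
    expected_outputs: tuple[str, ...]
    safety_gate: bool                     # True if deterministic review is required
    marker_obligations: tuple[str, ...]   # markers the skill must emit


@dataclass
class IRPlan:
    objective: str
    phases: tuple[str, ...] = PHASES
    catalog: dict[str, Skill] = field(default_factory=dict)

    def assign(self, role: str, skill_name: str, phase: str) -> dict:
        """Role Coordinator sketch: route work and flag gated skills for review."""
        if phase not in self.phases:
            raise ValueError(f"unknown phase: {phase}")
        skill = self.catalog[skill_name]
        return {
            "role": role,
            "phase": phase,
            "skill": skill.name,
            "needs_review": skill.safety_gate,
            "markers": list(skill.marker_obligations),
        }


plan = IRPlan(objective="contain credential misuse on a single host")
plan.catalog["isolate_host"] = Skill(
    name="isolate_host",
    required_inputs=("host_id", "evidence_pointer"),
    expected_outputs=("isolation_status",),
    safety_gate=True,
    marker_obligations=("defense.action_proposed", "defense.action_applied"),
)
task = plan.assign("Responder", "isolate_host", "contain")
```

Here the resulting `task` carries `needs_review=True`, so a coordinator built this way would hold the action for deterministic review while ungated skills proceed directly.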
Evaluation surface
Incident-response scenarios ask whether the defender organization can build a grounded picture of the incident and take useful action without damaging the range or hiding evidence.
- Detection: find relevant signals early enough to matter.
- Triage: connect alerts, logs, host state, identity events, and known ground truth into a coherent hypothesis.
- Containment: apply targeted controls that disrupt the actual red path rather than generic measures applied for show.
- Recovery: restore trustworthy service state or preserve a safe degraded mode when restoration is out of scope.
- Communication: record what happened, what changed, and what remains uncertain.
Memory boundaries
Agentic defense needs memory, but public methodology has to distinguish useful state from unsafe disclosure. Purple Games records memory categories as artifacts while redacting material that would expose secrets or misuse-ready details.
- Match working memory captures the current hypotheses, assigned tasks, evidence pointers, open questions, and action gates for a live match.
- Case memory preserves the incident timeline, decisions, applied controls, unresolved risks, and final case summary.
- Lessons memory stores reusable defensive lessons after review.
- Environment memory tracks stable scenario context such as normal services, expected telemetry, and approved tools.
- Public artifacts do not expose secrets, credentials, unredacted exploit payloads, private chain-of-thought, or sensitive infrastructure identifiers.
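As a rough illustration of the redaction boundary, the sketch below turns a case-memory record into a public artifact. The key names in `REDACTED_KEYS` and the secret-matching pattern are assumptions for this example; a real pipeline would need a reviewed and much broader rule set.

```python
import re

# Hypothetical field names for material that must never appear in public artifacts.
REDACTED_KEYS = {"credentials", "exploit_payload", "chain_of_thought", "infrastructure_ids"}

# Hypothetical pattern for secret-looking inline values such as "password=...".
SECRET_PATTERN = re.compile(r"\b(password|token|api[_-]?key)\s*[=:]\s*\S+", re.IGNORECASE)


def redact(record: dict) -> dict:
    """Return a copy of the record that is safe to publish under the rules above."""
    public = {}
    for key, value in record.items():
        if key in REDACTED_KEYS:
            public[key] = "[REDACTED]"
        elif isinstance(value, str):
            # Keep the causal summary, but mask inline secrets.
            public[key] = SECRET_PATTERN.sub(
                lambda m: f"{m.group(1)}=[REDACTED]", value
            )
        else:
            # Non-string values pass through unchanged in this sketch.
            public[key] = value
    return public
```

The sketch keeps the key of a redacted field so a reader can see that the category existed, which matches the idea of recording memory categories as artifacts while withholding their sensitive contents.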
Evidence markers
Incident-response rows should be supported by structured markers, not only narrative summaries. The marker stream is the bridge between role activity, case memory, metric computation, and public indexability.
- evidence.detected and evidence.missed track whether blue saw known ground-truth events.
- defense.action_proposed, defense.action_applied, and defense.action_effect connect intent to observed consequence.
- Planned coordination markers such as role.assignment, case.memory_updated, escalation.gate_requested, and escalation.gate_decision would make coordination and review latency visible once implemented.
- range.health and range.failure keep infrastructure faults separate from agent performance.
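To show how a marker stream could feed scoring, the sketch below groups markers into detection, action-chain, and range-health views. The marker type names are the ones listed above; the dictionary layout and the `event_id` and `action_id` fields are assumptions, since the marker schema is not specified here.

```python
from collections import defaultdict


def summarize_markers(markers):
    """Group a marker stream into detection, action-chain, and range-health views."""
    detected, missed = set(), set()
    chains = defaultdict(list)
    range_failures = 0
    for m in markers:
        kind = m["type"]
        if kind == "evidence.detected":
            detected.add(m["event_id"])
        elif kind == "evidence.missed":
            missed.add(m["event_id"])
        elif kind.startswith("defense.action_"):
            # Record the stage name (proposed / applied / effect) per action.
            chains[m["action_id"]].append(kind.removeprefix("defense.action_"))
        elif kind == "range.failure":
            range_failures += 1
    return {
        "detected": detected,
        "missed": missed,
        # An action counts as complete only if intent reached an observed effect.
        "complete_chains": sorted(
            action for action, stages in chains.items()
            if stages == ["proposed", "applied", "effect"]
        ),
        "range_failures": range_failures,
    }
```

Counting `range.failure` separately lets a scorer discount or rerun a match affected by infrastructure faults instead of attributing the outcome to the agents.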
GAMES metrics
Each match should produce measured blue-team artifacts that can be replayed and compared under named schemas. The goal is not to crown a production defender from one run, but to make progress, latency, safety, and coordination visible.
- Detection latency: time from ground-truth malicious activity to a relevant defender observation.
- Action-gate latency: time spent moving from proposed response to approved or rejected action.
- Coordination quality: whether roles share evidence, avoid duplicate work, preserve uncertainty, and converge on a coherent case.
- Safe action rate: share of defensive actions that stay inside authority, preserve range health, and avoid unnecessary collateral impact.
- Cost per confirmed detection: model, tool, and runtime cost required to produce a validated detection.
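The quantitative metrics above reduce to simple arithmetic once timestamps and action records are available. The sketch below assumes timestamps in seconds and hypothetical per-action flags (`within_authority`, `range_healthy`, `collateral`); coordination quality is left out because the description above is qualitative.

```python
from typing import Optional


def detection_latency(ground_truth_t: float, observed_t: float) -> float:
    """Seconds from ground-truth malicious activity to a relevant observation."""
    return observed_t - ground_truth_t


def action_gate_latency(requested_t: float, decided_t: float) -> float:
    """Seconds from a proposed response to its approval or rejection."""
    return decided_t - requested_t


def safe_action_rate(actions: list[dict]) -> Optional[float]:
    """Share of actions that stay in authority, keep the range healthy, and avoid collateral."""
    if not actions:
        return None  # undefined when no actions were taken
    safe = sum(
        1 for a in actions
        if a["within_authority"] and a["range_healthy"] and not a["collateral"]
    )
    return safe / len(actions)


def cost_per_confirmed_detection(total_cost: float, confirmed: int) -> Optional[float]:
    """Model, tool, and runtime cost divided by validated detections."""
    if confirmed == 0:
        return None  # avoid dividing by zero when nothing was confirmed
    return total_cost / confirmed
```

Returning `None` for an empty denominator, rather than zero, keeps a match with no actions or no confirmed detections from looking like a perfect or free result.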
Current boundary
The incident-response method is still being shaped by early range work, agentic role design, and hybrid scoring. Public pages should explain the method without implying mature rankings.
- Current benchmark rows remain research-note only.
- No public claim should say an agent is production-ready based on Purple Games performance.
- Safety redaction can keep exact commands, exploit payloads, private reasoning, or sensitive details private while preserving causal summaries and measured markers.