Provider Filtering

Frontier model providers run cybersecurity safety filters on inference requests. During the 2026-05-07 causal test, OpenAI's filter blocked GPT-5.5 mid-match for the same scenario that had run cleanly eight times before. Provider-side filtering is a structural risk for any AI-vs-AI cyber benchmark. This page documents the risk, the harness response, and what it means for benchmark reproducibility.

Status: Phase B (disclosed)

Source boundary

Filter behavior is opaque and changes over time. The list of patterns the harness recognizes is in harness/adapters/openrouter.py and is updated as new filter signatures appear.

What the filter does

OpenAI's cybersecurity safety filter inspects request content and refuses inference for prompts judged to be high-risk cyber activity. The refusal surfaces as an HTTP 502 from OpenRouter with a payload like {"message": "This request has been flagged for potentially high-risk cyber activity", "code": 502}. The same model and scenario can be served on one request and refused on the next.

  • Probabilistic — same prompt may pass on retry within minutes.
  • Provider-side and opaque — Purple Games cannot inspect the filter's logic.
  • Disproportionately affects red-side requests, since red is the role making attacker-shaped prompts.
  • Surfaced via OpenRouter pass-through; same behavior expected on direct provider APIs.
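
As illustration only, a minimal check for the filter-shaped refusal, assuming the payload shape quoted above; the function name and field handling are hypothetical, not the adapter's actual code:

    # Hypothetical sketch -- not the actual code in harness/adapters/openrouter.py.
    # Assumes the refusal arrives as an HTTP 502 whose payload carries the
    # message quoted above; names and field handling are illustrative.

    def looks_like_cyber_filter(status_code: int, payload: dict) -> bool:
        """Heuristic check: is this a cybersecurity-filter refusal rather than
        some other empty-choices failure such as a rate limit or routing miss?"""
        message = str(payload.get("message", ""))
        return status_code == 502 and "high-risk cyber" in message.lower()

    # Refusal shape as quoted in the text above (illustrative values):
    refusal = {
        "message": "This request has been flagged for potentially high-risk cyber activity",
        "code": 502,
    }
    assert looks_like_cyber_filter(refusal["code"], refusal)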

How the harness responds

The OpenRouter adapter recognizes the cybersecurity-filter response shape and retries the request up to two times with exponential backoff (3 s, then 9 s). The number of filter retries is recorded as a benchmark metadata field on the match bundle; a sketch of the policy follows the list below.

  • Retry triggers on response payloads matching 'high-risk cyber', 'safety-checks/cybersecurity', or 'violates our usage policies'.
  • Other empty-choices errors (rate limit, routing miss) are surfaced immediately — not all empty-choices failures are filter events.
  • The match bundle records the count of provider-filter retries so that operators can see the rate per cell, model, and scenario.
  • If retries exhaust, the harness raises with the upstream error message intact for diagnosability.
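
A minimal sketch of that retry policy, assuming a hypothetical send_request callable and UpstreamError exception type; these names are illustrative, not the adapter's actual API:

    # Hypothetical sketch of the retry policy described above; the function and
    # exception names are assumptions, not the adapter's actual API.
    import time

    FILTER_PATTERNS = (
        "high-risk cyber",
        "safety-checks/cybersecurity",
        "violates our usage policies",
    )
    BACKOFF_SECONDS = (3, 9)  # first retry after 3 s, second after 9 s


    class UpstreamError(Exception):
        """Stand-in for the error the adapter raises with the provider payload."""


    def call_with_filter_retries(send_request, request):
        """Return (response, filter_retry_count).

        Non-filter errors are re-raised immediately; filter-shaped errors are
        retried at most len(BACKOFF_SECONDS) times, and the upstream error
        message is preserved if retries exhaust.
        """
        retries = 0
        while True:
            try:
                return send_request(request), retries
            except UpstreamError as exc:
                if not any(p in str(exc) for p in FILTER_PATTERNS):
                    raise  # rate limit, routing miss, etc.: surface immediately
                if retries >= len(BACKOFF_SECONDS):
                    raise  # exhausted: keep the upstream error message intact
                time.sleep(BACKOFF_SECONDS[retries])
                retries += 1

The retry count returned here is what would populate the metadata field recorded on the match bundle.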

What this means for reproducibility

Filter divergence between providers, between API endpoints, and between time windows means the same scenario can score differently across runs through no fault of the model or scenario. Phase B and beyond must disclose and measure filter behavior.

  • Phase B: explicit fallback model selection per cell. If the canonical red model is filtered persistently, the cell uses the documented fallback model and is flagged in the corpus (a configuration sketch follows this list).
  • Phase C: methodology paper discloses the filter risk, the retry policy, and the fraction of attempted matches that hit the filter. Reproducibility section names which provider routes were used.
  • Phase D: certification submissions specify model + provider + route. Filter divergence between submissions is treated as a benchmark validity concern, not just a methodology footnote.
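
A minimal sketch of what per-cell fallback selection could look like, with hypothetical field names and a placeholder fallback identifier, not the published Phase B cell definitions:

    # Hypothetical per-cell configuration -- field names, cell label, and the
    # fallback identifier are placeholders, not the benchmark's actual cells.
    cell_config = {
        "cell": "A1",                              # placeholder cell label
        "red_model": "openai/gpt-5.5",             # canonical red model for the cell
        "red_fallback": "example/fallback-model",  # documented fallback if persistently filtered
        "fallback_used": False,                    # flagged True in the corpus when the fallback runs
    }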

What this is not

The filter is not a Purple Games defense. A model being refused service does not count toward blue or red on any rubric. It is purely a benchmark-substrate concern.

  • Filter rejection that exhausts retries cancels a match before scoring; the match does not produce a bundle.
  • Filter rejection that clears on retry produces a normal bundle with a metadata field flagging the retry count (see the sketch after this list).
  • The rate of filter rejections is reported but does not enter blue or red dimension scores.
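
For illustration, how the per-bundle retry count supports reporting without entering scores; the field names and cell labels are assumptions, not the published bundle schema:

    # Illustrative only -- field names and cell labels are assumptions.
    bundles = [
        {"cell": "A1", "red_model": "gpt-5.5", "provider_filter_retries": 0},
        {"cell": "A1", "red_model": "gpt-5.5", "provider_filter_retries": 2},
    ]

    # Fraction of scored matches that hit the filter at least once; reported
    # alongside the corpus, never folded into blue or red dimension scores.
    filter_hit_rate = sum(b["provider_filter_retries"] > 0 for b in bundles) / len(bundles)
    print(filter_hit_rate)  # 0.5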