Tuning Storm Detection

Storm detection is configured per agent rule, so each relay can have independent thresholds tuned to its alert volume and criticality.

stormThreshold

The number of alerts that must arrive within the time window to trigger a storm.

  • Type: number
  • Default: 5
  • Range: 2 to 50
{
  stormThreshold: 5
}

stormWindowSeconds

The time window (in seconds) used to count alerts for storm detection.

  • Type: number
  • Default: 60
  • Range: 10 to 300
{
  stormWindowSeconds: 60
}

maxImmediateDispatches

The number of agents dispatched immediately before storm hold kicks in. These first agents start working right away while the storm is still being evaluated.

  • Type: number
  • Default: 2
  • Range: 0 to 5
{
  maxImmediateDispatches: 2
}

Set to 0 to hold all dispatches during a storm. Set to a higher number if you want more agents working in parallel before triage.
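The interaction between these three settings can be sketched with a simple sliding-window count. This is a minimal illustration, not the product's implementation; the interface and function names below are invented for the example:

```typescript
// Hypothetical sketch of storm detection; names are illustrative only.
interface StormConfig {
  stormThreshold: number;         // alerts within the window that trigger a storm
  stormWindowSeconds: number;     // sliding window length in seconds
  maxImmediateDispatches: number; // agents dispatched before the hold applies
}

// True when the alert timestamps (epoch seconds) falling inside the
// window ending at `now` meet or exceed the threshold.
function isStorm(alertTimes: number[], now: number, cfg: StormConfig): boolean {
  const windowStart = now - cfg.stormWindowSeconds;
  const inWindow = alertTimes.filter((t) => t > windowStart && t <= now);
  return inWindow.length >= cfg.stormThreshold;
}

// Whether the nth alert of a storm (1-based) is dispatched immediately
// or held for triage.
function shouldDispatchImmediately(alertIndex: number, cfg: StormConfig): boolean {
  return alertIndex <= cfg.maxImmediateDispatches;
}
```

With the defaults, five alerts inside a 60-second window count as a storm, and only the first two alerts are dispatched before the hold applies.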

For a relay that receives many alerts normally, increase the threshold to avoid false storm detection:

{
  agentType: "devin",
  integrationId: "int_abc123",
  stormThreshold: 15,
  stormWindowSeconds: 120,
  maxImmediateDispatches: 3,
}

For a relay where every alert matters and storms should be detected quickly:

{
  agentType: "cursor",
  integrationId: "int_def456",
  repository: "https://github.com/org/critical-service",
  stormThreshold: 3,
  stormWindowSeconds: 30,
  maxImmediateDispatches: 1,
}

When a storm is detected, the triage job doesn’t fire immediately. Instead, it uses a debounced delay to wait for more alerts to arrive:

  • Each new alert pushes the triage job forward by 15 seconds
  • The maximum delay is capped at 90 seconds from when the storm was first detected
  • This ensures late-arriving alerts (which may include the actual root cause) are included in triage

For example, if a storm is detected at T=0:

  • Alert at T=5s pushes triage to T=20s
  • Alert at T=10s pushes triage to T=25s
  • Alert at T=20s pushes triage to T=35s
  • No more alerts arrive — triage fires at T=35s

If alerts keep arriving, triage still fires no later than T=90s.
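The debounce rule reduces to a pure function of when the storm started and when the latest alert arrived. The constants come from the text above; the function name is hypothetical:

```typescript
// Hypothetical sketch of the debounced triage schedule.
// Constants are taken from the documented behavior.
const DEBOUNCE_PUSH_SECONDS = 15; // each new alert pushes triage forward by this much
const MAX_DELAY_SECONDS = 90;     // hard cap, measured from storm detection

// Both arguments are in seconds relative to storm detection (T=0).
// Returns when triage should fire.
function nextTriageTime(stormStart: number, latestAlert: number): number {
  const pushed = latestAlert + DEBOUNCE_PUSH_SECONDS;
  const cap = stormStart + MAX_DELAY_SECONDS;
  return Math.min(pushed, cap);
}
```

This reproduces the worked example: an alert at T=20s schedules triage for T=35s, while an alert at T=85s is capped at T=90s.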

After the debounce window closes, the AI analyzes all collected storm alerts. The triage considers:

  1. Timing — earlier alerts are more likely to be the root cause
  2. Infrastructure level — lower-level alerts (database, network) rank higher than application-level alerts
  3. Severity — higher severity alerts with specific error details are more informative
  4. Error specificity — alerts with stack traces, connection errors, or specific error codes are preferred
  5. Causal relationships — patterns where one failure causes others (e.g., DB down causing API timeouts)
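The actual triage is performed by an AI model, but the ranking signals above can be illustrated with a hand-rolled score. The field names and weights here are invented for the example and are not the product's schema:

```typescript
// Hypothetical scoring sketch of the triage signals; illustrative only.
interface StormAlert {
  receivedAt: number;  // epoch seconds; earlier alerts are more likely root causes
  layer: "database" | "network" | "application";
  severity: number;    // higher = more severe
  hasStackTrace: boolean; // stands in for error specificity
}

function rootCauseScore(alert: StormAlert, stormStart: number): number {
  let score = 0;
  // 1. Timing: earlier alerts score higher.
  score += Math.max(0, 60 - (alert.receivedAt - stormStart));
  // 2. Infrastructure level: lower layers outrank application alerts.
  score += alert.layer === "application" ? 0 : 30;
  // 3. Severity.
  score += alert.severity * 10;
  // 4. Error specificity.
  if (alert.hasStackTrace) score += 20;
  return score;
}

// Pick the highest-scoring alert as the candidate root cause.
function pickRootCause(alerts: StormAlert[], stormStart: number): StormAlert {
  return alerts.reduce((best, a) =>
    rootCauseScore(a, stormStart) > rootCauseScore(best, stormStart) ? a : best);
}
```

In the "DB down causing API timeouts" pattern from the list, the earlier, lower-level database alert outscores the later application-level timeout, so it is the one handed to the coding agent.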

The identified root cause alert is then dispatched to a coding agent with the full storm context, so the agent understands it needs to fix the underlying issue rather than a symptom.