
Alerts

The Alerts screen surfaces fleet health problems as discrete, deduplicated events. Alerts are raised by background evaluators that run on the tower's sweeper schedule, and they resolve automatically when the underlying condition clears.

[Screenshot: Alerts page showing the active alerts panel and resolved alert history]

Severity levels

| Severity | Badge colour | When to act |
| --- | --- | --- |
| CRITICAL | Red | Immediate action required: an agent is blocked or a hard budget ceiling is breached. |
| WARNING | Amber | Something needs your attention soon: a soft threshold is crossed or an instance went offline. |
| INFO | Blue | Awareness only: an instance is stale or behind on version. |

Alert types

Budget breach

Raised when a squad's spend crosses a budget threshold. Two variants:

| Rule | Severity | Condition |
| --- | --- | --- |
| Hard budget breach | CRITICAL | A squad's month-to-date spend crossed the hard limit. If the tower limit is in Hard mode, agent runs are blocked until the month resets. |
| Soft threshold | WARNING | A squad's spend reached the soft threshold (default 80% of its monthly budget). |

The alert title identifies the squad and instance — for example: "Hard budget breach — Engineering Squad".
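The two variants can be sketched as a single classification step. This is a minimal illustration, not the evaluator's actual code: the function name, its parameters, and the choice of `>=` for "crossed" and "reached" are assumptions; only the 80% default comes from the table above.

```python
from typing import Optional

SOFT_THRESHOLD_DEFAULT = 0.80  # default soft threshold from the table above


def classify_squad_budget(
    mtd_spend: float,
    monthly_budget: float,
    hard_limit: float,
    soft_threshold: float = SOFT_THRESHOLD_DEFAULT,
) -> Optional[str]:
    """Return the severity for a squad's month-to-date spend, or None.

    Hypothetical helper: treats "crossed" and "reached" as >= comparisons.
    """
    if mtd_spend >= hard_limit:
        return "CRITICAL"  # hard budget breach
    if mtd_spend >= soft_threshold * monthly_budget:
        return "WARNING"  # soft threshold
    return None  # below both thresholds, nothing to raise
```

With a monthly budget and hard limit of $1,000, a spend of $850 would classify as WARNING and a spend of $1,000 as CRITICAL under these assumptions.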

Tower limit

Independent of what an instance reports, the tower checks each instance's month-to-date usage against the ceiling set in Budgets & Limits on every sweeper run. It compares its own cost_facts table against the resolved limit — dollar cost for metered runs, tokens for subscription runs.

| Rule | Severity | Condition |
| --- | --- | --- |
| Budget limit reached | CRITICAL | An instance's MTD spend or tokens reached the tower ceiling (observed ≥ ceiling). |
| Budget limit warning | WARNING | An instance's MTD spend or tokens crossed the warn threshold (observed ≥ warn% of ceiling) but is still below the ceiling. |

The detail line names the metric and progress — for example: "Control-tower cost limit reached: $48.00 of $48.00" or "Control-tower tokens limit at 85%: 8,500,000 of 10,000,000 tokens". Cost and token breaches are tracked separately, so an instance can raise both at once.
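The comparison in the table, applied once per metric, might look like the sketch below. The function names, the dictionary shapes, and the warn percentage of 80 used in the usage note are illustrative assumptions; the source states the conditions (observed ≥ ceiling, observed ≥ warn% of ceiling) but not how the tower implements them or what the warn percentage defaults to.

```python
from typing import Dict, Optional


def tower_limit_rule(observed: float, ceiling: float, warn_pct: float) -> Optional[str]:
    """Map one metric's month-to-date usage to a rule name, or None."""
    if ceiling <= 0:
        return None  # assumption: a non-positive ceiling means "no limit set"
    if observed >= ceiling:
        return "Budget limit reached"  # CRITICAL
    if observed >= ceiling * warn_pct / 100:
        return "Budget limit warning"  # WARNING
    return None


def evaluate_instance(
    usage: Dict[str, float], ceilings: Dict[str, float], warn_pct: float
) -> Dict[str, str]:
    """Check each metric on its own, since cost and tokens are tracked separately."""
    results: Dict[str, str] = {}
    for metric, ceiling in ceilings.items():
        rule = tower_limit_rule(usage.get(metric, 0.0), ceiling, warn_pct)
        if rule is not None:
            results[metric] = rule
    return results
```

Calling `evaluate_instance({"cost": 48.0, "tokens": 8_500_000}, {"cost": 48.0, "tokens": 10_000_000}, warn_pct=80)` would report a reached limit for cost and a warning for tokens, which mirrors the two example detail lines above.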

Info: These tower-side checks are separate from the instance-reported budget facts the agent control plane emits. The tower enforces its own ceiling as defence in depth, so a Tower limit alert can fire even when the instance has not reported a breach itself.

Both Tower limit alerts auto-resolve once the instance's spend or token usage drops back below the threshold or the month resets.

Instance offline

| Rule | Severity | Condition |
| --- | --- | --- |
| Instance offline | WARNING | An instance has missed 3 consecutive expected heartbeats. |

The status sweeper marks the instance offline and the alert evaluator raises this alert. Resolves automatically when the instance resumes reporting.
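One way to count missed heartbeats is to divide the time since the last heartbeat by the expected interval. The sketch below assumes the sweeper knows each instance's heartbeat interval and last-seen timestamp; the function names and the floor-division approach are illustrative, and only the threshold of 3 comes from the rule above.

```python
from datetime import datetime, timedelta

OFFLINE_AFTER_MISSED = 3  # threshold from the rule above


def missed_heartbeats(last_seen: datetime, now: datetime, interval: timedelta) -> int:
    """Count whole heartbeat intervals that have elapsed since last_seen."""
    elapsed = now - last_seen
    if elapsed < timedelta(0):
        return 0  # clock skew: treat a future timestamp as "just seen"
    return elapsed // interval  # timedelta // timedelta yields an int


def is_offline(last_seen: datetime, now: datetime, interval: timedelta) -> bool:
    """True once the hypothetical missed-heartbeat count reaches the threshold."""
    return missed_heartbeats(last_seen, now, interval) >= OFFLINE_AFTER_MISSED
```

With a 60-second interval, an instance last seen 200 seconds ago has missed 3 heartbeats in this model and would be marked offline; one last seen 100 seconds ago would not.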

Instance stale

| Rule | Severity | Condition |
| --- | --- | --- |
| Instance stale | INFO | An instance has been silent for more than 24 hours without being explicitly taken offline. |

Detail reads: "hostname · instanceId silent > 24h". Resolves automatically when the instance sends a heartbeat.
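The stale check and its detail line can be sketched as follows. The function name and the `explicitly_offline` flag are hypothetical stand-ins for however the tower records that an operator took an instance offline; the 24-hour window and the detail format come from the text above.

```python
from datetime import datetime, timedelta
from typing import Optional

STALE_AFTER = timedelta(hours=24)  # window from the rule above


def stale_detail(
    hostname: str,
    instance_id: str,
    last_seen: datetime,
    now: datetime,
    explicitly_offline: bool = False,
) -> Optional[str]:
    """Return the alert detail line if the instance is stale, else None."""
    if explicitly_offline:
        return None  # instances taken offline on purpose are excluded
    if now - last_seen > STALE_AFTER:
        return f"{hostname} · {instance_id} silent > 24h"
    return None
```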

Spend spike

| Rule | Severity | Condition |
| --- | --- | --- |
| Spend spike | WARNING | An instance's spend today is more than 3× its trailing 7-day daily average. |

The evaluator queries rollups_daily for the past 8 days: it compares today's aggregated cost against the average of the prior 7 days and flags the instance when today > avg × 3. Detail shows: "today $X.XX vs 7-day avg $Y.YY". Resolves automatically the next day if spend returns to normal.
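The comparison can be sketched in a few lines once the eight daily totals are in hand. The function name and the input shape (a list of daily costs, oldest first, with missing days already filled in as zero) are assumptions; the query against rollups_daily itself is not shown.

```python
from typing import Sequence, Tuple

SPIKE_MULTIPLIER = 3.0  # "more than 3x" from the rule above


def check_spend_spike(daily_costs: Sequence[float]) -> Tuple[bool, str]:
    """daily_costs holds 8 daily totals, oldest first; the last entry is today."""
    if len(daily_costs) != 8:
        raise ValueError("expected 8 daily totals: 7 prior days plus today")
    *prior, today = daily_costs
    avg = sum(prior) / len(prior)
    detail = f"today ${today:.2f} vs 7-day avg ${avg:.2f}"
    # Strict comparison, matching "today > avg x 3". A zero average with any
    # positive spend today therefore counts as a spike in this sketch.
    return today > avg * SPIKE_MULTIPLIER, detail
```

For seven prior days at $10.00 each, a spend of $35.00 today is flagged, while exactly $30.00 is not, because the comparison is strict.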

Version drift

| Rule | Severity | Condition |
| --- | --- | --- |
| Version drift | INFO | One or more instances are running a SLAW version below the highest version seen in the fleet. |

The fleet target is the maximum slawVersion reported across all enrolled instances. Detail lists which instances are behind and their current versions. Resolves when all instances have upgraded to the fleet target.
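Finding the fleet target and the instances behind it could look like the sketch below. It assumes slawVersion is a purely numeric dotted string such as "1.10.0", which the source does not confirm; versions with pre-release suffixes would need a different parser. The function names are illustrative.

```python
from typing import Dict, Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "1.10.0" into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def version_drift(
    versions_by_instance: Dict[str, str],
) -> Tuple[Optional[str], Dict[str, str]]:
    """Return (fleet target, {instance: version} for instances behind it)."""
    if not versions_by_instance:
        return None, {}
    target = max(versions_by_instance.values(), key=parse_version)
    target_key = parse_version(target)
    behind = {
        instance: version
        for instance, version in versions_by_instance.items()
        if parse_version(version) < target_key
    }
    return target, behind
```

Comparing parsed tuples rather than raw strings matters here: as plain strings, "1.9.2" would sort above "1.10.0".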

Skill catalog drift

| Rule | Severity | Condition |
| --- | --- | --- |
| Skill catalog drift | INFO | One or more instances have not yet acknowledged the current published catalog version from Skill Registry. |

Resolves automatically once every active instance acks the current catalog version on a sync heartbeat.
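The condition reduces to listing active instances whose last acknowledged catalog version differs from the published one. The sketch below assumes integer catalog versions and a mapping of instance to last acked version; both shapes, and the function name, are hypothetical.

```python
from typing import Dict, List, Optional, Set


def unacked_instances(
    current_version: int,
    acked_by_instance: Dict[str, Optional[int]],
    active_instances: Set[str],
) -> List[str]:
    """Return active instances that have not acked the current catalog version."""
    return sorted(
        instance
        for instance in active_instances
        # An instance with no recorded ack (missing or None) counts as unacked.
        if acked_by_instance.get(instance) != current_version
    )
```

An empty result corresponds to the resolved state: every active instance has acked the current version.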

Active and resolved tabs

Active — alerts that are currently raised. The count appears in the Fleet View KPI tile.

Resolved — the 20 most recently resolved alerts, dimmed. Use this to confirm that an auto-resolve occurred after a condition cleared.

Acknowledging an alert

Click Acknowledge on any active alert to mark it as seen. Acknowledging is a soft action — it does not resolve the alert or clear the underlying condition. The alert remains in the active list until the evaluator resolves it automatically.

Deduplication

Each alert is keyed by (rule, instanceFk, squadLocalId). If the same condition persists across multiple evaluator runs, no duplicate is created: the existing alert stays active until the condition clears and the evaluator resolves it.
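A keyed store makes this behaviour straightforward to model. The sketch below is an in-memory illustration only: the class name, the `reconcile` method, the integer type for instanceFk, and the use of None for squadLocalId on instance-level rules are all assumptions, and the real evaluator presumably persists alerts in a database.

```python
from typing import Dict, List, Optional, Tuple

# (rule, instanceFk, squadLocalId). Using None as squadLocalId for
# instance-level rules is an assumption, not something the source states.
AlertKey = Tuple[str, int, Optional[str]]


class AlertStore:
    """Hypothetical in-memory model of active and resolved alerts."""

    def __init__(self) -> None:
        self.active: Dict[AlertKey, str] = {}
        self.resolved: List[AlertKey] = []

    def reconcile(self, firing: Dict[AlertKey, str]) -> None:
        """Apply one evaluator run; firing maps each key to its detail line."""
        for key, detail in firing.items():
            # setdefault leaves an existing alert untouched, so a persisting
            # condition never creates a second entry.
            self.active.setdefault(key, detail)
        for key in [k for k in self.active if k not in firing]:
            # The condition cleared, so move the alert to the resolved history.
            del self.active[key]
            self.resolved.append(key)
```

Running `reconcile` twice with the same key leaves exactly one active alert; a later run in which that key no longer fires moves it to the resolved list.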


Next steps

  • Budgets & Limits — adjust spending ceilings to prevent hard-breach alerts.
  • Fleet View — see the active alert count in the top-bar KPI tile.
  • Cost Analytics — investigate spend spikes with the daily chart.