
Architecture Decisions

This page indexes the key architectural decisions made across SLAW and Botfather, with enough context for internal contributors to understand the why behind major design choices.

Audience

This is an internal reference for contributors. The product-facing architecture overview lives in How SLAW Works.


SLAW

Decision inventory

| # | Decision | Status | Source |
| --- | --- | --- | --- |
| S-1 | Control plane, not execution plane | Foundational | docs/start/architecture.md |
| S-2 | Squad-scoped data isolation | Foundational | docs/start/architecture.md |
| S-3 | Single-assignee atomic checkout | Foundational | docs/start/architecture.md |
| S-4 | Adapter-agnostic execution | Foundational | docs/start/architecture.md, packages/adapter-utils |
| S-5 | Embedded PostgreSQL by default | Foundational | packages/db |
| S-6 | Hand-authored migrations (drizzle-kit disabled) | Operational | See S-6 below |
| S-7 | "Operator" replaces "Board" | Delivered 2026-06-07 | DESIGN-board-to-operator.md |
| S-8 | Discipline leads replace C-suite roles | Plan (not yet built) | DESIGN-squad-lead-leads-based-instructions.md |
| S-9 | Dynamic adapter registry | In progress | adapter-plugin.md |
| S-10 | Role column is plain text, not a Postgres enum | Foundational | packages/db/src/schema/agents.ts |

S-1 — Control plane, not execution plane

Decision: SLAW orchestrates agents; it does not run them. The server schedules heartbeats, tracks issues, and calls adapters. The agent process runs wherever it runs (local CLI, container, HTTP webhook) and phones home via the REST API.

Why: keeping execution out of the server keeps SLAW lightweight, adapter-agnostic, and safe — a runaway agent can't crash the control plane, and any runtime that can call HTTP can be hired.

Consequence: SLAW has no concept of the agent's internal state during a run. Cost and result data flow back via the adapter result and the agent's own API calls.


S-2 — Squad-scoped data isolation

Decision: every entity (agent, issue, project, routine, skill, cost event) belongs to exactly one squad, enforced at the database layer. Queries are always squad-keyed; there is no global entity namespace.

Why: multi-squad is a first-class use case — one instance runs multiple squads with complete isolation. Cross-squad leaks (a rogue agent reading another squad's tasks) are prevented by construction, not policy.
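The "squad-keyed by construction" idea can be sketched at the application layer as follows. This is a minimal in-memory illustration, not the SLAW data layer: the `Issue` shape and the `SquadScopedIssues` class are hypothetical, and the real enforcement lives in the database.

```typescript
// Hypothetical types for illustration; not the actual SLAW schema.
interface Issue {
  id: string;
  squadId: string;
  title: string;
}

// Every read takes a squadId. There is deliberately no "get by id alone"
// method, so a caller cannot reach across squads by accident.
class SquadScopedIssues {
  private rows: Issue[] = [];

  insert(issue: Issue): void {
    this.rows.push(issue);
  }

  // Returns the issue only if it belongs to the caller's squad.
  get(squadId: string, id: string): Issue | undefined {
    return this.rows.find((r) => r.squadId === squadId && r.id === id);
  }

  list(squadId: string): Issue[] {
    return this.rows.filter((r) => r.squadId === squadId);
  }
}

const issues = new SquadScopedIssues();
issues.insert({ id: "i-1", squadId: "squad-a", title: "Fix login" });
issues.insert({ id: "i-2", squadId: "squad-b", title: "Ship docs" });
```

Here `issues.get("squad-b", "i-1")` finds nothing even though `i-1` exists, because the squad key is part of every lookup.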


S-3 — Single-assignee atomic checkout

Decision: moving an issue to in_progress requires an explicit POST /api/issues/:id/checkout that checks and sets the assignee atomically. Simultaneous claims return 409 Conflict. An agent that receives 409 must stop and pick different work.

Why: without an atomic ownership gate, two agents could concurrently start the same task and produce conflicting output. The checkout is the "grab the physical card" primitive.
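A rough model of the checkout gate, assuming an in-memory table in place of the database. The `IssueRow` shape, the `checkout` function, and the 404 branch are assumptions for illustration; only the endpoint and the 409 behavior come from the decision above. How the real endpoint treats a repeat checkout by the same agent is not specified here, so this sketch rejects any second claim.

```typescript
// Hypothetical in-memory model of the checkout gate; not the SLAW implementation.
interface IssueRow {
  id: string;
  status: "todo" | "in_progress";
  assigneeId: string | null;
}

type CheckoutResult =
  | { httpStatus: 200; issue: IssueRow }
  | { httpStatus: 404 } // assumed behavior for an unknown issue id
  | { httpStatus: 409 };

const issueTable = new Map<string, IssueRow>();

function checkout(issueId: string, agentId: string): CheckoutResult {
  const row = issueTable.get(issueId);
  if (row === undefined) return { httpStatus: 404 };
  // The check and the set happen in one synchronous step, so no second
  // claim can slip in between them.
  if (row.status === "in_progress") return { httpStatus: 409 };
  row.status = "in_progress";
  row.assigneeId = agentId;
  return { httpStatus: 200, issue: row };
}

issueTable.set("i-1", { id: "i-1", status: "todo", assigneeId: null });
const firstClaim = checkout("i-1", "agent-a"); // wins the issue
const secondClaim = checkout("i-1", "agent-b"); // must stop and pick other work
```

Against a real database the same check-and-set would need to be a single conditional statement or a transaction, so that two concurrent requests cannot both pass the check.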


S-4 — Adapter-agnostic execution

Decision: every agent has an adapterType + adapterConfig blob. The adapter's execute() function is the only interface between SLAW and the agent runtime. Adapters are packages (packages/adapters/*) with three modules: server (execution), UI (run viewer parser + config form), CLI (terminal formatter).

Built-in adapters: claude_local, codex_local, gemini_local, opencode_local, cursor, pi_local, hermes_local, process, http.

External adapters (in progress): the adapter registry on both server and UI is being converted to a mutable map (registerServerAdapter / registerUIAdapter) so plugins can contribute adapters at startup. See adapter-plugin.md and S-9.


S-5 — Embedded PostgreSQL by default

Decision: SLAW ships PGlite (embedded Postgres) as the zero-config default. An external PostgreSQL 17 is available for production deployments. The same Drizzle schema and migration chain runs on both.

Why: "clone and run" onboarding (npx slaw onboard --yes && npx slaw run) must not require a database. PGlite makes the dev and single-user experience friction-free; teams that need durability opt into external Postgres.


S-6 — Hand-authored migrations (drizzle-kit disabled)

Decision: drizzle-kit generate is disabled in this repo. Migrations are hand-authored idempotent SQL files named packages/db/src/migrations/00NN_<slug>.sql, accompanied by a journal entry appended to _journal.json. The custom applier applyPendingMigrationsManually runs them in order.

Why: drizzle-kit's snapshot mechanism drifted when the Paperclip workspace_* tables were removed, causing it to emit a spurious 200-line migration on every invocation. Rather than fight the snapshot, the team adopted the hand-authored pattern already established for several migrations (0094–0097). Each migration is written as CREATE TABLE IF NOT EXISTS / ALTER TABLE IF EXISTS / UPDATE … WHERE …, so re-running it is a no-op.

Pattern for new migrations:

  1. Write packages/db/src/migrations/00NN_<slug>.sql as idempotent SQL.
  2. Append an entry to _journal.json: { "idx": NN, "version": "7", "tag": "00NN_<slug>", "breakpoints": true }.
  3. Update the Drizzle schema file(s) to match.
  4. Run the server once to confirm the migration applies cleanly.
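The steps above can be sketched as follows. The `JournalEntry` shape mirrors the entry shown in step 2; everything else is hypothetical: `pendingMigrations`, the sample tags, and the `example_widgets` table are made up for illustration, and the internals of applyPendingMigrationsManually are not shown in this document.

```typescript
// Shape of one journal entry, as given in step 2.
interface JournalEntry {
  idx: number;
  version: string;
  tag: string;
  breakpoints: boolean;
}

// Hypothetical helper: which journal entries still need to run, in order.
function pendingMigrations(
  entries: JournalEntry[],
  appliedTags: ReadonlySet<string>,
): JournalEntry[] {
  return entries
    .filter((e) => !appliedTags.has(e.tag))
    .sort((a, b) => a.idx - b.idx);
}

// Illustrative journal, deliberately out of order. Tags are made up.
const journal: JournalEntry[] = [
  { idx: 3, version: "7", tag: "0003_backfill", breakpoints: true },
  { idx: 1, version: "7", tag: "0001_init", breakpoints: true },
  { idx: 2, version: "7", tag: "0002_add_labels", breakpoints: true },
];

// What an idempotent migration body can look like: every statement is safe
// to run a second time.
const exampleMigrationSql = `
CREATE TABLE IF NOT EXISTS example_widgets (
  id text PRIMARY KEY
);
ALTER TABLE IF EXISTS example_widgets
  ADD COLUMN IF NOT EXISTS label text;
UPDATE example_widgets SET label = 'unlabeled' WHERE label IS NULL;
`;

const toRun = pendingMigrations(journal, new Set(["0001_init"]));
```

Because each file is idempotent, an applier that crashes midway can simply run the pending list again.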

S-7 — "Operator" replaces "Board"

Status: Delivered 2026-06-07 (migration 0097_board_to_operator, all packages clean).

Decision: the instance-level human governance actor is called Operator, not "Board". The UI kanban task board deliberately keeps the word "Board" (it is a literal board, not a governance role).

Rationale: "Board" was inherited from Paperclip's multi-company model where a Board governed many Companies. SLAW's enterprise squad model has no multi-company governor; the per-instance full-control human actor is the Operator. Every tier now has a distinct noun: Botfather → Instance → Operator → Squad → Squad Lead → Agents.

Key changes:

  • Actor type "board" → "operator"; source "board_key" → "operator_key".
  • Local actor id local-board → local-operator (rekeyed in 0097).
  • DB table board_api_keys → operator_api_keys; token prefix pcp_board_* → slaw_op_*.
  • Auth service, CLI auth, UI routes, and copy all renamed. See DESIGN-board-to-operator.md for the complete rename map.

Upgrade note for existing dev instances: migration 0097 does structural renames only. An existing dev DB should be wiped and re-seeded (WIPE-AND-RESEED.sh) rather than migrated in place, because local-board user-id values are stored across ~40 columns and partial rekeys left the profile page broken in testing. Fresh instances seed directly as local-operator.


S-8 — Discipline leads replace C-suite roles

Status: Plan (not yet built) — DESIGN-squad-lead-leads-based-instructions.md.

Decision: rename the first-tier agent roles from C-suite (cto, cmo, cfo) to discipline leads (engineering_lead, marketing_lead, finance_lead). Rewrite the Squad Lead's default AGENTS.md to route work by discipline (Engineering Lead, Design Lead, Product Lead, Marketing Lead, QA Lead) rather than to a fixed C-suite org chart.

Why: the Squad model uses "Squad Lead", not "CEO". The first agent auto-created by the Squad Lead onboarding used CEO/CTO/CMO terminology inherited from Paperclip, which now conflicts with the language the rest of the product uses. The role column is plain text (S-10), so renaming is a simple UPDATE + constants change — no Postgres enum migration.

Note: SOUL.md (the persona file) is intentionally left untouched this pass.


S-9 — Dynamic adapter registry

Status: In progress (feat/external-adapter-phase1).

Decision: convert the server and UI adapter registries from static enum-backed maps to mutable registries (registerServerAdapter / registerUIAdapter / unregisterServerAdapter / unregisterUIAdapter). Runtime validation moves to the server routes (assertKnownAdapterType), so the shared schema accepts any non-empty string and the server is the source of truth.

Why: third-party and plugin-contributed adapters must be able to register at startup without forking the core codebase. The current enum-backed approach requires SLAW to enumerate every adapter at build time.

Scope of phase 1: mutable server + UI registries; AgentConfigForm and NewAgent derive their adapter lists from listUIAdapters(); shared validators accept open-ended input. Surfaces not yet touched: NewAgentDialog, OnboardingWizard, InviteLanding, plugin manifest loader.
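A minimal sketch of a mutable server-side registry. The function names come from the decision above, but their signatures and the `ServerAdapter` shape are assumptions; the real code on feat/external-adapter-phase1 may differ.

```typescript
// Hypothetical adapter shape; only execute() is named in this document.
interface ServerAdapter {
  type: string;
  execute(input: { prompt: string }): Promise<{ ok: boolean }>;
}

// A plain Map replaces the build-time enum, so plugins can add entries at startup.
const serverAdapters = new Map<string, ServerAdapter>();

function registerServerAdapter(adapter: ServerAdapter): void {
  serverAdapters.set(adapter.type, adapter);
}

function unregisterServerAdapter(type: string): boolean {
  return serverAdapters.delete(type);
}

// Route-level validation: the shared schema accepts any non-empty string,
// and the server rejects types that nobody has registered.
function assertKnownAdapterType(type: string): void {
  if (!serverAdapters.has(type)) {
    throw new Error(`Unknown adapter type: ${type}`);
  }
}

// A plugin contributing a made-up adapter at startup.
registerServerAdapter({
  type: "example_plugin",
  execute: async () => ({ ok: true }),
});
```

Moving validation to the routes means an agent row can name an adapter the shared package has never heard of, as long as the running server has it registered.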


S-10 — Role column is plain text, not a Postgres enum

Decision: agents.role is text NOT NULL DEFAULT 'general', not a Postgres ENUM type. Application-layer validation uses the AGENT_ROLES array in packages/shared/src/constants.ts.

Why: renaming a Postgres enum value requires ALTER TYPE … RENAME VALUE, a DDL operation that takes a lock. With a plain-text column, renaming a value is a safe UPDATE agents SET role = 'new_value' WHERE role = 'old_value', the pattern used in S-8. No foreign keys reference the role value, so renames have no cross-table impact.
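A sketch of application-layer role validation over a plain-text column. The role values listed here are illustrative only (the real list lives in packages/shared/src/constants.ts), and `isAgentRole` and `renameRole` are hypothetical helpers.

```typescript
// Illustrative values only; see packages/shared/src/constants.ts for the real list.
const AGENT_ROLES = [
  "general",
  "engineering_lead",
  "marketing_lead",
  "finance_lead",
] as const;

type AgentRole = (typeof AGENT_ROLES)[number];

// Type guard used where a Postgres enum would otherwise have done the checking.
function isAgentRole(value: string): value is AgentRole {
  return (AGENT_ROLES as readonly string[]).indexOf(value) !== -1;
}

// In-memory stand-in for:
//   UPDATE agents SET role = 'new_value' WHERE role = 'old_value'
function renameRole(
  rows: { role: string }[],
  oldRole: string,
  newRole: AgentRole,
): number {
  let changed = 0;
  for (const row of rows) {
    if (row.role === oldRole) {
      row.role = newRole;
      changed += 1;
    }
  }
  return changed;
}

const agentRows = [{ role: "cto" }, { role: "general" }];
const renamed = renameRole(agentRows, "cto", "engineering_lead");
```

The rename touches data and one constants array; no type definition in the database has to change.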


Botfather

Decision inventory

| # | Decision | Status | Source |
| --- | --- | --- | --- |
| B-1 | Sovereignty-first: push not pull, metadata not content | Foundational | ARCHITECTURE.md §1 |
| B-2 | Auto-enrollment + approval queue (no end-user tokens) | Delivered 2026-06-06 | ARCHITECTURE.md §6 |
| B-3 | Fail-open for enrolled, fail-closed for unenrolled | Delivered 2026-06-06 | ARCHITECTURE.md §6.4 |
| B-4 | Identity on (machineId, instanceId) until EntraID | Foundational | ARCHITECTURE.md §3 |
| B-5 | At-least-once sync with cursor deduplication | Foundational | ARCHITECTURE.md §4.3 |
| B-6 | Centrally-governed budget limits via back-channel | Delivered 2026-06-06 | DESIGN-budget-limits.md |
| B-7 | Tower-mastered skill registry | Delivered 2026-06-07 | DESIGN-skill-registry.md |
| B-8 | One binary, one Postgres (mirror SLAW stack) | Foundational | ARCHITECTURE.md §1 |
| B-9 | Tower-authoritative config flows down, not up | Foundational | ARCHITECTURE.md §10 |

B-1 — Sovereignty-first: push not pull, metadata not content

Decision: SLAW instances must remain fully functional when Botfather is unreachable. All communication is initiated outbound by the instance over HTTPS — no inbound ports on desktops. The tower receives metadata and metrics only: names, statuses, counts, costs. Issue bodies, comments, agent configs, secrets, run logs, and code never leave the instance.

The sovereignty boundary is the product headline. What is synced: squad names/status, agent titles/roles/status, project names, issue titles + status (opt-out via reportIssueTitles: false), cost events. What never leaves: agent adapter config, secrets, issue bodies/comments, run output.
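One way to picture the boundary is as a projection that copies only allow-listed fields into the outbound payload. The `LocalIssue` and `SyncedIssue` shapes and the `toSyncedIssue` function are hypothetical; only the reportIssueTitles flag and the synced/never-synced split come from the text above.

```typescript
// Hypothetical local record; body and comments must never leave the instance.
interface LocalIssue {
  id: string;
  title: string;
  status: string;
  body: string;
  comments: string[];
}

// Hypothetical outbound shape: metadata only.
interface SyncedIssue {
  localId: string;
  status: string;
  title?: string;
}

// Build the payload from an allow-list rather than deleting fields from a copy,
// so a newly added local field is excluded by default.
function toSyncedIssue(
  issue: LocalIssue,
  opts: { reportIssueTitles: boolean },
): SyncedIssue {
  const synced: SyncedIssue = { localId: issue.id, status: issue.status };
  if (opts.reportIssueTitles) {
    synced.title = issue.title;
  }
  return synced;
}

const localIssue: LocalIssue = {
  id: "i-1",
  title: "Rotate signing keys",
  status: "in_progress",
  body: "sensitive details",
  comments: ["internal note"],
};

const withTitle = toSyncedIssue(localIssue, { reportIssueTitles: true });
const withoutTitle = toSyncedIssue(localIssue, { reportIssueTitles: false });
```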


B-2 — Auto-enrollment + approval queue (no end-user tokens)

Status: Delivered 2026-06-06, replacing an earlier enrollment-token design.

Decision: on first start with a botfather.url configured, the instance self-enrolls by POST /api/ingest/v1/enroll with its identity (no token). The instance lands in pending in the tower's Approval Queue. An admin approves (manually, or via auto-approve rules matching hostname/machineId patterns). On approval, the tower issues a per-instance API key the instance stores in its secrets store. All subsequent calls use Bearer auth.

Why no end-user tokens: handing tokens to every end user creates a credential management problem at fleet scale. Auto-enrollment with admin-controlled approval gives the same security properties with zero user friction.


B-3 — Fail-open for enrolled, fail-closed for unenrolled

Decision: two enforcement modes, set via botfather.enforcement in instance config (delivered by IT/MDM):

  • enforce (default): a never-enrolled instance stays gated at the startup UI and keeps retrying until approved. An already-enrolled instance with a valid cached API key is allowed to run (fail-open for the enrolled) and keeps spooling until the tower returns. A working machine is not held hostage by tower downtime; a new machine cannot bypass the tower by pulling the network cable.
  • advisory: the startup gate is dismissible; the instance reports best-effort. For pilots.

Why: enterprise policy requires that new endpoints be admitted to the fleet before they can run. But forcing enrolled instances offline during tower maintenance would create unacceptable disruption for the teams using them.
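The two modes reduce to a small decision function. This is a sketch under assumptions: `decideStartup`, the `StartupDecision` labels, and the `hasValidCachedApiKey` field are hypothetical names, not the shipped implementation.

```typescript
type Enforcement = "enforce" | "advisory";

// Hypothetical outcome labels for the startup gate.
type StartupDecision = "run" | "gate_until_approved" | "gate_dismissible";

interface EnrollmentState {
  // True once the instance has been approved and holds a per-instance API key.
  hasValidCachedApiKey: boolean;
}

// Tower reachability is deliberately not an input: an enrolled instance runs
// whether or not the tower answers, and an unenrolled one is gated either way.
function decideStartup(
  mode: Enforcement,
  state: EnrollmentState,
): StartupDecision {
  if (state.hasValidCachedApiKey) {
    return "run"; // fail-open for the enrolled
  }
  return mode === "enforce" ? "gate_until_approved" : "gate_dismissible";
}

const enrolledDuringOutage = decideStartup("enforce", { hasValidCachedApiKey: true });
const newMachineOffline = decideStartup("enforce", { hasValidCachedApiKey: false });
const pilotMachine = decideStartup("advisory", { hasValidCachedApiKey: false });
```

Keeping network state out of the decision is what stops a new machine from bypassing the gate by disconnecting.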


B-4 — Identity on (machineId, instanceId) until EntraID

Decision: the primary identity tuple is (machineId, instanceId). machineId is derived once per machine from the OS hardware ID (hashed with an app salt) and stored at ~/.slaw/machine.json — shared across all instances on one box. instanceId is the existing SLAW_INSTANCE_ID value (default "default"). Child entities are keyed as (machineId, instanceId, localId).

Future path: a nullable userPrincipal column is reserved on the instances table from day one. When EntraID arrives, the enrollment exchange will accept an Entra token, auto-approval will key off Entra group membership, and the UI will group by person. No schema migration is needed for the EntraID upgrade.


B-5 — At-least-once sync with cursor deduplication

Decision: the instance keeps a botfather_sync_state cursor table holding a per-entity high-water mark (updatedAt or max id). Fact events are read strictly above the last-acknowledged cursor. The sync response acknowledges each batch with the new cursor, and the instance advances its cursor only on ack. Botfather deduplicates facts on (machineId, instanceId, localId), so delivery is effectively exactly-once despite at-least-once transport.

Why not exactly-once at the transport layer: HTTP over internet paths drops and retries. Exactly-once transport requires coordination that adds latency and complexity. Dedup on a stable natural key is cheaper and correct.
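A small simulation of a lost ack shows why the two halves fit together. All names here (`Fact`, `TowerStore`, `InstanceSync`, the `seq` field) are hypothetical and the wire format is not modeled; the sketch only demonstrates cursor-on-ack plus dedup on the natural key.

```typescript
// Hypothetical shapes; field names are illustrative, not the Botfather wire format.
interface Fact {
  localId: string;
  seq: number; // stands in for the updatedAt / max-id high-water mark
  payload: string;
}

class TowerStore {
  private facts = new Map<string, Fact>();

  // Upsert on the natural key, then ack with the highest seq in the batch.
  ingest(machineId: string, instanceId: string, batch: Fact[]): number | null {
    let ack: number | null = null;
    for (const fact of batch) {
      const key = JSON.stringify([machineId, instanceId, fact.localId]);
      this.facts.set(key, fact); // a replayed fact overwrites itself
      ack = ack === null ? fact.seq : Math.max(ack, fact.seq);
    }
    return ack;
  }

  size(): number {
    return this.facts.size;
  }
}

class InstanceSync {
  cursor = 0;
  private log: Fact[];

  constructor(log: Fact[]) {
    this.log = log;
  }

  // Facts strictly above the last acknowledged cursor.
  nextBatch(): Fact[] {
    return this.log.filter((f) => f.seq > this.cursor);
  }

  // The cursor moves only when the tower has acknowledged a batch.
  onAck(ack: number | null): void {
    if (ack !== null && ack > this.cursor) this.cursor = ack;
  }
}

const tower = new TowerStore();
const instance = new InstanceSync([
  { localId: "issue-1", seq: 1, payload: "created" },
  { localId: "issue-2", seq: 2, payload: "created" },
]);

// First attempt: the tower stores the batch, but the ack is lost in transit.
tower.ingest("m-1", "default", instance.nextBatch());
// Retry: the cursor never advanced, so the same batch is sent again.
const ack = tower.ingest("m-1", "default", instance.nextBatch());
instance.onAck(ack);
```

After the retry the tower still holds two facts, not four, and the instance has nothing left to send.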


B-6 — Centrally-governed budget limits via back-channel

Status: Delivered 2026-06-06. See DESIGN-budget-limits.md for the complete protocol and implementation record.

Decision: Botfather governs cost (cents) and token ceilings fleet-wide. An enterprise-default limit is set once and propagates to all instances; per-instance overrides are available. The resolved limit rides the heartbeat/sync response directives array as a {kind:"set_limits", limit} directive — the existing push-only back-channel. No inbound connection is opened.

Key design points:

  • Tower caps; local can be stricter. The tower limit is an additional ceiling layered on top of existing squad/agent budgets. Local budgets are never weakened.
  • Metric is plan-aware. Cost metric for metered (API billing); token metric for subscription. One LimitSpec carries both ceilings; the enforcer picks the relevant one by billingType.
  • Enforcement modes: off | soft (warn only) | hard (block runs at ceiling).
  • SLAW-side: a new instance_limits singleton table stores the last-applied limit. getInvocationBlock() gains an instance-wide gate (all ~9 call sites inherit it). evaluateCostEvent() warns at warnPercent and pauses at ceiling when hard.
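The design points above can be sketched as a single evaluation function. The field names on `LimitSpec` and the `evaluateLimit` helper are assumptions for illustration; DESIGN-budget-limits.md holds the actual protocol, and the real logic is split between getInvocationBlock() and evaluateCostEvent().

```typescript
type BillingType = "metered" | "subscription";
type EnforcementMode = "off" | "soft" | "hard";

// Hypothetical field names; one spec carries both ceilings.
interface LimitSpec {
  mode: EnforcementMode;
  costCeilingCents: number;
  tokenCeiling: number;
  warnPercent: number; // for example 80
}

type LimitOutcome = "ok" | "warn" | "block";

function evaluateLimit(
  spec: LimitSpec,
  billingType: BillingType,
  usage: { costCents: number; tokens: number },
): LimitOutcome {
  if (spec.mode === "off") return "ok";

  // Plan-aware metric: cost for metered billing, tokens for subscription.
  const metered = billingType === "metered";
  const ceiling = metered ? spec.costCeilingCents : spec.tokenCeiling;
  const used = metered ? usage.costCents : usage.tokens;

  if (used >= ceiling) {
    // soft only warns; hard blocks further runs.
    return spec.mode === "hard" ? "block" : "warn";
  }
  if (used >= (ceiling * spec.warnPercent) / 100) return "warn";
  return "ok";
}

const hardSpec: LimitSpec = {
  mode: "hard",
  costCeilingCents: 1000,
  tokenCeiling: 50000,
  warnPercent: 80,
};
```

For instance, a metered instance at 800 cents against this spec is at the warn threshold, while a subscription instance with the same cost but few tokens is unaffected.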

B-7 — Tower-mastered skill registry

Status: Delivered 2026-06-07. See DESIGN-skill-registry.md for the complete record.

Decision: Botfather is the single source of truth for skills across the fleet. Skills are authored and published in the tower's Skill Registry (skill_library table: key, name, description, markdown body, monotonic version, draft/publish/deprecate lifecycle). Connected instances pull the catalog via GET /api/ingest/v1/skills and install chosen skills onto local squads. The skills_updated directive in the heartbeat/sync back-channel hints that the catalog has advanced; the instance pulls on its own schedule.

Sovereignty constraint: the skill body flows tower → instance (governed content). Squad composition — which agents exist, what they do, who reports to whom — stays entirely local. Botfather has no opinion about squad structure.

Local authoring lock: when an instance is connected and enrolled, local skill create/import/edit routes return 409 skills_managed_by_tower. Standalone instances (no botfather.url) keep local authoring. Tower-managed skills display a "Control tower · vN" badge and are read-only.
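The authoring lock can be pictured as a guard in front of the local skill write routes. This is a hypothetical sketch: `InstanceConfig`, `checkSkillWrite`, and the decision shape are illustrative names. The text does not say how an instance that is configured but still pending approval behaves, so the sketch treats it as unlocked.

```typescript
// Hypothetical config view; botfatherUrl stands in for botfather.url.
interface InstanceConfig {
  botfatherUrl?: string;
  enrolled: boolean;
}

type SkillWriteDecision =
  | { allowed: true }
  | { allowed: false; status: 409; error: "skills_managed_by_tower" };

// Guard for local skill create/import/edit routes.
function checkSkillWrite(cfg: InstanceConfig): SkillWriteDecision {
  const towerManaged = cfg.botfatherUrl !== undefined && cfg.enrolled;
  return towerManaged
    ? { allowed: false, status: 409, error: "skills_managed_by_tower" }
    : { allowed: true };
}

const standalone = checkSkillWrite({ enrolled: false });
const connected = checkSkillWrite({
  botfatherUrl: "https://tower.example",
  enrolled: true,
});
```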


B-8 — One binary, one Postgres (mirror SLAW stack)

Decision: Botfather reuses the SLAW technology stack (Node.js + Express + React + Vite + PostgreSQL + Drizzle + pnpm monorepo) rather than introducing a separate backend or cloud service.

Why: Botfather is self-hosted enterprise infrastructure. Operators who already run SLAW understand the stack. Reusing it means Botfather can share packages/shared code, follows the same migration discipline, and requires no new operational knowledge.


B-9 — Tower-authoritative config flows down, not up

Decision: all data normally flows instance → tower (squads, agents, issues, cost facts). Budget limits (B-6) and skill catalog (B-7) are explicit exceptions: they are tower-authoritative config that flows tower → instance. No instance can write to enterprise_limits, instance_limit_overrides, or skill_library — these are admin-only tables on the tower.

Why explicitly called out: the push-only, instance-initiated model is a security principle. Every tower-authoritative config item must be a justified exception: it must be a typed, bounded directive (never arbitrary code), transmitted through the existing response back-channel, and version-de-duped by the instance. New "tower pushes to instance" features must follow this pattern.
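The pattern can be sketched as a closed union of directive kinds plus a per-kind version check. The two kinds come from B-6 and B-7; the `version` field, the `limit` payload shape, `parseDirective`, and `DirectiveApplier` are assumptions made for this illustration.

```typescript
// Closed set of directive kinds. Anything else is rejected, never executed.
type Directive =
  | { kind: "set_limits"; version: number; limit: { costCeilingCents: number } }
  | { kind: "skills_updated"; version: number };

// Validate an untrusted value from the sync response into a typed directive.
function parseDirective(raw: unknown): Directive | null {
  if (typeof raw !== "object" || raw === null) return null;
  const record = raw as Record<string, unknown>;
  const kind = record["kind"];
  const version = record["version"];
  if (typeof version !== "number") return null;

  if (kind === "skills_updated") return { kind, version };
  if (kind === "set_limits") {
    const limit = record["limit"];
    if (typeof limit === "object" && limit !== null) {
      const ceiling = (limit as Record<string, unknown>)["costCeilingCents"];
      if (typeof ceiling === "number") {
        return { kind, version, limit: { costCeilingCents: ceiling } };
      }
    }
  }
  return null; // unknown kind or malformed payload
}

// Applies each directive kind at most once per version.
class DirectiveApplier {
  private lastApplied = new Map<string, number>();
  readonly applied: string[] = [];

  apply(directive: Directive): boolean {
    const last = this.lastApplied.get(directive.kind);
    if (last !== undefined && directive.version <= last) return false;
    this.lastApplied.set(directive.kind, directive.version);
    this.applied.push(`${directive.kind}@${directive.version}`);
    return true;
  }
}

const applier = new DirectiveApplier();
const limitsV1 = parseDirective({
  kind: "set_limits",
  version: 1,
  limit: { costCeilingCents: 5000 },
});
```

Because the same directive may ride several consecutive sync responses, the version check keeps repeated delivery harmless.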


Cross-cutting

Stack and toolchain choices

| Concern | Choice | Notes |
| --- | --- | --- |
| Package manager | pnpm 9 workspaces | Pin via packageManager field; corepack enable to activate |
| ORM | Drizzle | Schema as code; migrations are hand-authored (see S-6) |
| UI framework | React 19, Vite 6, TanStack Query, Radix UI, Tailwind CSS 4 | |
| Auth | Better Auth | Sessions + API keys |
| DB default | PGlite (embedded) | PostgreSQL 17 for production |
| Wire format | JSON over HTTPS, gzipped for ingest payloads | |
| Protocol versioning | Versioned envelope (protocolVersion); tower accepts N and N−1 | |

Migration discipline

Both repos hand-author migrations. The pattern:

  1. Write an idempotent .sql file in packages/db/src/migrations/.
  2. Append a journal entry to _journal.json.
  3. Apply via applyPendingMigrationsManually (SLAW) or drizzle-kit migrate (Botfather, which doesn't have the snapshot drift issue).
  4. Never edit a shipped migration — write a new forward-only one instead.
