Architecture Decisions
This page indexes the key architectural decisions made across SLAW and Botfather, with enough context for internal contributors to understand the why behind major design choices.
This is an internal reference for contributors. The product-facing architecture overview lives in How SLAW Works.
SLAW
Decision inventory
| # | Decision | Status | Source |
|---|---|---|---|
| S-1 | Control plane, not execution plane | Foundational | docs/start/architecture.md |
| S-2 | Squad-scoped data isolation | Foundational | docs/start/architecture.md |
| S-3 | Single-assignee atomic checkout | Foundational | docs/start/architecture.md |
| S-4 | Adapter-agnostic execution | Foundational | docs/start/architecture.md, packages/adapter-utils |
| S-5 | Embedded PostgreSQL by default | Foundational | packages/db |
| S-6 | Hand-authored migrations (drizzle-kit disabled) | Operational | See S-6 below |
| S-7 | "Operator" replaces "Board" | Delivered 2026-06-07 | DESIGN-board-to-operator.md |
| S-8 | Discipline leads replace C-suite roles | Plan (not yet built) | DESIGN-squad-lead-leads-based-instructions.md |
| S-9 | Dynamic adapter registry | In progress | adapter-plugin.md |
| S-10 | Role column is plain text, not a Postgres enum | Foundational | packages/db/src/schema/agents.ts |
S-1 — Control plane, not execution plane
Decision: SLAW orchestrates agents; it does not run them. The server schedules heartbeats, tracks issues, and calls adapters. The agent process runs wherever it runs (local CLI, container, HTTP webhook) and phones home via the REST API.
Why: keeping execution out of the server keeps SLAW lightweight, adapter-agnostic, and safe — a runaway agent can't crash the control plane, and any runtime that can call HTTP can be hired.
Consequence: SLAW has no concept of the agent's internal state during a run. Cost and result data flow back via the adapter result and the agent's own API calls.
S-2 — Squad-scoped data isolation
Decision: every entity (agent, issue, project, routine, skill, cost event) belongs to exactly one squad, enforced at the database layer. Queries are always squad-keyed; there is no global entity namespace.
Why: multi-squad is a first-class use case — one instance runs multiple squads with complete isolation. Cross-squad leaks (a rogue agent reading another squad's tasks) are prevented by construction, not policy.
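As a rough illustration of "squad-keyed by construction", the sketch below models a repository whose every read requires a squad id. It is an in-memory stand-in, not the database-layer enforcement described above; the `Issue` shape and the `SquadScopedIssues` class are hypothetical names introduced only for this example.

```typescript
// Hypothetical shapes; the real schema lives in packages/db.
interface Issue {
  id: string;
  squadId: string;
  title: string;
}

// Every read takes a squadId; there is deliberately no "list everything" method.
class SquadScopedIssues {
  private readonly rows: Issue[] = [];

  insert(issue: Issue): void {
    this.rows.push(issue);
  }

  list(squadId: string): Issue[] {
    return this.rows.filter((row) => row.squadId === squadId);
  }

  // An id alone is not enough: the squad key must match too.
  get(squadId: string, id: string): Issue | undefined {
    return this.rows.find((row) => row.squadId === squadId && row.id === id);
  }
}

const issues = new SquadScopedIssues();
issues.insert({ id: "i-1", squadId: "squad-a", title: "Fix login" });
issues.insert({ id: "i-2", squadId: "squad-b", title: "Ship docs" });
```

With this shape, `issues.get("squad-b", "i-1")` returns `undefined` even though `i-1` exists, because the lookup is keyed on the squad as well as the id.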
S-3 — Single-assignee atomic checkout
Decision: moving an issue to in_progress requires an explicit POST /api/issues/:id/checkout that checks and sets the assignee atomically. Simultaneous claims return 409 Conflict. An agent that receives 409 must stop and pick different work.
Why: without an atomic ownership gate, two agents could concurrently start the same task and produce conflicting output. The checkout is the "grab the physical card" primitive.
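A minimal sketch of the check-and-set behaviour, assuming an in-memory table instead of the real database. The `IssueRow` shape, the `checkout` helper, and the 404 branch are assumptions made for illustration; only the 409 on a competing claim comes from the decision above.

```typescript
type CheckoutResult =
  | { status: 200; assigneeId: string }
  | { status: 404 }
  | { status: 409; error: "conflict" };

// Hypothetical row shape standing in for the issues table.
interface IssueRow {
  id: string;
  status: "todo" | "in_progress";
  assigneeId: string | null;
}

const issueTable = new Map<string, IssueRow>();

// The check and the set happen in one synchronous step, so no other claim can
// interleave. Against SQL, a conditional UPDATE ... WHERE assignee_id IS NULL
// plus an affected-row check would be one way to get the same effect.
function checkout(issueId: string, agentId: string): CheckoutResult {
  const row = issueTable.get(issueId);
  if (!row) return { status: 404 };
  if (row.assigneeId !== null) return { status: 409, error: "conflict" };
  row.assigneeId = agentId;
  row.status = "in_progress";
  return { status: 200, assigneeId: agentId };
}

issueTable.set("i-1", { id: "i-1", status: "todo", assigneeId: null });
const first = checkout("i-1", "agent-a"); // claims the issue
const second = checkout("i-1", "agent-b"); // loses the race and must pick other work
```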
S-4 — Adapter-agnostic execution
Decision: every agent has an adapterType + adapterConfig blob. The adapter's execute() function is the only interface between SLAW and the agent runtime. Adapters are packages (packages/adapters/*) with three modules: server (execution), UI (run viewer parser + config form), CLI (terminal formatter).
Built-in adapters: claude_local, codex_local, gemini_local, opencode_local, cursor, pi_local, hermes_local, process, http.
External adapters (in progress): the adapter registry on both server and UI is being converted to a mutable map (registerServerAdapter / registerUIAdapter) so plugins can contribute adapters at startup. See adapter-plugin.md and S-9.
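To make the "execute() is the only interface" idea concrete, here is a small sketch. The `ServerAdapter` and `AdapterResult` shapes, the `echo` adapter, and the `runAgent` helper are all hypothetical; the real contract lives in packages/adapter-utils and may differ.

```typescript
// Hypothetical result and adapter shapes, for illustration only.
interface AdapterResult {
  ok: boolean;
  summary: string;
  costCents?: number;
}

interface ServerAdapter {
  type: string;
  execute(config: Record<string, unknown>, prompt: string): Promise<AdapterResult>;
}

// A toy adapter that echoes the prompt instead of launching a runtime.
const echoAdapter: ServerAdapter = {
  type: "echo",
  async execute(config, prompt) {
    const prefix = typeof config["prefix"] === "string" ? config["prefix"] : "";
    return { ok: true, summary: `${prefix}${prompt}` };
  },
};

// The control plane needs only adapterType + adapterConfig to dispatch a run.
async function runAgent(
  agent: { adapterType: string; adapterConfig: Record<string, unknown> },
  adapters: ServerAdapter[],
  prompt: string,
): Promise<AdapterResult> {
  const adapter = adapters.find((candidate) => candidate.type === agent.adapterType);
  if (!adapter) throw new Error(`Unknown adapter type: ${agent.adapterType}`);
  return adapter.execute(agent.adapterConfig, prompt);
}
```

Because the dispatcher never looks inside `adapterConfig`, a new runtime only has to supply its own `execute()`.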
S-5 — Embedded PostgreSQL by default
Decision: SLAW ships PGlite (embedded Postgres) as the zero-config default. An external PostgreSQL 17 is available for production deployments. The same Drizzle schema and migration chain run on both.
Why: "clone and run" onboarding (npx slaw onboard --yes && npx slaw run) must not require a database. PGlite makes the dev and single-user experience friction-free; teams that need durability opt into external Postgres.
S-6 — Hand-authored migrations (drizzle-kit disabled)
Decision: drizzle-kit generate is disabled in this repo. Migrations are hand-authored idempotent SQL files named packages/db/src/migrations/00NN_<slug>.sql, accompanied by a journal entry appended to _journal.json. The custom applier applyPendingMigrationsManually runs them in order.
Why: drizzle-kit's snapshot mechanism drifted when the Paperclip workspace_* tables were removed, causing it to emit a spurious 200-line migration on every invocation. Rather than fight the snapshot, the team adopted the hand-authored pattern already established for several migrations (0094–0097). Each migration is written as CREATE TABLE IF NOT EXISTS / ALTER TABLE IF EXISTS / UPDATE … WHERE …, so it is idempotent by construction.
Pattern for new migrations:
- Write packages/db/src/migrations/00NN_<slug>.sql as idempotent SQL.
- Append an entry to _journal.json: { "idx": NN, "version": "7", "tag": "00NN_<slug>", "breakpoints": true }.
- Update the Drizzle schema file(s) to match.
- Run the server once to confirm the migration applies cleanly.
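The naming and journal steps above could be sketched as a pair of helpers. This is illustrative only: `newMigration` and `appendEntry` are hypothetical names, the four-field entry shape is taken from the pattern above, and the "next idx must be last + 1" check is an assumption rather than documented behaviour of the applier.

```typescript
interface JournalEntry {
  idx: number;
  version: string;
  tag: string;
  breakpoints: boolean;
}

// Builds the file path and journal entry for migration number `idx`.
function newMigration(idx: number, slug: string): { file: string; entry: JournalEntry } {
  const tag = `${String(idx).padStart(4, "0")}_${slug}`;
  return {
    file: `packages/db/src/migrations/${tag}.sql`,
    entry: { idx, version: "7", tag, breakpoints: true },
  };
}

// Returns a new journal with the entry appended, rejecting gaps and duplicates
// (assumed rule: entries are strictly consecutive).
function appendEntry(journal: JournalEntry[], entry: JournalEntry): JournalEntry[] {
  const last = journal[journal.length - 1];
  if (last && entry.idx !== last.idx + 1) {
    throw new Error(`Expected idx ${last.idx + 1}, got ${entry.idx}`);
  }
  return [...journal, entry];
}

// Example with a made-up migration number and slug.
const example = newMigration(98, "add_example");
```

Here `example.entry.tag` is `0098_add_example` and `example.file` is `packages/db/src/migrations/0098_add_example.sql`.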
S-7 — "Operator" replaces "Board"
Status: Delivered 2026-06-07 (migration 0097_board_to_operator, all packages clean).
Decision: the instance-level human governance actor is called Operator, not "Board". The UI kanban task board deliberately keeps the word "Board" (it is a literal board, not a governance role).
Rationale: "Board" was inherited from Paperclip's multi-company model where a Board governed many Companies. SLAW's enterprise squad model has no multi-company governor; the per-instance full-control human actor is the Operator. Every tier now has a distinct noun: Botfather → Instance → Operator → Squad → Squad Lead → Agents.
Key changes:
- Actor type "board" → "operator"; source "board_key" → "operator_key".
- Local actor id local-board → local-operator (rekeyed in 0097).
- DB table board_api_keys → operator_api_keys; token prefix pcp_board_* → slaw_op_*.
- Auth service, CLI auth, UI routes, and copy all renamed. See DESIGN-board-to-operator.md for the complete rename map.
Upgrade note for existing dev instances: migration 0097 does structural renames only. An existing dev DB should be wiped and re-seeded (WIPE-AND-RESEED.sh) rather than migrated in place, because local-board user-id values are stored across ~40 columns and partial rekeys left the profile page broken in testing. Fresh instances seed directly as local-operator.
S-8 — Discipline leads replace C-suite roles
Status: Plan (not yet built) — DESIGN-squad-lead-leads-based-instructions.md.
Decision: rename the first-tier agent roles from C-suite (cto, cmo, cfo) to discipline leads (engineering_lead, marketing_lead, finance_lead). Rewrite the Squad Lead's default AGENTS.md to route work by discipline (Engineering Lead, Design Lead, Product Lead, Marketing Lead, QA Lead) rather than to a fixed C-suite org chart.
Why: the Squad model uses "Squad Lead", not "CEO". The first agent auto-created by the Squad Lead onboarding used CEO/CTO/CMO terminology inherited from Paperclip, which now conflicts with the language the rest of the product uses. The role column is plain text (S-10), so renaming is a simple UPDATE + constants change — no Postgres enum migration.
Note: SOUL.md (the persona file) is intentionally left untouched this pass.
S-9 — Dynamic adapter registry
Status: In progress (feat/external-adapter-phase1).
Decision: convert the server and UI adapter registries from static enum-backed maps to mutable registries (registerServerAdapter / registerUIAdapter / unregisterServerAdapter / unregisterUIAdapter). Runtime validation moves to the server routes (assertKnownAdapterType), so the shared schema accepts any non-empty string and the server is the source of truth.
Why: third-party and plugin-contributed adapters must be able to register at startup without forking the core codebase. The current enum-backed approach requires SLAW to enumerate every adapter at build time.
Scope of phase 1: mutable server + UI registries; AgentConfigForm and NewAgent derive their adapter lists from listUIAdapters(); shared validators accept open-ended input. Surfaces not yet touched: NewAgentDialog, OnboardingWizard, InviteLanding, plugin manifest loader.
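A minimal sketch of what a mutable server registry might look like. The function names `registerServerAdapter`, `unregisterServerAdapter`, and `assertKnownAdapterType` come from the decision above, but their signatures here are assumptions, and `ServerAdapterModule` is a simplified, hypothetical shape.

```typescript
// Simplified, hypothetical adapter shape for the sketch.
interface ServerAdapterModule {
  type: string;
  execute(prompt: string): string;
}

const serverAdapters = new Map<string, ServerAdapterModule>();

function registerServerAdapter(adapter: ServerAdapterModule): void {
  serverAdapters.set(adapter.type, adapter);
}

function unregisterServerAdapter(type: string): void {
  serverAdapters.delete(type);
}

// Route-level validation: the shared schema accepts any non-empty string,
// and the server decides whether that string names a registered adapter.
function assertKnownAdapterType(type: string): void {
  if (type.trim() === "" || !serverAdapters.has(type)) {
    throw new Error(`Unknown adapter type: "${type}"`);
  }
}

// A plugin contributes an adapter at startup (hypothetical plugin adapter).
registerServerAdapter({ type: "acme_remote", execute: (prompt) => `acme:${prompt}` });
```

After registration, `assertKnownAdapterType("acme_remote")` passes; once `unregisterServerAdapter("acme_remote")` runs, the same call throws.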
S-10 — Role column is plain text, not a Postgres enum
Decision: agents.role is text NOT NULL DEFAULT 'general', not a Postgres ENUM type. Application-layer validation uses the AGENT_ROLES array in packages/shared/src/constants.ts.
Why: renaming a Postgres enum value requires ALTER TYPE … RENAME VALUE, a DDL statement that takes a lock. With a plain-text column, renaming a value is a safe UPDATE agents SET role = 'new_value' WHERE role = 'old_value', the pattern used in S-8. There are no foreign key relationships keyed on the role value, so renames have zero cross-table impact.
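The application-layer side of this could look roughly like the sketch below. The contents of `AGENT_ROLES` shown here are illustrative (only 'general' is confirmed as the column default); `isAgentRole` and `renameRole` are hypothetical helpers, with `renameRole` mimicking the UPDATE on an in-memory array.

```typescript
// Illustrative values; the real list lives in packages/shared/src/constants.ts.
const AGENT_ROLES = ["general", "engineering_lead", "marketing_lead", "finance_lead"] as const;
type AgentRole = (typeof AGENT_ROLES)[number];

// Validation happens in the application, not in a Postgres enum.
function isAgentRole(value: string): value is AgentRole {
  return (AGENT_ROLES as readonly string[]).includes(value);
}

// In-memory equivalent of:
//   UPDATE agents SET role = newValue WHERE role = oldValue
// Returns the number of rows changed.
function renameRole(
  agents: { id: string; role: string }[],
  oldValue: string,
  newValue: string,
): number {
  let updated = 0;
  for (const agent of agents) {
    if (agent.role === oldValue) {
      agent.role = newValue;
      updated += 1;
    }
  }
  return updated;
}

const agentRows = [
  { id: "a-1", role: "cto" },
  { id: "a-2", role: "general" },
];
const changed = renameRole(agentRows, "cto", "engineering_lead");
```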
Botfather
Decision inventory
| # | Decision | Status | Source |
|---|---|---|---|
| B-1 | Sovereignty-first: push not pull, metadata not content | Foundational | ARCHITECTURE.md §1 |
| B-2 | Auto-enrollment + approval queue (no end-user tokens) | Delivered 2026-06-06 | ARCHITECTURE.md §6 |
| B-3 | Fail-open for enrolled, fail-closed for unenrolled | Delivered 2026-06-06 | ARCHITECTURE.md §6.4 |
| B-4 | Identity on (machineId, instanceId) until EntraID | Foundational | ARCHITECTURE.md §3 |
| B-5 | At-least-once sync with cursor deduplication | Foundational | ARCHITECTURE.md §4.3 |
| B-6 | Centrally-governed budget limits via back-channel | Delivered 2026-06-06 | DESIGN-budget-limits.md |
| B-7 | Tower-mastered skill registry | Delivered 2026-06-07 | DESIGN-skill-registry.md |
| B-8 | One binary, one Postgres (mirror SLAW stack) | Foundational | ARCHITECTURE.md §1 |
| B-9 | Tower-authoritative config flows down, not up | Foundational | ARCHITECTURE.md §10 |
B-1 — Sovereignty-first: push not pull, metadata not content
Decision: SLAW instances must remain fully functional when Botfather is unreachable. All communication is initiated outbound by the instance over HTTPS — no inbound ports on desktops. The tower receives metadata and metrics only: names, statuses, counts, costs. Issue bodies, comments, agent configs, secrets, run logs, and code never leave the instance.
The sovereignty boundary is the product headline. What is synced: squad names/status, agent titles/roles/status, project names, issue titles + status (opt-out via reportIssueTitles: false), cost events. What never leaves: agent adapter config, secrets, issue bodies/comments, run output.
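One way to picture the boundary is as a projection that copies only whitelisted metadata into the outgoing fact. The `LocalIssue` and `IssueFact` shapes and the `toIssueFact` helper are hypothetical; only the `reportIssueTitles` option name comes from the text above.

```typescript
// Hypothetical local record: includes content that must never be synced.
interface LocalIssue {
  id: string;
  title: string;
  status: string;
  body: string;
  comments: string[];
}

// Hypothetical wire shape: metadata only.
interface IssueFact {
  localId: string;
  status: string;
  title?: string;
}

// Builds the fact field by field, so nothing is copied unless listed here.
function toIssueFact(issue: LocalIssue, opts: { reportIssueTitles: boolean }): IssueFact {
  const fact: IssueFact = { localId: issue.id, status: issue.status };
  if (opts.reportIssueTitles) {
    fact.title = issue.title;
  }
  return fact;
}

const localIssue: LocalIssue = {
  id: "i-1",
  title: "Rotate signing keys",
  status: "in_progress",
  body: "Sensitive details stay on the instance.",
  comments: ["internal note"],
};
const withTitle = toIssueFact(localIssue, { reportIssueTitles: true });
const withoutTitle = toIssueFact(localIssue, { reportIssueTitles: false });
```

Building the payload by explicit projection, rather than deleting fields from a copy, means a newly added local column stays private until someone deliberately adds it to the fact.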
B-2 — Auto-enrollment + approval queue (no end-user tokens)
Status: Delivered 2026-06-06, replacing an earlier enrollment-token design.
Decision: on first start with a botfather.url configured, the instance self-enrolls by POST /api/ingest/v1/enroll with its identity (no token). The instance lands in pending in the tower's Approval Queue. An admin approves (manually, or via auto-approve rules matching hostname/machineId patterns). On approval, the tower issues a per-instance API key the instance stores in its secrets store. All subsequent calls use Bearer auth.
Why no end-user tokens: handing tokens to every end user creates a credential management problem at fleet scale. Auto-enrollment with admin-controlled approval gives the same security properties with zero user friction.
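The approval decision could be sketched as follows. All type and function names here are hypothetical, the rule semantics (every defined pattern must match, an empty rule matches nothing) are an assumption, and key issuance is passed in as a callback rather than implemented.

```typescript
interface EnrollRequest {
  machineId: string;
  instanceId: string;
  hostname: string;
}

interface AutoApproveRule {
  hostnamePattern?: RegExp;
  machineIdPattern?: RegExp;
}

type Enrollment =
  | { status: "pending" }
  | { status: "approved"; apiKey: string };

// A rule matches when every pattern it defines matches; an empty rule matches nothing.
function ruleMatches(rule: AutoApproveRule, req: EnrollRequest): boolean {
  const checks: boolean[] = [];
  if (rule.hostnamePattern) checks.push(rule.hostnamePattern.test(req.hostname));
  if (rule.machineIdPattern) checks.push(rule.machineIdPattern.test(req.machineId));
  return checks.length > 0 && checks.every(Boolean);
}

// Auto-approve when any rule matches; otherwise the request waits in the queue.
function enroll(
  req: EnrollRequest,
  rules: AutoApproveRule[],
  issueApiKey: () => string,
): Enrollment {
  return rules.some((rule) => ruleMatches(rule, req))
    ? { status: "approved", apiKey: issueApiKey() }
    : { status: "pending" };
}

const rules: AutoApproveRule[] = [{ hostnamePattern: /\.corp\.example$/ }];
const managed = enroll(
  { machineId: "m-1", instanceId: "default", hostname: "build-01.corp.example" },
  rules,
  () => "example-key",
);
const unknownHost = enroll(
  { machineId: "m-2", instanceId: "default", hostname: "laptop.home" },
  rules,
  () => "example-key",
);
```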
B-3 — Fail-open for enrolled, fail-closed for unenrolled
Decision: two enforcement modes, set via botfather.enforcement in instance config (delivered by IT/MDM):
- enforce (default): a never-enrolled instance stays gated at the startup UI and keeps retrying until approved. An already-enrolled instance with a valid cached API key is allowed to run (fail-open for the enrolled) and keeps spooling until the tower returns. A working machine is not held hostage by tower downtime; a new machine cannot bypass the tower by pulling the network cable.
- advisory: the startup gate is dismissible; the instance reports best-effort. For pilots.
Why: enterprise policy requires that new endpoints be admitted to the fleet before they can run. But forcing enrolled instances offline during tower maintenance would create unacceptable disruption for the teams using them.
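The two modes reduce to a small decision function. This is a sketch under assumptions: `startupGate`, the `GateDecision` values, and the input shape are hypothetical names, and "valid cached API key" is collapsed into a single boolean.

```typescript
type Enforcement = "enforce" | "advisory";
type GateDecision = "run" | "gated" | "gated_dismissible";

// Tower reachability is deliberately not an input: an enrolled instance runs
// whether or not the tower is up, and an unenrolled one cannot get past the
// gate in enforce mode by going offline.
function startupGate(input: {
  enforcement: Enforcement;
  hasCachedApiKey: boolean;
}): GateDecision {
  if (input.hasCachedApiKey) return "run"; // fail-open for the enrolled
  return input.enforcement === "enforce" ? "gated" : "gated_dismissible";
}

const enrolledMachine = startupGate({ enforcement: "enforce", hasCachedApiKey: true });
const newMachine = startupGate({ enforcement: "enforce", hasCachedApiKey: false });
const pilotMachine = startupGate({ enforcement: "advisory", hasCachedApiKey: false });
```

In this sketch `enrolledMachine` is `"run"`, `newMachine` is `"gated"`, and `pilotMachine` is `"gated_dismissible"`.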
B-4 — Identity on (machineId, instanceId) until EntraID
Decision: the primary identity tuple is (machineId, instanceId). machineId is derived once per machine from the OS hardware ID (hashed with an app salt) and stored at ~/.slaw/machine.json — shared across all instances on one box. instanceId is the existing SLAW_INSTANCE_ID value (default "default"). Child entities are keyed as (machineId, instanceId, localId).
Future path: a nullable userPrincipal column is reserved on the instances table from day one. When EntraID arrives, the enrollment exchange will accept an Entra token, auto-approval will key off Entra group membership, and the UI will group by person. No schema migration is needed for the EntraID upgrade.
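A hedged sketch of the identity tuple. The hash algorithm (SHA-256), the salt value, the `salt:hardwareId` input format, and both helper names are assumptions made for illustration; the source only states that the hardware ID is hashed with an app salt.

```typescript
import { createHash } from "node:crypto";

// Hypothetical salt; the real value is not documented here.
const APP_SALT = "slaw-example-salt";

// Derives a stable, non-reversible machine id from the OS hardware id.
// SHA-256 is an assumed choice of hash.
function deriveMachineId(hardwareId: string, salt: string = APP_SALT): string {
  return createHash("sha256").update(`${salt}:${hardwareId}`).digest("hex");
}

// Child entities are keyed on (machineId, instanceId, localId). JSON encoding
// the tuple avoids ambiguity if an id happens to contain a separator.
function entityKey(machineId: string, instanceId: string, localId: string): string {
  return JSON.stringify([machineId, instanceId, localId]);
}

const machineId = deriveMachineId("HW-1234");
const issueKey = entityKey(machineId, "default", "issue-42");
```

The same hardware ID always yields the same `machineId`, which is what lets every instance on one box share it.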
B-5 — At-least-once sync with cursor deduplication
Decision: the instance tracks a botfather_sync_state cursor table (per-entity updatedAt / max-id high-water mark). Fact events are read strictly above the last-acknowledged cursor. The sync response acknowledges each batch with the new cursor; the instance only advances its cursor on ack. Botfather deduplicates facts on (machineId, instanceId, localId), yielding effective exactly-once delivery despite at-least-once transport.
Why not exactly-once at the transport layer: HTTP over internet paths drops and retries. Exactly-once transport requires coordination that adds latency and complexity. Dedup on a stable natural key is cheaper and correct.
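The interplay of cursor and dedup can be shown with a lost acknowledgement. This is a simplified sketch: a single numeric `seq` stands in for the per-entity updatedAt / max-id high-water mark, and `TowerStore`, `InstanceOutbox`, and the `Fact` shape are hypothetical.

```typescript
// Hypothetical fact: a stable localId plus a monotonically increasing seq.
interface Fact {
  localId: string;
  seq: number;
  payload: string;
}

class TowerStore {
  private readonly facts = new Map<string, Fact>();

  // Upserts each fact on its natural key and returns the highest seq as the ack.
  ingest(machineId: string, instanceId: string, batch: Fact[]): number {
    let ackCursor = 0;
    for (const fact of batch) {
      const key = JSON.stringify([machineId, instanceId, fact.localId]);
      this.facts.set(key, fact); // a replayed fact overwrites, never duplicates
      ackCursor = Math.max(ackCursor, fact.seq);
    }
    return ackCursor;
  }

  get size(): number {
    return this.facts.size;
  }
}

class InstanceOutbox {
  private cursor = 0;
  private readonly facts: Fact[];

  constructor(facts: Fact[]) {
    this.facts = facts;
  }

  // Facts strictly above the last acknowledged cursor.
  pending(): Fact[] {
    return this.facts.filter((fact) => fact.seq > this.cursor);
  }

  // The cursor only moves when the tower acknowledges a batch.
  acknowledge(ackCursor: number): void {
    this.cursor = Math.max(this.cursor, ackCursor);
  }
}

const tower = new TowerStore();
const outbox = new InstanceOutbox([
  { localId: "cost-1", seq: 1, payload: "a" },
  { localId: "cost-2", seq: 2, payload: "b" },
]);

tower.ingest("m-1", "default", outbox.pending()); // delivered, but the ack is lost
const ack = tower.ingest("m-1", "default", outbox.pending()); // retry resends both facts
outbox.acknowledge(ack);
```

Both facts were transmitted twice, yet the tower holds two rows and the outbox has nothing pending once the ack lands.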
B-6 — Centrally-governed budget limits via back-channel
Status: Delivered 2026-06-06. See DESIGN-budget-limits.md for the complete protocol and implementation record.
Decision: Botfather governs cost (cents) and token ceilings fleet-wide. An enterprise-default limit is set once and propagates to all instances; per-instance overrides are available. The resolved limit rides the heartbeat/sync response directives array as a {kind:"set_limits", limit} directive — the existing push-only back-channel. No inbound connection is opened.
Key design points:
- Tower caps; local can be stricter. The tower limit is an additional ceiling on top of existing squad/agent budgets. Local budgets are never weakened.
- Metric is plan-aware. Cost metric for metered (API billing); token metric for subscription. One LimitSpec carries both ceilings; the enforcer picks the relevant one by billingType.
- Enforcement modes: off | soft (warn only) | hard (block runs at ceiling).
- SLAW-side: a new instance_limits singleton table stores the last-applied limit. getInvocationBlock() gains an instance-wide gate (all ~9 call sites inherit it). evaluateCostEvent() warns at warnPercent and pauses at ceiling when hard.
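The design points above could be sketched as a single evaluation function. `LimitSpec`, `billingType`, and `warnPercent` are names from the text; the field names inside `LimitSpec`, the `Usage` shape, the `Verdict` values, and `evaluateLimit` itself are assumptions for illustration, not the shipped implementation.

```typescript
type BillingType = "metered" | "subscription";
type EnforcementMode = "off" | "soft" | "hard";
type Verdict = "ok" | "warn" | "block";

// One spec carries both ceilings (field names are hypothetical).
interface LimitSpec {
  mode: EnforcementMode;
  costCentsCeiling: number;
  tokenCeiling: number;
  warnPercent: number;
}

interface Usage {
  billingType: BillingType;
  costCents: number;
  tokens: number;
}

function evaluateLimit(limit: LimitSpec, usage: Usage): Verdict {
  if (limit.mode === "off") return "ok";
  // Plan-aware metric: cost for metered billing, tokens for subscription.
  const metered = usage.billingType === "metered";
  const ceiling = metered ? limit.costCentsCeiling : limit.tokenCeiling;
  const used = metered ? usage.costCents : usage.tokens;
  if (used >= ceiling) return limit.mode === "hard" ? "block" : "warn";
  if (used >= (ceiling * limit.warnPercent) / 100) return "warn";
  return "ok";
}

const towerLimit: LimitSpec = {
  mode: "hard",
  costCentsCeiling: 10_000,
  tokenCeiling: 1_000_000,
  warnPercent: 80,
};
```

With `towerLimit`, a metered instance at 8,000 cents gets `"warn"` and at 10,000 cents gets `"block"`; the same spec in `soft` mode would only ever warn.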
B-7 — Tower-mastered skill registry
Status: Delivered 2026-06-07. See DESIGN-skill-registry.md for the complete record.
Decision: Botfather is the single source of truth for skills across the fleet. Skills are authored and published in the tower's Skill Registry (skill_library table: key, name, description, markdown body, monotonic version, draft/publish/deprecate lifecycle). Connected instances pull the catalog via GET /api/ingest/v1/skills and install chosen skills onto local squads. The skills_updated directive in the heartbeat/sync back-channel hints that the catalog has advanced; the instance pulls on its own schedule.
Sovereignty constraint: the skill body flows tower → instance (governed content). Squad composition — which agents exist, what they do, who reports to whom — stays entirely local. Botfather has no opinion about squad structure.
Local authoring lock: when an instance is connected and enrolled, local skill create/import/edit routes return 409 skills_managed_by_tower. Standalone instances (no botfather.url) keep local authoring. Tower-managed skills display a "Control tower · vN" badge and are read-only.
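The authoring lock amounts to a small guard in front of the local skill write routes. The sketch below uses hypothetical names (`InstanceState`, `checkLocalSkillWrite`) and treats "connected" as "a tower URL is configured"; only the 409 status and the `skills_managed_by_tower` code come from the text above.

```typescript
interface InstanceState {
  botfatherUrl: string | null; // null means standalone (no botfather.url)
  enrolled: boolean;
}

type SkillWriteDecision =
  | { allowed: true }
  | { allowed: false; status: 409; code: "skills_managed_by_tower" };

// Local create/import/edit is refused only when the instance is both
// connected to a tower and enrolled with it.
function checkLocalSkillWrite(state: InstanceState): SkillWriteDecision {
  const towerManaged = state.botfatherUrl !== null && state.enrolled;
  return towerManaged
    ? { allowed: false, status: 409, code: "skills_managed_by_tower" }
    : { allowed: true };
}

const standalone = checkLocalSkillWrite({ botfatherUrl: null, enrolled: false });
const managedInstance = checkLocalSkillWrite({
  botfatherUrl: "https://tower.example",
  enrolled: true,
});
```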
B-8 — One binary, one Postgres (mirror SLAW stack)
Decision: Botfather reuses the SLAW technology stack (Node.js + Express + React + Vite + PostgreSQL + Drizzle + pnpm monorepo) rather than introducing a separate backend or cloud service.
Why: Botfather is self-hosted enterprise infrastructure. Operators who already run SLAW understand the stack. Reusing it means Botfather can share packages/shared code, follows the same migration discipline, and requires no new operational knowledge.
B-9 — Tower-authoritative config flows down, not up
Decision: all data normally flows instance → tower (squads, agents, issues, cost facts). Budget limits (B-6) and skill catalog (B-7) are explicit exceptions: they are tower-authoritative config that flows tower → instance. No instance can write to enterprise_limits, instance_limit_overrides, or skill_library — these are admin-only tables on the tower.
Why explicitly called out: the push-only, instance-initiated model is a security principle. Every tower-authoritative config item must be a justified exception: it must be a typed, bounded directive (never arbitrary code), transmitted through the existing response back-channel, and version-de-duped by the instance. New "tower pushes to instance" features must follow this pattern.
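A typed, bounded directive can be modelled as a discriminated union, with version de-duplication on the instance side. The `set_limits` and `skills_updated` kinds appear in this document; the `version` field, the payload shape, and the `DirectiveApplier` class are assumptions for the sketch.

```typescript
// Closed union: anything outside these kinds cannot be expressed, let alone executed.
type Directive =
  | { kind: "set_limits"; version: number; limit: { costCentsCeiling: number } }
  | { kind: "skills_updated"; version: number };

class DirectiveApplier {
  private readonly lastApplied = new Map<Directive["kind"], number>();
  readonly log: string[] = [];

  // Applies a directive once per version; returns false for replays.
  apply(directive: Directive): boolean {
    const last = this.lastApplied.get(directive.kind) ?? -1;
    if (directive.version <= last) return false;
    switch (directive.kind) {
      case "set_limits":
        this.log.push(`limits:${directive.limit.costCentsCeiling}`);
        break;
      case "skills_updated":
        this.log.push("skills:refresh");
        break;
    }
    this.lastApplied.set(directive.kind, directive.version);
    return true;
  }
}

const applier = new DirectiveApplier();
const appliedFirst = applier.apply({ kind: "set_limits", version: 1, limit: { costCentsCeiling: 5000 } });
const appliedReplay = applier.apply({ kind: "set_limits", version: 1, limit: { costCentsCeiling: 5000 } });
```

Because directives ride every heartbeat/sync response, the same one may arrive many times; tracking the last applied version per kind makes that harmless.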
Cross-cutting
Stack and toolchain choices
| Concern | Choice | Notes |
|---|---|---|
| Package manager | pnpm 9 workspaces | Pin via packageManager field; corepack enable to activate |
| ORM | Drizzle | Schema as code; migrations are hand-authored (see S-6) |
| UI framework | React 19, Vite 6, TanStack Query, Radix UI, Tailwind CSS 4 | |
| Auth | Better Auth | Sessions + API keys |
| DB default | PGlite (embedded) | PostgreSQL 17 for production |
| Wire format | JSON over HTTPS, gzipped for ingest payloads | |
| Protocol versioning | Versioned envelope (protocolVersion); tower accepts N and N−1 | |
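The "N and N−1" rule from the protocol versioning row can be sketched as a single check. The constant value and the function name are hypothetical; only the acceptance window comes from the table.

```typescript
// Hypothetical current version, for illustration only.
const CURRENT_PROTOCOL_VERSION = 3;

// The tower accepts its own version N and the previous version N-1.
function acceptsProtocolVersion(
  protocolVersion: number,
  current: number = CURRENT_PROTOCOL_VERSION,
): boolean {
  return (
    Number.isInteger(protocolVersion) &&
    (protocolVersion === current || protocolVersion === current - 1)
  );
}
```

Under these assumptions, envelopes at versions 3 and 2 are accepted, while 1 and 4 are rejected.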
Migration discipline
Both repos hand-author migrations. The pattern:
- Write an idempotent .sql file in packages/db/src/migrations/.
- Append a journal entry to _journal.json.
- Apply via applyPendingMigrationsManually (SLAW) or drizzle-kit migrate (Botfather, which doesn't have the snapshot drift issue).
- Never edit a shipped migration; write a new forward-only one instead.
Next steps
- Contributing — repo setup, build and test workflow, PR norms.
- How SLAW Works — product-facing architecture overview.
- Security and Sovereignty — the sovereignty boundary in detail.