This is the operator’s bench manual — the page you keep open in a tab when something is on fire and you need the exact command. It assumes you have `aas` (the ctrl CLI) on your laptop, the per-instance SSH keys in `data/ssh_keys/`, and Tailscale up.
## Pre-flight: shape of the system
One SaaS host, many tenant VPSes, all tied together by a Tailscale tailnet.

| Plane | What runs there | How you reach it |
|---|---|---|
| SaaS host | Wasp app at alfred.black, ctrl-api (SaaS-side), Caddy, Plausible, ClickHouse | SSH (key in 1Password) + browser |
| Tenant VPS | Hetzner cx53, ~12 long-running containers (mid-30s with sidecars), per-instance LUKS volume | `aas ssh <name>`, `aas run <name> '<cmd>'`, or via Tailscale `https://<tailscale-hostname>:8233` |
Two timers keep every box current and backed up:

- `alfred-update.timer` — fires every 15 minutes, runs `docker compose pull && docker compose up -d`. New images on DockerHub roll out fleet-wide within the quarter-hour. You almost never need to push images by hand.
- `alfred-backup.timer` — fires daily at 03:00 host-local. Runs `restic backup /mnt/encrypted` (LUKS volume, mounted) to Hetzner Object Storage with the Restic password from `/opt/alfred/restic.env`. Retention is 7 daily / 4 weekly / 6 monthly.
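The backup side is worth being able to inspect by hand. A minimal sketch of the two restic commands implied by the retention policy above — the helper only prints them so you can review before running on the tenant box:

```shell
# Hedged sketch: inspect snapshots and apply the stated 7/4/6 retention.
# Commands are printed, not executed; run them verbatim on the tenant box.
backup_cmds() {
  echo "source /opt/alfred/restic.env && restic snapshots --latest 3"
  echo "source /opt/alfred/restic.env && restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune"
}
backup_cmds
```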
Everything else on the box binds to 127.0.0.1. Infrastructure has the full picture.
## Provisioning a new tenant
End-to-end provisioning takes 3 minutes from the golden snapshot, 18–20 minutes from a clean cloud-init build. Two paths:

### Path A — SaaS dashboard (preferred)

The Polar webhook auto-provisions a tenant when a subscriber checks out. To trigger by hand (manual sale, comp account, or replay after a failed checkout), open the SaaS admin page and use Provision New User — backed by the `instanceOperations` action that creates a ProvisioningJob and hands it to the SaaS-side ctrl runner.
### Path B — ctrl CLI
`--snapshot auto` resolves to the latest golden snapshot and shaves provisioning down to ~3 minutes. Without it, cloud-init runs from scratch.
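A hedged sketch of the two invocations — the `provision` subcommand name is an assumption (only `aas ssh`, `aas run`, `aas list`, and friends appear verbatim in this runbook):

```shell
# Hypothetical provisioning calls; `provision` as the subcommand name is an assumption.
provision_cmd() {
  if [ "$1" = "fast" ]; then
    echo "aas provision <name> --snapshot auto"   # ~3 min from the golden snapshot
  else
    echo "aas provision <name>"                   # 18-20 min clean cloud-init build
  fi
}
provision_cmd fast
provision_cmd slow
```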
The TUI’s Provision flow walks the same numbered steps as `provisioner.ts`:

1. Hetzner: SSH key + firewall + volume + server. A shared firewall labelled `managed-by: alfred-ctrl` allows only SSH and ICMP.
2. Cloud-init. LUKS2 format, fail2ban, UFW, Docker, Tailscale, `alfred-update.timer`, `alfred-backup.timer`.
3. Upload secrets via SSH. `.env` is SFTP’d post-boot — never via cloud-init `user_data` (readable through Hetzner metadata).
4. Render + start docker-compose. `init` scaffolds the vault, `openclaw` and `openclaw-workers` boot, then `alfred` and `alfred-learn`.

The job sits in `provisioning` until it completes, or flips to `error`. Re-running with the same name is safe — the provisioner is idempotent against existing Hetzner resources matching the label.
## Monitoring health
Three layers, used in this order:

### 1. Fleet view (laptop)

`aas list` gives the per-tenant status grid; in the TUI (`npm start`), the dashboard screen shows the same grid with green/yellow/red status pills.
### 2. Per-tenant API (from the SaaS host)
The combined dashboard endpoint returns containers + service health + vault stats in one shot:

- `GET /api/v1/admin/containers`
- `GET /api/v1/admin/system/info` (CPU/mem/disk)
- `GET /api/v1/workers/status`
- `GET /api/v1/workflows`
- `GET /api/v1/schedules`

See Monitoring for the full table.
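These can be hit over the tailnet from the SaaS host. A sketch, assuming the API is served on the tenant’s Tailscale hostname at port 8233 (per the Pre-flight table); the bearer-token auth header is an assumption, not confirmed by this page:

```shell
# Hedged sketch: print the health-poll curls for one tenant.
# HOST and ADMIN_TOKEN are placeholders; the auth scheme is an assumption.
health_cmds() {
  HOST='https://<tailscale-hostname>:8233'
  for path in /api/v1/admin/containers /api/v1/admin/system/info /api/v1/workers/status; do
    echo "curl -fsS -H 'Authorization: Bearer \$ADMIN_TOKEN' $HOST$path"
  done
}
health_cmds
```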
### 3. On-box (SSH)

The Temporal Web UI is reachable over the tailnet (`https://<tailscale-hostname>:8233`). `aas temporal-ui <name> --open` opens it for you.
## Common interventions
Everything below assumes you’ve identified the tenant. `<name>` is the customer name in `aas list`.
### Restart a stuck workflow
`POST /api/v1/workflows/<wfId>/terminate`, then `POST /api/v1/schedules/<schId>/trigger`.
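The two calls above can be sketched as curls (printed for review; `<wfId>`/`<schId>` come from `GET /api/v1/workflows` and `GET /api/v1/schedules`, and the host/port assumption matches the Pre-flight table):

```shell
# Hedged sketch: terminate the stuck run, then re-trigger its schedule.
bounce_cmds() {
  HOST='https://<tailscale-hostname>:8233'
  echo "curl -fsS -X POST $HOST/api/v1/workflows/<wfId>/terminate"
  echo "curl -fsS -X POST $HOST/api/v1/schedules/<schId>/trigger"
}
bounce_cmds
```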
### Pause/unpause a schedule
`temporal schedule toggle --schedule-id <id> --pause` / `--unpause`.
### Restart a container
`POST /api/v1/admin/containers/:service/restart`. Restarting `openclaw` or `ctrl-api` from the API requires the `X-Confirm-Self-Restart: yes` header — this is the guardrail that keeps an agent from killing its own session.
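As curls (printed for review; host/port assumed as elsewhere on this page):

```shell
# Hedged sketch: plain restart vs. the guarded self-restart case.
restart_cmds() {
  HOST='https://<tailscale-hostname>:8233'
  echo "curl -fsS -X POST $HOST/api/v1/admin/containers/alfred-learn/restart"
  # openclaw / ctrl-api need the self-restart guard header:
  echo "curl -fsS -X POST -H 'X-Confirm-Self-Restart: yes' $HOST/api/v1/admin/containers/openclaw/restart"
}
restart_cmds
```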
### Force the 15-minute pull early
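The exact command for this step did not survive in this copy. A hedged reconstruction — the unit name `alfred-update.service` is inferred from the timer name, and the compose directory from the rollback section below; neither is confirmed here:

```shell
# Hedged sketch: two equivalent ways to pull early (printed for review).
force_pull_cmds() {
  echo "aas run <name> 'sudo systemctl start alfred-update.service'"
  echo "aas run <name> 'cd /opt/alfred/compose && docker compose pull && docker compose up -d'"
}
force_pull_cmds
```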
### Clear a sidecar cursor
When a stream pull cursor or sync cursor is stuck on a bad event and retrying it forever, drop the cursor file to force a full rescan. See Sidecar State for the full file inventory and atomic-rename pattern.

### Restart alfred-learn (chore reload, etc.)

The restart response reports `ready_after_seconds` so you know when it’s back.
## Failure modes
The recurring ones we’ve actually hit, with triage steps.

### Stream pull producing 13 MB results (Composio Gmail)

Composio’s `composio_pull` for Gmail with full HTML bodies has been observed to return >4 MB activity results, blowing past Temporal’s gRPC frame limit. Downstream effect: the activity completes on Composio’s side but Temporal can’t ingest the result, the worker retries and chokes, and `signal_extract` activities behind it time out.

Symptoms:

- `composio_pull` activities in `RetryState_Unspecified` for minutes
- `grpc: received message larger than max` in `alfred-learn` logs
- inbox count stops growing on a tenant whose other streams still flow
Remediation:

1. Pause the Gmail stream puller schedule.
2. Terminate any in-flight `StreamPullerWorkflow` for that stream.
3. Cap the result size at the activity boundary (the current workaround is per-tenant config; the longer-term fix lives in `packages/learn/src/activities/composio_pull.py`).
4. Unpause once the cap is in place.
### signal_extract not draining

Sidecar count flat for hours, `extract_failed` / `Activity task timed out` in logs.

Common causes:

- Composio pull starvation (above) — a hot stream is holding the worker.
- Grok timeout — the LLM call inside `signal_extract` exceeded its activity heartbeat budget.
- Worker activation lag — `alfred-learn` recently restarted and Temporal is replaying history.

If the task queue shows zero pollers, restart `alfred-learn`. If pollers are present but extract activities keep timing out, the upstream LLM is the culprit — switch the model in `openclaw.json` (via `PATCH /api/v1/admin/config/openclaw`) and let the next tick recover.

### Plane out of sync

Tasks created in the vault don’t show up in Plane, or vice versa. Triage: nudge the sync first (trigger its schedule). If a single nudge doesn’t unstick it, drop the cursor and force a full rescan. The next forward sync re-walks the vault and re-adopts existing Plane issues by hash. See Plane Integration for the cursor schema.
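The cursor drop itself is just a file delete (the atomic-rename pattern matters for writers, not removal). A sketch with a hypothetical cursor path — the real file inventory is on the Sidecar State page:

```shell
# Hedged sketch: drop the Plane sync cursor to force a full rescan.
# STATE_DIR/CURSOR paths are hypothetical; check Sidecar State for the real ones.
STATE_DIR="${STATE_DIR:-/tmp/alfred-cursor-demo}"
CURSOR="$STATE_DIR/plane_sync.cursor"
mkdir -p "$STATE_DIR"
printf 'evt_12345\n' > "$CURSOR"   # stand-in for a cursor pinned on a bad event
rm -f -- "$CURSOR"                 # no cursor file => next forward sync rescans from scratch
test ! -e "$CURSOR" && echo "cursor dropped"
```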
### Vexa bot not joining meetings

Sir books a Meet, the bot doesn’t show up. Three causes, in order of probability:

1. `VEXA_ENABLED=false` (default off — david-only flag). Check `/api/v1/admin/config/env`.
2. Google Calendar stream disabled or paused. Check the streams list — without a live gcal stream, `meeting_capture` has nothing to schedule from.
3. Sir’s Google Calendar OAuth token expired. The Connected Apps page on the dashboard shows the connection state.

Log lines to read: `meeting_capture.no_events lookahead=...` means it’s running but found nothing to dispatch. `meeting_capture.skip_malformed gcal=...` means events are coming through but missing a Meet link. `meeting_capture.dispatch_failed` means it tried and Vexa rejected — check the vexa-api-gateway logs.

### Workflow stuck in 'Running' forever
The classic non-determinism error: a recent code change renamed an activity, reordered logic, or changed a workflow signature without `workflow.patched()`. In-flight workflows started under the old code hit `NonDeterministicError` on replay and stall.

Symptoms:

- Workflow History stops advancing mid-run
- `alfred-learn` logs contain `NonDeterministicError`
- Schedules tied to that workflow start to back up

Terminate the stuck run; Temporal will spawn a fresh run from the schedule on the next tick. This was burned into us during the PR #628 incident (paginating `plane_sync.fetch_changed_tasks` renamed activities and rewrote logic without a patch). The protocol for safe workflow rewrites lives at `packages/learn/CLAUDE.md` — read it before merging anything that touches `packages/learn/src/workflows/**` or `packages/learn/src/activities/**`.

## Backups + recovery
`alfred-backup.timer` runs `restic backup /mnt/encrypted` daily at 03:00 host-local. Destination is Hetzner Object Storage; credentials are in `/opt/alfred/restic.env` on the box and mirrored to `data/ssh_keys/<id>/restic.env` on the operator’s laptop.
Full recovery onto a fresh server (check the ctrl TUI, `npm start`, for the current recovery flow before relying on a single command): provision a new box, unlock the volume with `data/ssh_keys/<old-id>/luks.key`, and restic-restore `/mnt/encrypted` from the most recent good snapshot. Even if Hetzner loses the disk entirely, the keyfile + restic snapshot is enough to reconstruct.
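A hedged sketch of that recovery path. The device path and mapper name are placeholders, and the commands are printed for review rather than executed:

```shell
# Hypothetical recovery commands: unlock LUKS with the saved keyfile, mount,
# restore the newest snapshot (snapshot paths already include /mnt/encrypted).
recover_cmds() {
  OLD_ID='<old-id>'
  echo "cryptsetup open /dev/sdX encrypted --key-file data/ssh_keys/$OLD_ID/luks.key"
  echo "mount /dev/mapper/encrypted /mnt/encrypted"
  echo "source data/ssh_keys/$OLD_ID/restic.env && restic restore latest --target /"
}
recover_cmds
```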
## Image rollback
Every healthy run records the current image digests. If a fresh `:latest` tag regresses, the ctrl CLI tracks the last-healthy SHA per tenant and can roll it back with `aas rollback`.
If `aas rollback` doesn’t fit the situation: SSH to the tenant, pin the image tag in `/opt/alfred/compose/.env` (e.g. `OPENCLAW_IMAGE_TAG=sha-<digest>`), and trigger an update.
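The pinning edit itself, sketched against a temp copy of the env file (on the tenant box the real file is `/opt/alfred/compose/.env`; the updater unit name at the end is inferred, not confirmed):

```shell
# Hedged sketch: pin OPENCLAW_IMAGE_TAG to a known-good digest.
ENV_FILE="$(mktemp)"                              # stand-in for /opt/alfred/compose/.env
echo 'OPENCLAW_IMAGE_TAG=latest' > "$ENV_FILE"
sed -i 's/^OPENCLAW_IMAGE_TAG=.*/OPENCLAW_IMAGE_TAG=sha-<digest>/' "$ENV_FILE"
cat "$ENV_FILE"
# then fire the updater early (unit name inferred): sudo systemctl start alfred-update.service
```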
## Cross-tenant audit (FleetAuditWorkflow)
Runs daily at 02:00 UTC. Walks every tenant’s stream metadata looking for cross-tenant contamination — events that landed in the wrong tenant’s vault, or connections that resolve to a different `COMPOSIO_USER_ID` than expected. Findings surface in the Daily Digest.
Disable per-tenant if it’s noisy or you’re investigating: set `FLEET_AUDIT_ENABLED=false` in `.env` (`PATCH /api/v1/admin/config/env`) and restart `alfred-learn`. The schedule registrar removes the `al-fleet-audit` schedule on next worker boot when the flag is off.
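As curls (printed for review; host/port assumed as elsewhere on this page, and the exact JSON shape the config endpoint expects is an assumption):

```shell
# Hedged sketch: flip the fleet-audit flag, then restart the worker.
audit_off_cmds() {
  HOST='https://<tailscale-hostname>:8233'
  echo "curl -fsS -X PATCH -H 'Content-Type: application/json' -d '{\"FLEET_AUDIT_ENABLED\":\"false\"}' $HOST/api/v1/admin/config/env"
  echo "curl -fsS -X POST $HOST/api/v1/admin/containers/alfred-learn/restart"
}
audit_off_cmds
```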
## When to escalate
One tenant’s broken — that’s a weekday. The list below is when you stop and call:

- Fleet-wide image regression. Multiple tenants flip yellow/red within minutes of each other after a deploy. Roll the image back across the fleet (`aas rollback --all` if available, otherwise loop over `aas list --status running`) and only then dig in.
- Composio outage. Every tenant’s Composio-backed streams and `composio_execute` calls fail simultaneously. Check status.composio.dev. There is nothing to fix on our side — pause hot stream pullers to stop hammering the API and wait it out.
- Hetzner API outage. `aas list --status running` works (DB is local), but `aas health`, provisioning, or volume operations all 5xx. Check status.hetzner.com. No mitigation other than back off and try later.
- Tailscale tailnet partition. SaaS proxy returns 502 for every tenant; SSH still works. Check status.tailscale.com and the tailnet admin console — a key rotation or ACL change can knock the whole proxy path out.
## Related pages

- Monitoring: every endpoint and dashboard tab the runbook references
- Security: network model, encryption, and the isolation invariants
- Infrastructure: the full provisioning order and Docker stack
- Sidecar State: cursor files, atomic-rename pattern, and recovery