railiance-cluster/DECISIONS.md

# Decision Log

_Auto-generated by the Custodian State Hub._

## D1 — Ingress controller: Traefik (K3s default) vs Nginx for ThreePhoenix

**Date:** 2026-02-25  
**Decided by:** Tegwick  

I want to go with option C and separate concerns. Nginx, which handles external SSL, will need security and functional updates on a completely different schedule from Traefik, which handles canary and production workload splitting. The Traefik side is more complicated and volatile, and will need time to settle.

---

## D2 — Durable offsite backup destination for single-server safety net

**Date:** 2026-02-25  
**Decided by:** Tegwick

We will use cloud storage. The backup must be encrypted so that it is safe regardless of location and provider. For starters, I will provide a Nextcloud upload space as the backend.
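
A minimal sketch of what such an encrypted upload could look like, assuming `pg_dump` produces the dump, `age` does the encryption, and the Nextcloud space is a public file-drop share reached over WebDAV with the share token as the user name. The function names, the file naming scheme, and the `public.php/webdav` endpoint path are illustrative assumptions, not the project's actual tooling.

```shell
#!/usr/bin/env bash
# Sketch: encrypt a PostgreSQL dump with age, then upload it to a Nextcloud
# file-drop share. Names, paths, and the endpoint are assumptions.
set -eu

# Timestamped object name, e.g. railiance-20260225T225928Z.sql.age
backup_name() {
  printf 'railiance-%s.sql.age\n' "$(date -u +%Y%m%dT%H%M%SZ)"
}

# $1 = age recipient (public key), $2 = Nextcloud base URL, $3 = share token
run_backup() {
  local recipient="$1" base_url="$2" token="$3" tmp
  set -o pipefail   # bash: a failed pg_dump must abort before the upload
  tmp="$(mktemp)"
  # Only ciphertext touches the disk; the plain dump stays in the pipe.
  pg_dump --format=custom "${PGDATABASE:?set PGDATABASE}" \
    | age -r "$recipient" > "$tmp"
  # The share token acts as the user name, with an empty password.
  curl --fail --silent --show-error -T "$tmp" \
    -u "${token}:" -H 'X-Requested-With: XMLHttpRequest' \
    "${base_url}/public.php/webdav/$(backup_name)"
  rm -f "$tmp"
}
```

Because the dump is encrypted before it leaves the host, the storage provider only ever sees ciphertext, which is what makes the choice of backend replaceable later.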

---

## D3 — HA and failover scenarios must be tested before a workplan is considered done

**Date:** 2026-03-10  
**Decided by:** Tegwick

On 2026-03-10 a PostgreSQL HA failover exposed a bug (missing `pgpool-password`
secret key) that had been present since initial deployment on 2025-08-31 but was
never discovered because no pod restart had occurred in 20 days. The immediate
symptom was Gitea logins hanging silently for hours.

This incident showed that deploying an HA component and declaring it "done"
without ever triggering a failover gives false confidence. Infrastructure that
has never failed over is not HA — it is just redundant hardware.
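
A missing secret key of this kind can be detected without waiting for a restart. The following is a hedged sketch of such a check; the function name is hypothetical, and the namespace and secret name passed to it are placeholders to be replaced with the real ones.

```shell
#!/usr/bin/env bash
# Sketch: verify that a Kubernetes Secret contains a non-empty key.
set -eu

# $1 = namespace, $2 = secret name, $3 = key inside .data
# Returns 0 if the key exists and is non-empty, 1 otherwise.
secret_has_key() {
  local value
  value="$(kubectl get secret "$2" -n "$1" -o "jsonpath={.data.$3}")" || return 1
  [ -n "$value" ]
}
```

For example, `secret_has_key <namespace> <secret-name> pgpool-password || exit 1` in a smoke test would fail fast if the key is absent, instead of leaving the gap to surface on the next pod restart.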

**Policy:**

Any workplan that deploys or configures a High Availability component
(database cluster, replicated storage, redundant ingress, etc.) is **not
complete** until a failover test passes. Specifically:

1. A test script in `tests/` must exist that deliberately kills the primary
   component and asserts the service remains available within an acceptable
   recovery window.

2. The test must be run against a live cluster and exit 0 before the workplan
   status is set to `completed`.

3. Smoke tests (`tests/smoke_kube.sh` or equivalent) must include a health
   check for each HA component's connection pooler, proxy, or load balancer —
   not just the backing nodes.

4. Any Helm chart values required to make HA work correctly (secrets,
   passwords, topology settings) must be present in the versioned values file
   before the workplan is closed, so that a `helm upgrade` cannot silently
   regress the fix.
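
Points 1 to 3 can be sketched as a small test helper. This is an illustration under stated assumptions, not the project's actual test: the function names, the `RECOVERY_WINDOW` variable, its 120-second default, and the 5-second poll interval are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: kill the primary, then require the service to recover in time.
set -eu

# Poll a health check until it succeeds or the recovery window expires.
# $1 = recovery window (seconds), $2 = poll interval (seconds),
# remaining arguments = the health check command to run.
wait_until_healthy() {
  local window="$1" interval="$2" waited=0
  shift 2
  until "$@"; do
    if [ "$waited" -ge "$window" ]; then
      return 1   # still unhealthy after the window: the test fails
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# $1 = namespace, $2 = pod currently acting as primary,
# remaining arguments = health check, run against the pooler or proxy.
failover_test() {
  local ns="$1" primary="$2"
  shift 2
  # Deliberately remove the primary without waiting for its replacement.
  kubectl delete pod "$primary" -n "$ns" --wait=false
  wait_until_healthy "${RECOVERY_WINDOW:-120}" 5 "$@"
}
```

A call such as `failover_test <namespace> <primary-pod> pg_isready -h <pooler-service> -p 5432` probes the service through its connection pooler rather than a backing node, and the script exits 0 only if the check succeeds inside the recovery window.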

**Rationale:** A failure that only surfaces on the first real event (restart,
failover, node loss) is a deployment bug, not an operational surprise. Railiance
aims for calm ops — and calm ops requires that every failure mode we know about
has been tested before it matters.

See: `workplans/RAIL-BS-WP-0003-pgpool-ha-failover-fix.md`

---