railiance-cluster/workplans/RAIL-BS-WP-0004-safety-net.md
tegwick 5b0cfbf10a feat(backup): revise WP-0004 — integrated backup per capability (D4)
WP-0004 rewritten: scope narrowed to S2-owned assets (etcd snapshots,
Helm values, kubeconfig). No external dependencies. age encryption
reuses SOPS key pair. Output to /opt/backup/railiance/cluster/.

DECISIONS.md D4: integrated backup per capability, not centralized.
EP-RAIL-005 registered in state hub: custodian orchestration deferred
until all layers implement the standard interface.

The old monolithic backup (custodian DB + operator config) was not S2's
concern and has been removed from this workplan scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 17:43:30 +01:00


---
id: RAIL-BS-WP-0004
type: workplan
title: "Integrated Backup — S2 Kubernetes Runtime Layer"
domain: railiance
repo: railiance-cluster
status: active
owner: tegwick
topic_slug: railiance
state_hub_workstream_id: "7e8b0c20-51eb-40c9-9e3b-85dd380d7625"
created: "2026-02-25"
updated: "2026-03-10"
---
# Integrated Backup — S2 Kubernetes Runtime Layer
## Goal
Implement the Q3 (Operability & Resilience) integrated backup for
railiance-cluster (S2). The backup covers what S2 owns (the Kubernetes runtime
state), is encrypted with age, and is written to a local directory on the
server. No external dependencies are required.
## Architecture (Decision D4)
Each railiance repo implements its own backup for what it owns. No central
backup service. See `DECISIONS.md` D4 for full rationale.
**Standard interface every railiance repo must provide:**
```bash
make backup # encrypt + write to /opt/backup/railiance/<layer>/
make restore # restore from most recent local backup
```
Encryption: age, same key pair as SOPS secrets (`.sops.yaml` public key).
Output: `/opt/backup/railiance/cluster/` on the server.
## What S2 (railiance-cluster) owns and must back up
| Asset | Why it matters |
|---|---|
| k3s etcd snapshots | Full cluster state — all workloads, configs, secrets |
| Helm release values | Runtime values not in git (any manually applied overrides) |
| kubeconfig | Admin access to the cluster |
**Not S2's responsibility:**
- Custodian State Hub DB → the-custodian owns this
- Operator workstation config (`.claude/`, `.gitconfig`) → operator's own concern
- Application data (Gitea repos, uploads) → S5 (railiance-apps) owns this
- PostgreSQL data volumes → S3 (railiance-platform) owns this
## Encryption
Reuse the age public key from `.sops.yaml`:
```bash
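# Take the first age recipient (age1...) found in .sops.yaml; this also works
# when the key sits in a YAML list item or on a continuation line.
AGE_PUBLIC_KEY=$(grep -oE 'age1[0-9a-z]+' .sops.yaml | head -n 1)
tar -czf - <assets> | age -r "${AGE_PUBLIC_KEY}" -o backup.tar.gz.age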
```
Decryption requires the private key at `~/.config/sops/age/keys.txt`
(same key used for `sops -d`). No additional key management needed.
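If `.sops.yaml` ever lists more than one age recipient (sops accepts a comma-separated list), the backup should probably be encrypted to all of them. A minimal sketch, assuming recipients appear as literal `age1...` strings in the file; the helper name `sops_age_recipients` is hypothetical and not part of the existing tooling:

```bash
# Hypothetical helper: print every distinct age recipient found in a
# .sops.yaml-style file, one per line. Assumes keys appear as literal
# bech32 strings starting with "age1".
sops_age_recipients() {
  grep -oE 'age1[0-9a-z]+' "$1" | sort -u
}

# Usage sketch (not run here): build one -r flag per recipient.
#   set --
#   for r in $(sops_age_recipients .sops.yaml); do set -- "$@" -r "$r"; done
#   tar -czf - <assets> | age "$@" -o backup.tar.gz.age
```

Any one of the matching private keys can then decrypt the archive.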
## Extension Point EP-RAIL-005
Once all five OAS layers implement this interface, the custodian can
orchestrate a full-stack backup with:
```bash
for repo in railiance-infra railiance-cluster railiance-platform \
            railiance-enablement railiance-apps; do
  make -C "$HOME/$repo" backup
done
```
No special protocol needed — just the standard interface.
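An orchestrator will likely want to attempt every layer and report which ones failed, rather than stop silently or abort at the first error. A minimal sketch of that idea; the function name `backup_all_layers` and the assumption that all repos live under one base directory are illustrative only:

```bash
# Hypothetical orchestrator: run `make backup` in every layer repo under
# a base directory and report which layers failed.
backup_all_layers() {
  local base="$1" repo failed=""
  shift
  for repo in "$@"; do
    if ! make -C "$base/$repo" backup; then
      failed="$failed $repo"
    fi
  done
  if [ -n "$failed" ]; then
    echo "backup failed for:$failed" >&2
    return 1
  fi
}

# Usage sketch (not run here):
#   backup_all_layers "$HOME" railiance-infra railiance-cluster \
#     railiance-platform railiance-enablement railiance-apps
```

This still relies only on the standard `make backup` interface and its exit code.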
---
## Tasks
### T01 — Define backup directory and encryption wrapper
```task
id: T01
status: todo
priority: high
state_hub_task_id: "4526a842-ea31-4874-9231-92ab556cfe7b"
```
Create `tools/cmd/railiance-backup-s2` (replacing the old `railiance-backup`):
- Backup dir: `/opt/backup/railiance/cluster/` (create with `mkdir -p`)
- Encrypt each artifact with age using public key from `.sops.yaml`
- Write timestamp-named files: `etcd-<ts>.snap.age`, `helm-values-<ts>.tar.gz.age`, `kubeconfig-<ts>.yaml.age`
- Keep last 7 of each type
- Write `.last-backup` stamp
- Exit 0 on success, non-zero on any failure
- No network required
**Done when:** `make backup` runs on COULOMBCORE without error and files
appear in `/opt/backup/railiance/cluster/`.
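The "keep last 7 of each type" rule could be implemented as a small pruning helper. The sketch below assumes `<ts>` uses the same `%Y%m%dT%H%M%SZ` format as the snapshot name in T02, so that file names sort chronologically; the name `prune_backups` is hypothetical:

```bash
# Hypothetical retention helper: keep the newest $3 files named
# "<prefix>-<ts>.<ext>" in directory $1 and delete the older ones.
# Assumes <ts> looks like 20260310T020000Z, which sorts by time.
prune_backups() {
  local dir="$1" prefix="$2" keep="$3"
  ls -1 "$dir" | grep -E "^${prefix}-[0-9]{8}T[0-9]{6}Z\." | sort -r |
    tail -n "+$((keep + 1))" |
    while read -r f; do
      rm -f -- "$dir/$f"
    done
}

# Usage sketch (not run here):
#   for kind in etcd helm-values kubeconfig; do
#     prune_backups /opt/backup/railiance/cluster "$kind" 7
#   done
```

Matching on the full timestamp pattern keeps the helper from touching `backup.log`, `.last-backup`, or files of another type.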
---
### T02 — Back up k3s etcd snapshots
```task
id: T02
status: todo
priority: high
state_hub_task_id: "a6313e06-1976-46a7-8e31-df4eb2eca880"
```
k3s has built-in etcd snapshot support (available only when the server runs
the embedded etcd datastore, not the default SQLite one):
```bash
sudo k3s etcd-snapshot save --name railiance-$(date -u +%Y%m%dT%H%M%SZ)
# Default location: /var/lib/rancher/k3s/server/db/snapshots/
```
Add to the backup script: take a fresh snapshot, encrypt with age,
copy to `/opt/backup/railiance/cluster/`.
**Done when:** backup includes a current etcd snapshot.
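k3s may append its own suffix to the name given with `--name`, so the script cannot assume the exact file name of the snapshot it just took. One option is to pick the newest file with the `railiance-` prefix. A minimal sketch; `latest_snapshot` is a hypothetical helper, and it must run as a user that can read the snapshot directory (typically root):

```bash
# Hypothetical helper: print the most recently modified file in
# directory $1 whose name starts with prefix $2; return 1 if none.
latest_snapshot() {
  local dir="$1" prefix="$2" newest="" f
  for f in "$dir"/"$prefix"*; do
    [ -e "$f" ] || continue
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest="$f"
    fi
  done
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}

# Usage sketch (not run here), after `k3s etcd-snapshot save`:
#   snap=$(latest_snapshot /var/lib/rancher/k3s/server/db/snapshots railiance-)
#   age -r "$AGE_PUBLIC_KEY" -o "/opt/backup/railiance/cluster/etcd-<ts>.snap.age" "$snap"
```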
---
### T03 — Back up Helm release values
```task
id: T03
status: todo
priority: medium
state_hub_task_id: "05d42a55-921f-4aa7-bb76-e8af9c7e0ac3"
```
Capture current runtime Helm values for all releases:
```bash
# Emit "<name> <namespace>" per release, then print its user-supplied values.
helm list -A -o json | jq -r '.[] | "\(.name) \(.namespace)"' | \
  while read -r name ns; do
    helm get values "$name" -n "$ns" -o yaml
  done
```
Tar and age-encrypt into `helm-values-<ts>.tar.gz.age`.
**Done when:** backup includes a snapshot of all Helm release values.
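To tar the values, each release's output first has to land in its own file. A minimal sketch that reads `<name> <namespace>` pairs on stdin and writes one YAML file per release; the function name `dump_helm_values` and the `<namespace>__<name>.yaml` naming are assumptions made for illustration:

```bash
# Hypothetical helper: read "<name> <namespace>" pairs from stdin and
# write each release's values to <outdir>/<namespace>__<name>.yaml.
dump_helm_values() {
  local outdir="$1" name ns
  mkdir -p "$outdir" || return 1
  while read -r name ns; do
    [ -n "$name" ] || continue
    helm get values "$name" -n "$ns" -o yaml </dev/null \
      > "$outdir/${ns}__${name}.yaml" || return 1
  done
}

# Usage sketch (not run here):
#   tmp=$(mktemp -d)
#   helm list -A -o json | jq -r '.[] | "\(.name) \(.namespace)"' |
#     dump_helm_values "$tmp/helm-values"
#   tar -czf - -C "$tmp" helm-values |
#     age -r "$AGE_PUBLIC_KEY" -o "helm-values-<ts>.tar.gz.age"
```

Including the namespace in the file name avoids collisions between releases that share a name in different namespaces.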
---
### T04 — Back up kubeconfig
```task
id: T04
status: todo
priority: medium
state_hub_task_id: "08233868-d522-4117-bc4e-6c0f52545665"
```
Age-encrypt `~/.kube/config-hosteurope` (or `/etc/rancher/k3s/k3s.yaml`)
into `kubeconfig-<ts>.yaml.age` in the backup directory.
**Done when:** backup includes the encrypted kubeconfig.
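Since two candidate locations are named, the script could take the first one that exists and is readable. A minimal sketch; `first_readable` is a hypothetical helper name:

```bash
# Hypothetical helper: print the first argument that is a readable
# file; return 1 if none of the candidates qualifies.
first_readable() {
  local f
  for f in "$@"; do
    if [ -r "$f" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  return 1
}

# Usage sketch (not run here):
#   src=$(first_readable "$HOME/.kube/config-hosteurope" /etc/rancher/k3s/k3s.yaml) || exit 1
#   age -r "$AGE_PUBLIC_KEY" -o "/opt/backup/railiance/cluster/kubeconfig-<ts>.yaml.age" "$src"
```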
---
### T05 — make restore target
```task
id: T05
status: todo
priority: medium
state_hub_task_id: "2d5acff7-4a4e-4ddd-ad06-08237ad3dac8"
```
Add `tools/cmd/railiance-restore-s2` that decrypts and lists available
backups, with guided restore for the etcd snapshot case.
Restore of etcd from snapshot:
```bash
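sudo systemctl stop k3s   # assumes the default systemd unit; k3s must be stopped first
sudo k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<name>
sudo systemctl start k3s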
```
**Done when:** `make restore` prints available backups and a restore guide.
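The listing part of `make restore` could look roughly like this. It is a sketch under the file naming from T01, with a hypothetical function name `list_backups`; it only lists files and does not decrypt anything:

```bash
# Hypothetical helper: list encrypted backups in directory $1, grouped
# by type, newest first (relies on timestamps sorting lexicographically).
list_backups() {
  local dir="$1" kind f
  for kind in etcd helm-values kubeconfig; do
    printf '%s:\n' "$kind"
    for f in "$dir"/"$kind"-*.age; do
      if [ -e "$f" ]; then
        printf '  %s\n' "${f##*/}"
      fi
    done | sort -r
  done
}

# Usage sketch (not run here):
#   list_backups /opt/backup/railiance/cluster
```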
---
### T06 — Install cron job and run restore drill
```task
id: T06
status: todo
priority: medium
state_hub_task_id: "f8e4a094-c367-40eb-b895-da17bc144b07"
```
Install the daily cron and verify decrypt works:
```bash
# Install cron on COULOMBCORE
(crontab -l 2>/dev/null; echo "0 2 * * * make -C ~/railiance-cluster backup >> /opt/backup/railiance/cluster/backup.log 2>&1") | crontab -
# Drill: decrypt etcd snapshot and verify it's readable
age -d -i ~/.config/sops/age/keys.txt \
  /opt/backup/railiance/cluster/etcd-<latest>.snap.age | file -
```
**Done when:** cron installed, drill completes without error, log entry written.
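The drill could also be scripted so it can be repeated without looking up the latest file name by hand. A minimal sketch, assuming the `etcd-<ts>.snap.age` naming from T01; `restore_drill` is a hypothetical name, and the check is deliberately shallow (it only confirms that decryption yields data):

```bash
# Hypothetical drill: decrypt the newest etcd backup in directory $1
# with the age identity file $2 and fail unless plaintext comes out.
restore_drill() {
  local dir="$1" key="$2" latest bytes
  latest=$(ls -1 "$dir"/etcd-*.snap.age 2>/dev/null | sort | tail -n 1)
  if [ -z "$latest" ]; then
    echo "no etcd backup found in $dir" >&2
    return 1
  fi
  # age writes the plaintext to stdout when -o is omitted.
  bytes=$(age -d -i "$key" "$latest" | wc -c | tr -d ' ')
  if [ "$bytes" -le 0 ]; then
    echo "decrypt produced no data: $latest" >&2
    return 1
  fi
  echo "drill ok: $latest ($bytes bytes)"
}

# Usage sketch (not run here):
#   restore_drill /opt/backup/railiance/cluster ~/.config/sops/age/keys.txt
```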
---
## References
- Decision D4: Integrated backup per capability (`DECISIONS.md`)
- Decision D2: Nextcloud as optional offsite extension (still valid, not a requirement)
- OAS Q3: Operability & Resilience
- Extension point EP-RAIL-005: Custodian full-stack backup orchestration
- k3s etcd snapshots: https://docs.k3s.io/datastore/backup-restore