RAIL-HO-WP-0005-T09: Forgejo backup/restore drill assets and evidence

Add isolated-namespace restore drill (CNPG cluster, PVC, orchestration script) and document successful 2026-07-04 run: production forgejo dump restored with health 200 and pilot repos visible via API. Scheduled backups remain open.
2026-07-04 11:26:50 +02:00 · 092315895f
commit 092315895f
parent 2d62317ada
5 changed files with 257 additions and 5 deletions
--- a/docs/forgejo-restore-drill-evidence.md
+++ b/docs/forgejo-restore-drill-evidence.md
@@ -0,0 +1,93 @@
+# Forgejo Backup/Restore Drill Evidence
+
+Date: 2026-07-04  
+Workplan: RAIL-HO-WP-0005  
+Task: RAIL-HO-WP-0005-T09  
+`no_secret_material_recorded: true`
+
+## Purpose
+
+Prove that a production `forgejo dump` can be restored into an isolated
+namespace and serve repository metadata without touching production Forgejo or
+Gitea.
+
+## Backup source
+
+| Field | Value |
+| --- | --- |
+| Method | `forgejo dump` from production pod |
+| Production pod | `forgejo-gitea-64c5b57684-ph9vt` (namespace `forgejo`) |
+| Archive path (workstation) | `/tmp/forgejo-drill/forgejo-drill-backup.zip` |
+| Archive size | 12,284,847 bytes (~11.7 MiB) |
+| Archive timestamp | 2026-07-04 11:20 +0200 |
+| Archive contents (top-level) | `repos/`, `data/`, `forgejo-db.sql`, `app.ini` |
+
+Repos present in dump: `forgejo-actions-probe`, `glas-harness`, `key-cape`
+(all under `repos/coulomb/`).
+
+## Restore target
+
+| Field | Value |
+| --- | --- |
+| Namespace | `forgejo-restore-drill` |
+| Database | CNPG cluster `forgejo-db-restore` (isolated, 1 instance) |
+| App data PVC | `forgejo-restore-data` (`local-path`, 10Gi) |
+| Helm release | `forgejo-restore` (`gitea-charts/gitea` 12.5.0) |
+| Orchestration | `tools/forgejo-restore-drill.sh` |
+
+Restore path (Forgejo 11.0.3 has no `forgejo restore` CLI):
+
+1. Unzip dump into import pod staging area.
+2. Copy `repos/` → `/data/git/gitea-repositories/`.
+3. Copy `data/` → `/data/` (packages, attachments, avatars).
+4. Import `forgejo-db.sql` via `psql` into `forgejo-db-restore`.
+5. Deploy isolated Helm release bound to restored PVC + restore DB host.
+
+## Post-restore checks (2026-07-04)
+
+Port-forward: `svc/forgejo-restore-gitea-http` → `127.0.0.1:13000`
+
+| Check | Result |
+| --- | --- |
+| `GET /` health | HTTP 200 |
+| `GET /api/v1/repos/coulomb/glas-harness` | `full_name=coulomb/glas-harness`, `default_branch=main` |
+| `GET /api/v1/repos/coulomb/key-cape` | `full_name=coulomb/key-cape`, `default_branch=main` |
+| `GET /api/v1/orgs/coulomb/repos` | 3 repos: `forgejo-actions-probe`, `glas-harness`, `key-cape` |
+
+Script exit marker: `restore-drill-complete`
+
+## RPO / RTO (drill scope)
+
+| Metric | Observed / assumed |
+| --- | --- |
+| RPO (manual dump) | Point-in-time of `forgejo dump` execution; no scheduled backup yet |
+| RTO (isolated restore) | ~3–5 minutes for CNPG ready + import + Helm deploy on railiance01 |
+| Production impact | None observed: `forgejo dump` only reads production data and writes its archive to `/tmp` in the running pod; the restore runs in a separate namespace |
+
+## Gaps (not closed by this drill)
+
+- **Scheduled backups:** CNPG `Backup` CRs and off-cluster target not configured
+  (`kubectl cnpg` plugin absent on workstation).
+- **Encryption at rest:** dump stored locally on workstation for drill only; no
+  approved backup target wired.
+- **Automation:** `forgejo dump` is manual; T04/T09 still need cron/operator
+  schedule and retention policy (T02 decision).
+- **Re-run hygiene:** repeat runs require `DRILL_CLEAN=1` to wipe
+  `forgejo-restore-drill` before import (SQL import is not idempotent); concurrent runs would collide in the shared namespace.
+
+## Cleanup
+
+After evidence capture, delete the drill namespace:
+
+```bash
+kubectl delete namespace forgejo-restore-drill --wait=true
+```
+
+Production Forgejo (`forgejo` namespace) and Gitea remain unchanged.
+
+## References
+
+- `infra/forgejo-restore-drill/forgejo-db-restore-cluster.yaml`
+- `infra/forgejo-restore-drill/restore-job.yaml` (PVC manifest; contains no Job despite the file name)
+- `tools/forgejo-restore-drill.sh`
+- `workplans/RAIL-HO-WP-0005-forgejo-production-migration.md` (T09)
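The evidence file above lists the archive's top-level contents (`repos/`, `data/`, `forgejo-db.sql`, `app.ini`), and the restore path depends on the first three being present. A pre-flight check before import could look roughly like the sketch below. It is not part of this commit: the function names `top_level_entries` and `missing_entries` are hypothetical, the member path in the demo is made up, and the expected set is taken only from the evidence table.

```python
import io
import zipfile

# Top-level entries the evidence table lists for the dump archive.
EXPECTED_ENTRIES = {"repos", "data", "forgejo-db.sql", "app.ini"}


def top_level_entries(archive) -> set:
    """Return the first path component of every member in a zip archive.

    `archive` may be a filesystem path or a seekable file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return {name.split("/", 1)[0] for name in zf.namelist() if name}


def missing_entries(archive, expected=EXPECTED_ENTRIES) -> set:
    """Return the expected top-level entries that the archive lacks."""
    return set(expected) - top_level_entries(archive)


# Demo with an in-memory archive that deliberately lacks data/ and app.ini.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("repos/coulomb/key-cape.git/HEAD", "ref: refs/heads/main\n")
    zf.writestr("forgejo-db.sql", "SELECT 1;\n")

print(sorted(missing_entries(buf)))  # ['app.ini', 'data']
```

Against the real archive this would be called with the workstation path from the evidence table; an empty result means all four entries are present.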
--- a/infra/forgejo-restore-drill/forgejo-db-restore-cluster.yaml
+++ b/infra/forgejo-restore-drill/forgejo-db-restore-cluster.yaml
@@ -0,0 +1,21 @@
+---
+apiVersion: postgresql.cnpg.io/v1
+kind: Cluster
+metadata:
+  name: forgejo-db-restore
+  namespace: forgejo-restore-drill
+  labels:
+    app.kubernetes.io/name: forgejo-db-restore
+    railiance.io/layer: s3-platform
+    railiance.io/consumer: forgejo-restore-drill
+spec:
+  instances: 1
+  imageName: ghcr.io/cloudnative-pg/postgresql:16
+  storage:
+    size: 10Gi
+  bootstrap:
+    initdb:
+      database: forgejo
+      owner: forgejo
+      secret:
+        name: forgejo-db-credentials
--- a/infra/forgejo-restore-drill/restore-job.yaml
+++ b/infra/forgejo-restore-drill/restore-job.yaml
@@ -0,0 +1,12 @@
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: forgejo-restore-data
+  namespace: forgejo-restore-drill
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 10Gi
+  storageClassName: local-path
--- a/tools/forgejo-restore-drill.sh
+++ b/tools/forgejo-restore-drill.sh
@@ -0,0 +1,115 @@
+#!/usr/bin/env bash
+# Non-production Forgejo backup/restore drill (RAIL-HO-WP-0005-T09).
+# Re-run: DRILL_CLEAN=1 ./tools/forgejo-restore-drill.sh  (wipes namespace first)
+set -euo pipefail
+
+KUBECONFIG="${KUBECONFIG:-$HOME/.kube/config-hosteurope}"
+export KUBECONFIG
+NS=forgejo-restore-drill
+DRILL_CLEAN="${DRILL_CLEAN:-0}"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ROOT_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
+BACKUP_LOCAL="${BACKUP_LOCAL:-/tmp/forgejo-drill/forgejo-drill-backup.zip}"
+PROD_POD="${PROD_POD:-$(kubectl get pods -n forgejo -l app.kubernetes.io/instance=forgejo -o jsonpath='{.items[0].metadata.name}')}"
+
+step() { echo "==> $*"; }
+
+if [[ "${DRILL_CLEAN}" == "1" ]]; then
+  step "Clean prior drill namespace ${NS}"
+  kubectl delete namespace "${NS}" --wait=true --timeout=5m || true
+fi
+
+step "Create namespace ${NS}"
+kubectl create namespace "${NS}" --dry-run=client -o yaml | kubectl apply -f -
+
+step "Copy forgejo-db-credentials into ${NS}"
+kubectl get secret forgejo-db-credentials -n databases -o json \
+  | python3 -c "import json,sys; s=json.load(sys.stdin); s['metadata']={k:v for k,v in s['metadata'].items() if k in ('name','labels','annotations')}; s['metadata']['namespace']='${NS}'; print(json.dumps(s))" \
+  | kubectl apply -f -
+
+step "Deploy restore CNPG cluster"
+kubectl apply -f "${ROOT_DIR}/infra/forgejo-restore-drill/forgejo-db-restore-cluster.yaml"
+kubectl wait --for=condition=Ready cluster/forgejo-db-restore -n "${NS}" --timeout=10m
+
+step "Ensure local backup exists"
+if [[ ! -f "${BACKUP_LOCAL}" ]]; then
+  kubectl exec -n forgejo "${PROD_POD}" -c gitea -- forgejo dump -f /tmp/forgejo-drill-backup.zip
+  mkdir -p "$(dirname "${BACKUP_LOCAL}")"
+  kubectl cp "forgejo/${PROD_POD}:/tmp/forgejo-drill-backup.zip" "${BACKUP_LOCAL}" -c gitea
+fi
+ls -lh "${BACKUP_LOCAL}"
+
+step "Apply restore PVC"
+kubectl apply -f "${ROOT_DIR}/infra/forgejo-restore-drill/restore-job.yaml"
+
+step "Run restore pod (stage backup, import files + SQL)"
+kubectl delete pod forgejo-restore-import -n "${NS}" --ignore-not-found --wait=true
+cat <<EOF | kubectl apply -f -
+apiVersion: v1
+kind: Pod
+metadata:
+  name: forgejo-restore-import
+  namespace: ${NS}
+spec:
+  restartPolicy: Never
+  containers:
+    - name: restore
+      image: code.forgejo.org/forgejo/forgejo:11.0.3
+      command: ["sleep", "3600"]
+      volumeMounts:
+        - name: data
+          mountPath: /data
+        - name: backup
+          mountPath: /backup
+  volumes:
+    - name: data
+      persistentVolumeClaim:
+        claimName: forgejo-restore-data
+    - name: backup
+      emptyDir: {}
+EOF
+kubectl wait --for=condition=Ready pod/forgejo-restore-import -n "${NS}" --timeout=3m
+kubectl cp "${BACKUP_LOCAL}" "${NS}/forgejo-restore-import:/backup/forgejo-drill-backup.zip" -c restore
+DB_PASS="$(kubectl get secret forgejo-db-credentials -n "${NS}" -o jsonpath='{.data.password}' | base64 -d)"
+kubectl exec -n "${NS}" forgejo-restore-import -c restore -- env POSTGRES_PASSWORD="${DB_PASS}" sh -c '
+set -eu
+apk add --no-cache unzip postgresql-client >/dev/null
+rm -rf /data/*
+mkdir -p /data/git/gitea-repositories
+unzip -q /backup/forgejo-drill-backup.zip -d /tmp/dump
+cp -a /tmp/dump/repos/. /data/git/gitea-repositories/
+cp -a /tmp/dump/data/. /data/
+chown -R git:git /data
+PGPASSWORD="${POSTGRES_PASSWORD}" psql -h forgejo-db-restore-rw.forgejo-restore-drill.svc.cluster.local -U forgejo -d forgejo -v ON_ERROR_STOP=1 -f /tmp/dump/forgejo-db.sql
+echo restore-import-ok
+'
+unset DB_PASS
+kubectl delete pod forgejo-restore-import -n "${NS}" --wait=true
+
+step "Deploy isolated Forgejo release"
+cd "${HOME}/railiance-apps"
+DB_PASS="$(kubectl get secret forgejo-db-credentials -n "${NS}" -o jsonpath='{.data.password}' | base64 -d)"
+helm upgrade --install forgejo-restore gitea-charts/gitea --version 12.5.0 \
+  --namespace "${NS}" --create-namespace \
+  -f helm/forgejo-values.yaml \
+  -f helm/forgejo-registry-values.yaml \
+  --set strategy.type=Recreate \
+  --set persistence.existingClaim=forgejo-restore-data \
+  --set gitea.config.database.HOST=forgejo-db-restore-rw.${NS}.svc.cluster.local:5432 \
+  --set gitea.config.database.PASSWD="${DB_PASS}" \
+  --set gitea.config.server.DOMAIN=forgejo-restore.local \
+  --set gitea.config.server.ROOT_URL=http://forgejo-restore.local:3000/ \
+  --set gitea.admin.password=restore-drill-local-only \
+  --set ingress.enabled=false \
+  --wait --timeout=10m
+unset DB_PASS
+
+step "Post-restore checks via port-forward"
+kubectl port-forward -n "${NS}" svc/forgejo-restore-gitea-http 13000:3000 >/tmp/forgejo-restore-pf.log 2>&1 &
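+PF_PID=$!; trap 'kill "${PF_PID}" 2>/dev/null || true' EXIT  # stop the port-forward even if a check below fails under set -e
+for _ in $(seq 1 30); do curl -fs -o /dev/null http://127.0.0.1:13000/ && break; sleep 1; done  # poll up to ~30s instead of a fixed sleep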
+curl -fsS -o /dev/null -w 'health:%{http_code}\n' http://127.0.0.1:13000/
+curl -fsS http://127.0.0.1:13000/api/v1/repos/coulomb/glas-harness | python3 -c "import json,sys; d=json.load(sys.stdin); print('repo', d.get('full_name'), d.get('default_branch'))"
+curl -fsS http://127.0.0.1:13000/api/v1/repos/coulomb/key-cape | python3 -c "import json,sys; d=json.load(sys.stdin); print('repo', d.get('full_name'), d.get('default_branch'))"
+kill "${PF_PID}" 2>/dev/null || true
+echo "restore-drill-complete"
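The evidence table records a `GET /api/v1/orgs/coulomb/repos` check returning three repos, while the script above queries only the two individual repositories. A minimal sketch of that org-level check follows, in the same Python the script already uses inline. It assumes the response body is a JSON array of repository objects that each carry a `name` key, and it ignores pagination; `missing_repos` is a hypothetical helper, not something this commit adds.

```python
import json

# Repo names the drill evidence expects under the coulomb org.
EXPECTED_REPOS = {"forgejo-actions-probe", "glas-harness", "key-cape"}


def missing_repos(payload: str, expected=EXPECTED_REPOS) -> set:
    """Return expected repo names absent from an org repo-list response.

    Assumes `payload` is the JSON body of GET /api/v1/orgs/<org>/repos:
    an array of objects with a "name" key (pagination is not handled).
    """
    repos = json.loads(payload)
    if not isinstance(repos, list):
        raise ValueError("expected a JSON array of repositories")
    found = {repo.get("name") for repo in repos if isinstance(repo, dict)}
    return set(expected) - found


# Demo with a hand-written payload that omits one expected repo.
sample = '[{"name": "glas-harness"}, {"name": "key-cape"}]'
print(sorted(missing_repos(sample)))  # ['forgejo-actions-probe']
```

In the script this could take the `curl -fsS` output for the org endpoint on stdin, in the same style as the existing per-repo checks, and fail the drill when the returned set is non-empty.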
--- a/workplans/RAIL-HO-WP-0005-forgejo-production-migration.md
+++ b/workplans/RAIL-HO-WP-0005-forgejo-production-migration.md
@@ -109,7 +109,8 @@ acceptance criteria below are tracked across T05, T07, T08, and T10 instead.
 Still to prove before T11:

 - SMTP/password reset end-to-end (T06).
-- Backup and restore in isolated namespace (T09).
+- Backup and restore in isolated namespace (T09) — **drill passed 2026-07-04**;
+  scheduled automation pending.
 - Issues/releases/wiki/LFS per inventory classification (T10 matrix).
 - Operator SSH identity on Forgejo beyond interim `forgejo_admin` keys (T02/T10).

@@ -377,7 +378,7 @@ a pullable image without privileged cluster-wide credentials. **Tier 2: done.**

 ```task
 id: RAIL-HO-WP-0005-T09
-status: todo
+status: progress
 priority: high
 state_hub_task_id: "25892007-36ca-4bd9-8adf-84d505465d7d"
 ```
@@ -402,8 +403,17 @@ Acceptance:
 - Restore into an isolated namespace is drilled and documented.
 - RPO/RTO expectations are recorded.

+**Partial (2026-07-04):** isolated restore drill **passed**. Production
+`forgejo dump` (~11.7 MiB) restored into `forgejo-restore-drill` namespace;
+post-restore API checks: health 200, `coulomb/glas-harness` and
+`coulomb/key-cape` on `main`, 3 org repos visible. Evidence:
+`docs/forgejo-restore-drill-evidence.md`. Assets: `infra/forgejo-restore-drill/`,
+`tools/forgejo-restore-drill.sh`. Remaining: scheduled CNPG/off-cluster backups,
+encryption/approved target (T02/T04), automated dump schedule.
+
 **Done when:** a fresh backup restores to a working isolated Forgejo instance
-with repository, package, and user recovery checks passing.
+with repository, package, and user recovery checks passing **and** scheduled
+backups run without manual intervention.

 ---

@@ -520,8 +530,9 @@ T05+T08 ──► T10 migration ladder ──► T11 production cutover ──
 T03 isolated probe: CANCELLED (superseded by T05 + in-production pilots)
 ```

-**Current focus (2026-07-04):** T10 tiers 0–2 **complete**; T09 backup drill
-and T02 open decisions (SMTP, backup target) before tier-3 production repos.
+**Current focus (2026-07-04):** T10 tiers 0–2 **complete**; T09 restore drill
+**passed** (scheduled backups + backup target still open); T02 decisions (SMTP,
+backup target) before tier-3 production repos.
 Do not start T11 `state-hub` until T09 complete and `CUST-WP-0054` Wave-1
 gates satisfied.