From 092315895fc801825fb7b16411255d896769149f Mon Sep 17 00:00:00 2001 From: tegwick Date: Sat, 4 Jul 2026 11:26:50 +0200 Subject: [PATCH] RAIL-HO-WP-0005-T09: Forgejo backup/restore drill assets and evidence Add isolated-namespace restore drill (CNPG cluster, PVC, orchestration script) and document successful 2026-07-04 run: production forgejo dump restored with health 200 and pilot repos visible via API. Scheduled backups remain open. --- docs/forgejo-restore-drill-evidence.md | 93 ++++++++++++++ .../forgejo-db-restore-cluster.yaml | 21 ++++ infra/forgejo-restore-drill/restore-job.yaml | 12 ++ tools/forgejo-restore-drill.sh | 115 ++++++++++++++++++ ...HO-WP-0005-forgejo-production-migration.md | 21 +++- 5 files changed, 257 insertions(+), 5 deletions(-) create mode 100644 docs/forgejo-restore-drill-evidence.md create mode 100644 infra/forgejo-restore-drill/forgejo-db-restore-cluster.yaml create mode 100644 infra/forgejo-restore-drill/restore-job.yaml create mode 100755 tools/forgejo-restore-drill.sh diff --git a/docs/forgejo-restore-drill-evidence.md b/docs/forgejo-restore-drill-evidence.md new file mode 100644 index 0000000..adceb3b --- /dev/null +++ b/docs/forgejo-restore-drill-evidence.md @@ -0,0 +1,93 @@ +# Forgejo Backup/Restore Drill Evidence + +Date: 2026-07-04 +Workplan: RAIL-HO-WP-0005 +Task: RAIL-HO-WP-0005-T09 +`no_secret_material_recorded: true` + +## Purpose + +Prove that a production `forgejo dump` can be restored into an isolated +namespace and serve repository metadata without touching production Forgejo or +Gitea. 
+ +## Backup source + +| Field | Value | +| --- | --- | +| Method | `forgejo dump` from production pod | +| Production pod | `forgejo-gitea-64c5b57684-ph9vt` (namespace `forgejo`) | +| Archive path (workstation) | `/tmp/forgejo-drill/forgejo-drill-backup.zip` | +| Archive size | 12,284,847 bytes (~11.7 MiB) | +| Archive timestamp | 2026-07-04 11:20 +0200 | +| Archive contents (top-level) | `repos/`, `data/`, `forgejo-db.sql`, `app.ini` | + +Repos present in dump: `forgejo-actions-probe`, `glas-harness`, `key-cape` +(all under `repos/coulomb/`). + +## Restore target + +| Field | Value | +| --- | --- | +| Namespace | `forgejo-restore-drill` | +| Database | CNPG cluster `forgejo-db-restore` (isolated, 1 instance) | +| App data PVC | `forgejo-restore-data` (`local-path`, 10Gi) | +| Helm release | `forgejo-restore` (`gitea-charts/gitea` 12.5.0) | +| Orchestration | `tools/forgejo-restore-drill.sh` | + +Restore path (Forgejo 11.0.3 has no `forgejo restore` CLI): + +1. Unzip dump into import pod staging area. +2. Copy `repos/` → `/data/git/gitea-repositories/`. +3. Copy `data/` → `/data/` (packages, attachments, avatars). +4. Import `forgejo-db.sql` via `psql` into `forgejo-db-restore`. +5. Deploy isolated Helm release bound to restored PVC + restore DB host. 
+ +## Post-restore checks (2026-07-04) + +Port-forward: `svc/forgejo-restore-gitea-http` → `127.0.0.1:13000` + +| Check | Result | +| --- | --- | +| `GET /` health | HTTP 200 | +| `GET /api/v1/repos/coulomb/glas-harness` | `full_name=coulomb/glas-harness`, `default_branch=main` | +| `GET /api/v1/repos/coulomb/key-cape` | `full_name=coulomb/key-cape`, `default_branch=main` | +| `GET /api/v1/orgs/coulomb/repos` | 3 repos: `forgejo-actions-probe`, `glas-harness`, `key-cape` | + +Script exit marker: `restore-drill-complete` + +## RPO / RTO (drill scope) + +| Metric | Observed / assumed | +| --- | --- | +| RPO (manual dump) | Point-in-time of `forgejo dump` execution; no scheduled backup yet | +| RTO (isolated restore) | ~3–5 minutes for CNPG ready + import + Helm deploy on railiance01 | +| Production impact | None — read-only dump from running pod; separate namespace | + +## Gaps (not closed by this drill) + +- **Scheduled backups:** CNPG `Backup` CRs and off-cluster target not configured + (`kubectl cnpg` plugin absent on workstation). +- **Encryption at rest:** dump stored locally on workstation for drill only; no + approved backup target wired. +- **Automation:** `forgejo dump` is manual; T04/T09 still need cron/operator + schedule and retention policy (T02 decision). +- **Re-run hygiene:** concurrent or repeat runs require `DRILL_CLEAN=1` to wipe + `forgejo-restore-drill` before import (SQL import is not idempotent). + +## Cleanup + +After evidence capture, delete the drill namespace: + +```bash +kubectl delete namespace forgejo-restore-drill --wait=true +``` + +Production Forgejo (`forgejo` namespace) and Gitea remain unchanged. 
+ +## References + +- `infra/forgejo-restore-drill/forgejo-db-restore-cluster.yaml` +- `infra/forgejo-restore-drill/restore-job.yaml` +- `tools/forgejo-restore-drill.sh` +- `workplans/RAIL-HO-WP-0005-forgejo-production-migration.md` (T09) \ No newline at end of file diff --git a/infra/forgejo-restore-drill/forgejo-db-restore-cluster.yaml b/infra/forgejo-restore-drill/forgejo-db-restore-cluster.yaml new file mode 100644 index 0000000..302c396 --- /dev/null +++ b/infra/forgejo-restore-drill/forgejo-db-restore-cluster.yaml @@ -0,0 +1,21 @@ +--- +apiVersion: postgresql.cnpg.io/v1 +kind: Cluster +metadata: + name: forgejo-db-restore + namespace: forgejo-restore-drill + labels: + app.kubernetes.io/name: forgejo-db-restore + railiance.io/layer: s3-platform + railiance.io/consumer: forgejo-restore-drill +spec: + instances: 1 + imageName: ghcr.io/cloudnative-pg/postgresql:16 + storage: + size: 10Gi + bootstrap: + initdb: + database: forgejo + owner: forgejo + secret: + name: forgejo-db-credentials \ No newline at end of file diff --git a/infra/forgejo-restore-drill/restore-job.yaml b/infra/forgejo-restore-drill/restore-job.yaml new file mode 100644 index 0000000..0c02577 --- /dev/null +++ b/infra/forgejo-restore-drill/restore-job.yaml @@ -0,0 +1,12 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: forgejo-restore-data + namespace: forgejo-restore-drill +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 10Gi + storageClassName: local-path \ No newline at end of file diff --git a/tools/forgejo-restore-drill.sh b/tools/forgejo-restore-drill.sh new file mode 100755 index 0000000..10fb43e --- /dev/null +++ b/tools/forgejo-restore-drill.sh @@ -0,0 +1,115 @@ +#!/usr/bin/env bash +# Non-production Forgejo backup/restore drill (RAIL-HO-WP-0005-T09). 
+# Re-run: DRILL_CLEAN=1 ./tools/forgejo-restore-drill.sh (wipes namespace first) +set -euo pipefail + +KUBECONFIG="${KUBECONFIG:-$HOME/.kube/config-hosteurope}" +export KUBECONFIG +NS=forgejo-restore-drill +DRILL_CLEAN="${DRILL_CLEAN:-0}" +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +ROOT_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)" +BACKUP_LOCAL="${BACKUP_LOCAL:-/tmp/forgejo-drill/forgejo-drill-backup.zip}" +PROD_POD="${PROD_POD:-$(kubectl get pods -n forgejo -l app.kubernetes.io/instance=forgejo -o jsonpath='{.items[0].metadata.name}')}" + +step() { echo "==> $*"; } + +if [[ "${DRILL_CLEAN}" == "1" ]]; then + step "Clean prior drill namespace ${NS}" + kubectl delete namespace "${NS}" --wait=true --timeout=5m || true +fi + +step "Create namespace ${NS}" +kubectl create namespace "${NS}" --dry-run=client -o yaml | kubectl apply -f - + +step "Copy forgejo-db-credentials into ${NS}" +kubectl get secret forgejo-db-credentials -n databases -o json \ + | python3 -c "import json,sys; s=json.load(sys.stdin); s['metadata']={k:v for k,v in s['metadata'].items() if k in ('name','labels','annotations')}; s['metadata']['namespace']='${NS}'; print(json.dumps(s))" \ + | kubectl apply -f - + +step "Deploy restore CNPG cluster" +kubectl apply -f "${ROOT_DIR}/infra/forgejo-restore-drill/forgejo-db-restore-cluster.yaml" +kubectl wait --for=condition=Ready cluster/forgejo-db-restore -n "${NS}" --timeout=10m + +step "Ensure local backup exists" +if [[ ! 
-f "${BACKUP_LOCAL}" ]]; then + kubectl exec -n forgejo "${PROD_POD}" -c gitea -- forgejo dump -f /tmp/forgejo-drill-backup.zip + mkdir -p "$(dirname "${BACKUP_LOCAL}")" + kubectl cp "forgejo/${PROD_POD}:/tmp/forgejo-drill-backup.zip" "${BACKUP_LOCAL}" -c gitea +fi +ls -lh "${BACKUP_LOCAL}" + +step "Apply restore PVC" +kubectl apply -f "${ROOT_DIR}/infra/forgejo-restore-drill/restore-job.yaml" + +step "Run restore pod (stage backup, import files + SQL)" +kubectl delete pod forgejo-restore-import -n "${NS}" --ignore-not-found --wait=true +cat </dev/null +rm -rf /data/* +mkdir -p /data/git/gitea-repositories +unzip -q /backup/forgejo-drill-backup.zip -d /tmp/dump +cp -a /tmp/dump/repos/. /data/git/gitea-repositories/ +cp -a /tmp/dump/data/. /data/ +chown -R git:git /data +PGPASSWORD="${POSTGRES_PASSWORD}" psql -h forgejo-db-restore-rw.forgejo-restore-drill.svc.cluster.local -U forgejo -d forgejo -v ON_ERROR_STOP=1 -f /tmp/dump/forgejo-db.sql +echo restore-import-ok +' +unset DB_PASS +kubectl delete pod forgejo-restore-import -n "${NS}" --wait=true + +step "Deploy isolated Forgejo release" +cd "${HOME}/railiance-apps" +DB_PASS="$(kubectl get secret forgejo-db-credentials -n "${NS}" -o jsonpath='{.data.password}' | base64 -d)" +helm upgrade --install forgejo-restore gitea-charts/gitea --version 12.5.0 \ + --namespace "${NS}" --create-namespace \ + -f helm/forgejo-values.yaml \ + -f helm/forgejo-registry-values.yaml \ + --set strategy.type=Recreate \ + --set persistence.existingClaim=forgejo-restore-data \ + --set gitea.config.database.HOST=forgejo-db-restore-rw.${NS}.svc.cluster.local:5432 \ + --set gitea.config.database.PASSWD="${DB_PASS}" \ + --set gitea.config.server.DOMAIN=forgejo-restore.local \ + --set gitea.config.server.ROOT_URL=http://forgejo-restore.local:3000/ \ + --set gitea.admin.password=restore-drill-local-only \ + --set ingress.enabled=false \ + --wait --timeout=10m +unset DB_PASS + +step "Post-restore checks via port-forward" +kubectl port-forward -n 
"${NS}" svc/forgejo-restore-gitea-http 13000:3000 >/tmp/forgejo-restore-pf.log 2>&1 & +PF_PID=$! +for _ in $(seq 1 30); do curl -fs -o /dev/null http://127.0.0.1:13000/ && break; sleep 1; done # poll up to 30 s for the port-forward instead of a fixed sleep +curl -fsS -o /dev/null -w 'health:%{http_code}\n' http://127.0.0.1:13000/ +curl -fsS http://127.0.0.1:13000/api/v1/repos/coulomb/glas-harness | python3 -c "import json,sys; d=json.load(sys.stdin); print('repo', d.get('full_name'), d.get('default_branch'))" +curl -fsS http://127.0.0.1:13000/api/v1/repos/coulomb/key-cape | python3 -c "import json,sys; d=json.load(sys.stdin); print('repo', d.get('full_name'), d.get('default_branch'))" +kill "${PF_PID}" 2>/dev/null || true +echo "restore-drill-complete" \ No newline at end of file diff --git a/workplans/RAIL-HO-WP-0005-forgejo-production-migration.md b/workplans/RAIL-HO-WP-0005-forgejo-production-migration.md index f762088..4eaef57 100644 --- a/workplans/RAIL-HO-WP-0005-forgejo-production-migration.md +++ b/workplans/RAIL-HO-WP-0005-forgejo-production-migration.md @@ -109,7 +109,8 @@ acceptance criteria below are tracked across T05, T07, T08, and T10 instead. Still to prove before T11: - SMTP/password reset end-to-end (T06). -- Backup and restore in isolated namespace (T09). +- Backup and restore in isolated namespace (T09) — **drill passed 2026-07-04**; + scheduled automation pending. - Issues/releases/wiki/LFS per inventory classification (T10 matrix). - Operator SSH identity on Forgejo beyond interim `forgejo_admin` keys (T02/T10). @@ -377,7 +378,7 @@ a pullable image without privileged cluster-wide credentials. **Tier 2: done.** ```task id: RAIL-HO-WP-0005-T09 -status: todo +status: progress priority: high state_hub_task_id: "25892007-36ca-4bd9-8adf-84d505465d7d" ``` @@ -402,8 +403,17 @@ Acceptance: - Restore into an isolated namespace is drilled and documented. - RPO/RTO expectations are recorded. +**Partial (2026-07-04):** isolated restore drill **passed**. 
Production +`forgejo dump` (~11.7 MiB) restored into `forgejo-restore-drill` namespace; +post-restore API checks: health 200, `coulomb/glas-harness` and +`coulomb/key-cape` on `main`, 3 org repos visible. Evidence: +`docs/forgejo-restore-drill-evidence.md`. Assets: `infra/forgejo-restore-drill/`, +`tools/forgejo-restore-drill.sh`. Remaining: scheduled CNPG/off-cluster backups, +encryption/approved target (T02/T04), automated dump schedule. + **Done when:** a fresh backup restores to a working isolated Forgejo instance -with repository, package, and user recovery checks passing. +with repository, package, and user recovery checks passing **and** scheduled +backups run without manual intervention. --- @@ -520,8 +530,9 @@ T05+T08 ──► T10 migration ladder ──► T11 production cutover ── T03 isolated probe: CANCELLED (superseded by T05 + in-production pilots) ``` -**Current focus (2026-07-04):** T10 tiers 0–2 **complete**; T09 backup drill -and T02 open decisions (SMTP, backup target) before tier-3 production repos. +**Current focus (2026-07-04):** T10 tiers 0–2 **complete**; T09 restore drill +**passed** (scheduled backups + backup target still open); T02 decisions (SMTP, +backup target) before tier-3 production repos. Do not start T11 `state-hub` until T09 complete and `CUST-WP-0054` Wave-1 gates satisfied.
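
Review note on the post-restore checks: the evidence table in `docs/forgejo-restore-drill-evidence.md` lists a `GET /api/v1/orgs/coulomb/repos` check returning 3 repos, but `tools/forgejo-restore-drill.sh` as committed only queries the two individual repositories. The sketch below shows one way that third check might be scripted. It is a suggestion, not part of this patch: `repo_names` and `check_org_repos` are hypothetical helper names, and the sketch assumes `python3` is on the workstation (the script already relies on it) and that the endpoint returns a JSON array of objects with a `name` field.

```shell
#!/usr/bin/env bash
# Sketch only: repo_names and check_org_repos are hypothetical helpers,
# not functions defined anywhere in this patch.
set -euo pipefail

# Read a JSON array of repository objects on stdin and print each
# repository "name", sorted by code point, one per line.
repo_names() {
  python3 -c 'import json, sys
for name in sorted(r["name"] for r in json.load(sys.stdin)):
    print(name)'
}

# Succeed only if the JSON array on stdin lists exactly the repository
# names given as arguments (order-insensitive).
check_org_repos() {
  local expected actual
  expected="$(printf '%s\n' "$@" | LC_ALL=C sort)"
  actual="$(repo_names)"
  [[ "${actual}" == "${expected}" ]]
}
```

With the port-forward from the script still running, the check could be invoked as `curl -fsS http://127.0.0.1:13000/api/v1/orgs/coulomb/repos | check_org_repos forgejo-actions-probe glas-harness key-cape`. Comparing the full name list rather than only a count would also catch a restore that brings back the wrong repositories.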