RAIL-HO-WP-0005-T09: Forgejo backup/restore drill assets and evidence

Add isolated-namespace restore drill (CNPG cluster, PVC, orchestration script)
and document successful 2026-07-04 run: production forgejo dump restored with
health 200 and pilot repos visible via API. Scheduled backups remain open.
tegwick 2026-07-04 11:26:50 +02:00
parent 2d62317ada
commit 092315895f
5 changed files with 257 additions and 5 deletions


@@ -0,0 +1,93 @@
# Forgejo Backup/Restore Drill Evidence
Date: 2026-07-04
Workplan: RAIL-HO-WP-0005
Task: RAIL-HO-WP-0005-T09
`no_secret_material_recorded: true`
## Purpose
Prove that a production `forgejo dump` can be restored into an isolated
namespace and serve repository metadata without touching production Forgejo or
Gitea.
## Backup source
| Field | Value |
| --- | --- |
| Method | `forgejo dump` from production pod |
| Production pod | `forgejo-gitea-64c5b57684-ph9vt` (namespace `forgejo`) |
| Archive path (workstation) | `/tmp/forgejo-drill/forgejo-drill-backup.zip` |
| Archive size | 12,284,847 bytes (~11.7 MiB) |
| Archive timestamp | 2026-07-04 11:20 +0200 |
| Archive contents (top-level) | `repos/`, `data/`, `forgejo-db.sql`, `app.ini` |
Repos present in dump: `forgejo-actions-probe`, `glas-harness`, `key-cape`
(all under `repos/coulomb/`).
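Before staging the dump, it may be worth confirming that the archive carries the four top-level entries listed above. The following is a minimal sketch, not part of `tools/forgejo-restore-drill.sh`: the helper name `missing_entries` and the variable `REQUIRED_ENTRIES` are hypothetical, and it assumes member names are produced on the workstation with Info-ZIP `unzip -Z1`.

```bash
# Required top-level entries, copied from the "Archive contents" row.
REQUIRED_ENTRIES="repos/ data/ forgejo-db.sql app.ini"

# Hypothetical helper: reads archive member names on stdin (one per line)
# and prints each required entry that is missing. No output means complete.
missing_entries() {
  local listing entry line found
  listing="$(cat)"
  for entry in ${REQUIRED_ENTRIES}; do
    found=0
    while IFS= read -r line; do
      case "${entry}" in
        */) # Directory entry: any member under this prefix counts.
            case "${line}" in "${entry}"*) found=1 ;; esac ;;
        *)  # File entry: require an exact top-level match.
            if [ "${line}" = "${entry}" ]; then found=1; fi ;;
      esac
    done <<EOF
${listing}
EOF
    if [ "${found}" -eq 0 ]; then printf '%s\n' "${entry}"; fi
  done
}

# Example (assumes Info-ZIP unzip is installed on the workstation):
#   unzip -Z1 /tmp/forgejo-drill/forgejo-drill-backup.zip | missing_entries
```

An empty result suggests the layout matches the table; any printed name is an entry the restore steps would not find.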
## Restore target
| Field | Value |
| --- | --- |
| Namespace | `forgejo-restore-drill` |
| Database | CNPG cluster `forgejo-db-restore` (isolated, 1 instance) |
| App data PVC | `forgejo-restore-data` (`local-path`, 10Gi) |
| Helm release | `forgejo-restore` (`gitea-charts/gitea` 12.5.0) |
| Orchestration | `tools/forgejo-restore-drill.sh` |
Restore path (Forgejo 11.0.3 has no `forgejo restore` CLI):
1. Unzip dump into import pod staging area.
2. Copy `repos/` → `/data/git/gitea-repositories/`.
3. Copy `data/` → `/data/` (packages, attachments, avatars).
4. Import `forgejo-db.sql` via `psql` into `forgejo-db-restore`.
5. Deploy isolated Helm release bound to restored PVC + restore DB host.
## Post-restore checks (2026-07-04)
Port-forward: `svc/forgejo-restore-gitea-http` → `127.0.0.1:13000`
| Check | Result |
| --- | --- |
| `GET /` health | HTTP 200 |
| `GET /api/v1/repos/coulomb/glas-harness` | `full_name=coulomb/glas-harness`, `default_branch=main` |
| `GET /api/v1/repos/coulomb/key-cape` | `full_name=coulomb/key-cape`, `default_branch=main` |
| `GET /api/v1/orgs/coulomb/repos` | 3 repos: `forgejo-actions-probe`, `glas-harness`, `key-cape` |
Script exit marker: `restore-drill-complete`
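These checks depend on the port-forward being ready, and the committed script waits a fixed `sleep 5` before the first request. One possible alternative is to poll until the endpoint answers. The following is a minimal sketch; `wait_until` is a hypothetical helper that does not exist in the committed script.

```bash
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds after each failed attempt. Returns 0 on the first
# success and 1 if every attempt fails.
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "${i}" -le "${attempts}" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "${delay}"
  done
  return 1
}

# Example against the drill port-forward (URL taken from the table above):
#   wait_until 30 1 curl -fsS -o /dev/null http://127.0.0.1:13000/
```

Polling bounds the wait by attempts rather than by a guessed delay, so a slow port-forward fails loudly instead of producing a misleading curl error.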
## RPO / RTO (drill scope)
| Metric | Observed / assumed |
| --- | --- |
| RPO (manual dump) | Point-in-time of `forgejo dump` execution; no scheduled backup yet |
| RTO (isolated restore) | ~35 minutes for CNPG ready + import + Helm deploy on railiance01 |
| Production impact | None — read-only dump from running pod; separate namespace |
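The RTO figure would be easier to reproduce if the script recorded its own elapsed time. The following is a minimal sketch, assuming wall-clock seconds from `date +%s` are precise enough for a drill; `format_duration` and the `rto_observed` label are hypothetical names, not part of the committed script.

```bash
# Hypothetical helper: render a number of seconds as <minutes>m<seconds>s.
format_duration() {
  local total="$1"
  printf '%dm%02ds\n' "$((total / 60))" "$((total % 60))"
}

drill_start="$(date +%s)"
# ... CNPG ready, import, and Helm deploy would run here ...
drill_end="$(date +%s)"
echo "rto_observed: $(format_duration "$((drill_end - drill_start))")"
```

Capturing the start and end around the three phases named in the RTO row would let a later run be compared against this one.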
## Gaps (not closed by this drill)
- **Scheduled backups:** CNPG `Backup` CRs and off-cluster target not configured
(`kubectl cnpg` plugin absent on workstation).
- **Encryption at rest:** dump stored locally on workstation for drill only; no
approved backup target wired.
- **Automation:** `forgejo dump` is manual; T04/T09 still need cron/operator
schedule and retention policy (T02 decision).
- **Re-run hygiene:** concurrent or repeat runs require `DRILL_CLEAN=1` to wipe
`forgejo-restore-drill` before import (SQL import is not idempotent).
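For the re-run hygiene gap, the script could refuse to import into an existing namespace unless a clean run was requested. The following is a minimal sketch, assuming `kubectl get namespace` exits non-zero when the namespace is absent; `guard_rerun` is a hypothetical name and is not in the committed script.

```bash
# Hypothetical guard: fail when the drill namespace already exists and
# DRILL_CLEAN was not set to 1 (the SQL import is not idempotent).
guard_rerun() {
  local ns="$1" clean="${2:-0}"
  if [ "${clean}" = "1" ]; then
    return 0
  fi
  if kubectl get namespace "${ns}" >/dev/null 2>&1; then
    echo "namespace ${ns} exists; re-run with DRILL_CLEAN=1" >&2
    return 1
  fi
  return 0
}

# Example: guard_rerun forgejo-restore-drill "${DRILL_CLEAN:-0}"
```

Placed before the namespace is created, a guard like this would stop a repeat run from importing SQL on top of an already populated database.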
## Cleanup
After evidence capture, delete the drill namespace:
```bash
kubectl delete namespace forgejo-restore-drill --wait=true
```
Production Forgejo (`forgejo` namespace) and Gitea remain unchanged.
## References
- `infra/forgejo-restore-drill/forgejo-db-restore-cluster.yaml`
- `infra/forgejo-restore-drill/restore-job.yaml`
- `tools/forgejo-restore-drill.sh`
- `workplans/RAIL-HO-WP-0005-forgejo-production-migration.md` (T09)


@@ -0,0 +1,21 @@
---
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: forgejo-db-restore
  namespace: forgejo-restore-drill
  labels:
    app.kubernetes.io/name: forgejo-db-restore
    railiance.io/layer: s3-platform
    railiance.io/consumer: forgejo-restore-drill
spec:
  instances: 1
  imageName: ghcr.io/cloudnative-pg/postgresql:16
  storage:
    size: 10Gi
  bootstrap:
    initdb:
      database: forgejo
      owner: forgejo
      secret:
        name: forgejo-db-credentials


@@ -0,0 +1,12 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: forgejo-restore-data
  namespace: forgejo-restore-drill
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: local-path

tools/forgejo-restore-drill.sh Executable file

@@ -0,0 +1,115 @@
#!/usr/bin/env bash
# Non-production Forgejo backup/restore drill (RAIL-HO-WP-0005-T09).
# Re-run: DRILL_CLEAN=1 ./tools/forgejo-restore-drill.sh (wipes namespace first)
set -euo pipefail
KUBECONFIG="${KUBECONFIG:-$HOME/.kube/config-hosteurope}"
export KUBECONFIG
NS=forgejo-restore-drill
DRILL_CLEAN="${DRILL_CLEAN:-0}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
ROOT_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
BACKUP_LOCAL="${BACKUP_LOCAL:-/tmp/forgejo-drill/forgejo-drill-backup.zip}"
PROD_POD="${PROD_POD:-$(kubectl get pods -n forgejo -l app.kubernetes.io/instance=forgejo -o jsonpath='{.items[0].metadata.name}')}"
step() { echo "==> $*"; }
if [[ "${DRILL_CLEAN}" == "1" ]]; then
  step "Clean prior drill namespace ${NS}"
  kubectl delete namespace "${NS}" --wait=true --timeout=5m || true
fi
step "Create namespace ${NS}"
kubectl create namespace "${NS}" --dry-run=client -o yaml | kubectl apply -f -
step "Copy forgejo-db-credentials into ${NS}"
kubectl get secret forgejo-db-credentials -n databases -o json \
| python3 -c "import json,sys; s=json.load(sys.stdin); s['metadata']={k:v for k,v in s['metadata'].items() if k in ('name','labels','annotations')}; s['metadata']['namespace']='${NS}'; print(json.dumps(s))" \
| kubectl apply -f -
step "Deploy restore CNPG cluster"
kubectl apply -f "${ROOT_DIR}/infra/forgejo-restore-drill/forgejo-db-restore-cluster.yaml"
kubectl wait --for=condition=Ready cluster/forgejo-db-restore -n "${NS}" --timeout=10m
step "Ensure local backup exists"
if [[ ! -f "${BACKUP_LOCAL}" ]]; then
  kubectl exec -n forgejo "${PROD_POD}" -c gitea -- forgejo dump -f /tmp/forgejo-drill-backup.zip
  mkdir -p "$(dirname "${BACKUP_LOCAL}")"
  kubectl cp "forgejo/${PROD_POD}:/tmp/forgejo-drill-backup.zip" "${BACKUP_LOCAL}" -c gitea
fi
ls -lh "${BACKUP_LOCAL}"
step "Apply restore PVC"
kubectl apply -f "${ROOT_DIR}/infra/forgejo-restore-drill/restore-job.yaml"
step "Run restore pod (stage backup, import files + SQL)"
kubectl delete pod forgejo-restore-import -n "${NS}" --ignore-not-found --wait=true
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: forgejo-restore-import
  namespace: ${NS}
spec:
  restartPolicy: Never
  containers:
    - name: restore
      image: code.forgejo.org/forgejo/forgejo:11.0.3
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
        - name: backup
          mountPath: /backup
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: forgejo-restore-data
    - name: backup
      emptyDir: {}
EOF
kubectl wait --for=condition=Ready pod/forgejo-restore-import -n "${NS}" --timeout=3m
kubectl cp "${BACKUP_LOCAL}" "${NS}/forgejo-restore-import:/backup/forgejo-drill-backup.zip" -c restore
DB_PASS="$(kubectl get secret forgejo-db-credentials -n "${NS}" -o jsonpath='{.data.password}' | base64 -d)"
kubectl exec -n "${NS}" forgejo-restore-import -c restore -- env POSTGRES_PASSWORD="${DB_PASS}" sh -c '
  set -eu
  apk add --no-cache unzip postgresql-client >/dev/null
  rm -rf /data/*
  mkdir -p /data/git/gitea-repositories
  unzip -q /backup/forgejo-drill-backup.zip -d /tmp/dump
  cp -a /tmp/dump/repos/. /data/git/gitea-repositories/
  cp -a /tmp/dump/data/. /data/
  chown -R git:git /data
  PGPASSWORD="${POSTGRES_PASSWORD}" psql -h forgejo-db-restore-rw.forgejo-restore-drill.svc.cluster.local -U forgejo -d forgejo -v ON_ERROR_STOP=1 -f /tmp/dump/forgejo-db.sql
  echo restore-import-ok
'
unset DB_PASS
kubectl delete pod forgejo-restore-import -n "${NS}" --wait=true
step "Deploy isolated Forgejo release"
cd "${HOME}/railiance-apps"
DB_PASS="$(kubectl get secret forgejo-db-credentials -n "${NS}" -o jsonpath='{.data.password}' | base64 -d)"
helm upgrade --install forgejo-restore gitea-charts/gitea --version 12.5.0 \
  --namespace "${NS}" --create-namespace \
  -f helm/forgejo-values.yaml \
  -f helm/forgejo-registry-values.yaml \
  --set strategy.type=Recreate \
  --set persistence.existingClaim=forgejo-restore-data \
  --set gitea.config.database.HOST=forgejo-db-restore-rw.${NS}.svc.cluster.local:5432 \
  --set gitea.config.database.PASSWD="${DB_PASS}" \
  --set gitea.config.server.DOMAIN=forgejo-restore.local \
  --set gitea.config.server.ROOT_URL=http://forgejo-restore.local:3000/ \
  --set gitea.admin.password=restore-drill-local-only \
  --set ingress.enabled=false \
  --wait --timeout=10m
unset DB_PASS
step "Post-restore checks via port-forward"
kubectl port-forward -n "${NS}" svc/forgejo-restore-gitea-http 13000:3000 >/tmp/forgejo-restore-pf.log 2>&1 &
PF_PID=$!
sleep 5
curl -fsS -o /dev/null -w 'health:%{http_code}\n' http://127.0.0.1:13000/
curl -fsS http://127.0.0.1:13000/api/v1/repos/coulomb/glas-harness | python3 -c "import json,sys; d=json.load(sys.stdin); print('repo', d.get('full_name'), d.get('default_branch'))"
curl -fsS http://127.0.0.1:13000/api/v1/repos/coulomb/key-cape | python3 -c "import json,sys; d=json.load(sys.stdin); print('repo', d.get('full_name'), d.get('default_branch'))"
kill "${PF_PID}" 2>/dev/null || true
echo "restore-drill-complete"


@@ -109,7 +109,8 @@ acceptance criteria below are tracked across T05, T07, T08, and T10 instead.
Still to prove before T11:
- SMTP/password reset end-to-end (T06).
-- Backup and restore in isolated namespace (T09).
+- Backup and restore in isolated namespace (T09) — **drill passed 2026-07-04**;
+  scheduled automation pending.
- Issues/releases/wiki/LFS per inventory classification (T10 matrix).
- Operator SSH identity on Forgejo beyond interim `forgejo_admin` keys (T02/T10).
@@ -377,7 +378,7 @@ a pullable image without privileged cluster-wide credentials. **Tier 2: done.**
```task
id: RAIL-HO-WP-0005-T09
-status: todo
+status: progress
priority: high
state_hub_task_id: "25892007-36ca-4bd9-8adf-84d505465d7d"
```
@@ -402,8 +403,17 @@ Acceptance:
- Restore into an isolated namespace is drilled and documented.
- RPO/RTO expectations are recorded.
+**Partial (2026-07-04):** isolated restore drill **passed**. Production
+`forgejo dump` (~11.7 MiB) restored into `forgejo-restore-drill` namespace;
+post-restore API checks: health 200, `coulomb/glas-harness` and
+`coulomb/key-cape` on `main`, 3 org repos visible. Evidence:
+`docs/forgejo-restore-drill-evidence.md`. Assets: `infra/forgejo-restore-drill/`,
+`tools/forgejo-restore-drill.sh`. Remaining: scheduled CNPG/off-cluster backups,
+encryption/approved target (T02/T04), automated dump schedule.
**Done when:** a fresh backup restores to a working isolated Forgejo instance
-with repository, package, and user recovery checks passing.
+with repository, package, and user recovery checks passing **and** scheduled
+backups run without manual intervention.
---
@@ -520,8 +530,9 @@ T05+T08 ──► T10 migration ladder ──► T11 production cutover ──
T03 isolated probe: CANCELLED (superseded by T05 + in-production pilots)
```
-**Current focus (2026-07-04):** T10 tiers 0-2 **complete**; T09 backup drill
-and T02 open decisions (SMTP, backup target) before tier-3 production repos.
+**Current focus (2026-07-04):** T10 tiers 0-2 **complete**; T09 restore drill
+**passed** (scheduled backups + backup target still open); T02 decisions (SMTP,
+backup target) before tier-3 production repos.
Do not start T11 `state-hub` until T09 complete and `CUST-WP-0054` Wave-1
gates satisfied.