railiance-infra/workplans/RAIL-HO-WP-0005-forgejo-production-migration.md
tegwick 092315895f RAIL-HO-WP-0005-T09: Forgejo backup/restore drill assets and evidence
Add isolated-namespace restore drill (CNPG cluster, PVC, orchestration script)
and document successful 2026-07-04 run: production forgejo dump restored with
health 200 and pilot repos visible via API. Scheduled backups remain open.
2026-07-04 11:26:50 +02:00

---
id: RAIL-HO-WP-0005
type: workplan
title: "Forgejo Production Migration on railiance01"
domain: financials
repo: railiance-infra
status: active
owner: railiance
topic_slug: railiance
created: "2026-05-03"
updated: "2026-07-04"
state_hub_workstream_id: "84e17675-0d15-4268-a8bd-540124d37018"
---
# Forgejo Production Migration on railiance01
## Goal
Establish Forgejo as the production-grade source forge and package base for
Railiance, then migrate all repositories and workflows currently relying on
Gitea to the new Forgejo installation.
Forgejo will become the heart of Railiance infrastructure. The work must be
fully automated, backup-backed, recovery-drilled, and suitable for long-lived
operation on railiance01 before any production cutover happens.
**Sequencing update (2026-07-04):** Production Forgejo is live on railiance01
with Gitea still canonical per the safety contract. Repo cutover proceeds
**staged per-repo** using a migration ladder (disposable probes → non-production
pilots → image-capable pilots → production repos). `state-hub` is last. See
`CUST-WP-0054-T04` and
`the-custodian/docs/forgejo-repo-migration-pilot-glas-harness.md`.
## Placement in the Railiance Tooling Set
This workplan lives in `railiance-infra` because it is the cross-layer
production infrastructure coordination plan and belongs next to
`RAIL-HO-WP-0004-production-readiness.md`.
Implementation must respect the OAS repo boundaries:
| Concern | Repo | Layer |
|---------|------|-------|
| Server prerequisites, inventory, OS packages, SSH/system users | `railiance-infra` | S1 |
| k3s runtime prerequisites, namespaces, ingress class, cluster backup hooks | `railiance-cluster` | S2 |
| PostgreSQL, object storage, backup targets, registry storage dependencies | `railiance-platform` | S3 |
| Forgejo Actions runner templates, CI conventions, migration automation | `railiance-enablement` | S4 |
| Forgejo Helm release, app config, mail config, package registry, app backups | `railiance-apps` | S5 |
This file is the umbrella plan. If an implementation step requires files in a
different repo, that repo should receive its own workplan or task before the
change is made there.
## Key Decisions to Confirm
1. ~~Public/private hostname for Forgejo~~ **DECIDED 2026-07-03:**
`forgejo.coulomb.social` → railiance01 (`92.205.62.239`). DNS active;
Traefik edge live; Forgejo workload deployed and serving HTTPS. Gitea remains
canonical until migration drills pass. Record:
`the-custodian/docs/forgejo-production-decisions.md`.
2. Mail delivery path for password reset and account recovery
(SMTP relay, sender domain, SPF/DKIM/DMARC expectations).
3. Package registry scope: container images only at first, or also generic,
npm, PyPI, Go, Maven, and Helm packages.
4. ~~Actions runner model~~ **DECIDED 2026-07-03:** in-cluster long-lived runner
Deployment with DinD sidecar on railiance01 (`ADR-004`). Interim coulombcore
host runner retired after cutover.
5. Backup destination and retention target for database, repositories,
attachments, LFS, Actions artifacts/logs, and package data.
6. Cutover mode: ~~freeze-all vs staged~~ **LEANING staged per-repo (2026-07-04)**
based on `glas-harness` pilot; operator confirmation still needed. Freeze-all
remains fallback for final production wave if drift risk is unacceptable.
## Safety Contract
- Gitea remains the production source of truth until Forgejo restore and
migration drills pass.
- No repository is deleted from Gitea during this workplan.
- A fresh Gitea backup must be taken before every migration drill and before
final cutover.
- Forgejo backups must be restored into an isolated namespace before accepting
production use.
- Password reset and email recovery must be verified with a real controlled
account before onboarding users.
- Forgejo Actions may not receive broad cluster credentials by default; runner
permissions must be least-privilege and repo-scoped where practical.
- Secrets stay in SOPS/age or Kubernetes Secrets managed by the appropriate
repo. No plaintext SMTP passwords, admin tokens, runner tokens, or registry
credentials in Git.
## Probe and pilot strategy (revised 2026-07-04)
Original T03 planned a **disposable isolated-namespace probe** before any
production install. That path was **superseded**: production Forgejo deployed on
railiance01 under the safety contract (Gitea remains canonical; no Gitea deletes).
Integration evidence now comes from **in-production probes and repo pilots**:
| Tier | Repo | Purpose | Status |
| --- | --- | --- | --- |
| 0 | `coulomb/forgejo-actions-probe` | Runner scheduling, DinD, OCI image-build | **done** |
| 1 | `coulomb/glas-harness` | Non-production git+SSH+CI routing drill | **done** |
| 2 | `coulomb/key-cape` | Image-build workflow + registry pull on railiance01 | **done** |
| 3 | Production set (`state-hub`, `issue-core`, …) | Canonical remotes, sweep paths, deploy loops | **gated** |
Each tier must pass before the next one starts. T03 (isolated probe namespace) is cancelled;
acceptance criteria below are tracked across T05, T07, T08, and T10 instead.
Still to prove before T11:
- SMTP/password reset end-to-end (T06).
- Backup and restore in isolated namespace (T09) — **drill passed 2026-07-04**;
scheduled automation pending.
- Issues/releases/wiki/LFS per inventory classification (T10 matrix).
- Operator SSH identity on Forgejo beyond interim `forgejo_admin` keys (T02/T10).
## Target Architecture
```
operator / agents / developers
-> https://forgejo.coulomb.social (railiance01 Traefik ingress)
-> forgejo Service in forgejo namespace
-> Forgejo Deployment/StatefulSet
-> forgejo-db CloudNative PG Cluster in databases namespace
-> Valkey/cache if required
-> persistent storage for repositories, attachments, LFS, packages
-> Actions runner(s) with restricted execution scope
-> backup jobs to the approved backup target
```
## Tasks
### T01 — Inventory current Gitea functionality and migration requirements
```task
id: RAIL-HO-WP-0005-T01
status: progress
priority: high
state_hub_task_id: "cf59d171-5629-45c9-9d44-8d6499827ffc"
```
Create a source-of-truth inventory of current Gitea usage.
First-pass inventory artifact: `docs/forgejo-migration-inventory.md`.
Minimum inventory:
- All repositories in the `coulomb` organization.
- Registered vs unregistered State Hub repos.
- Users, organizations, teams, deploy keys, SSH keys, access tokens.
- Issues, labels, milestones, releases, wiki, packages, LFS, attachments.
- Existing webhook usage and automation assumptions.
- Current Gitea package registry status and the missing `[packages]` config
that is blocking container image publication.
**Done when:** the inventory identifies every feature that must work in
Forgejo before cutover and classifies each migration item as automatic,
manual, unsupported, or explicitly out of scope.
**Gap (2026-07-04):** first-pass inventory predates repos created after
2026-06-04 (e.g. `glas-harness`, `forgejo-actions-probe`). Refresh org repo
list and add a **migration tier** column (0–3) per repo before T11.
---
### T02 — Resolve Forgejo production design decisions
```task
id: RAIL-HO-WP-0005-T02
status: progress
priority: high
needs_human: true
state_hub_task_id: "f88115bf-4f99-49ef-a415-0b23750141b3"
```
Decide the production choices listed in "Key Decisions to Confirm".
**Partial (2026-07-04):** hostname, exposure, deployment pattern, live deploy,
and in-cluster runner model decided (`ADR-004`). Cutover mode **leaning** staged
per-repo (glas-harness pilot). Remaining operator decisions: SMTP, package scope
beyond OCI, backup target, final cutover confirmation. See
`the-custodian/docs/forgejo-production-decisions.md`.
Expected output:
- A short decision record in this workplan or a dedicated ADR.
- Hostname and exposure model. ✓ hostname; exposure follows railiance01 Traefik
- SMTP provider and sender identity.
- Package registry scope.
- Actions runner isolation model.
- Backup target, retention, encryption, and restore cadence.
- Cutover strategy and rollback window.
**Done when:** implementation tasks are no longer blocked by open production
choices.
---
### T03 — Build forgejo-railiance-probe (isolated namespace)
```task
id: RAIL-HO-WP-0005-T03
status: cancel
priority: high
state_hub_task_id: "b516018a-415e-4a58-8c62-07c14ece9353"
```
**Cancelled 2026-07-04:** superseded by production Forgejo on railiance01 (T05)
plus in-production integration probes (`forgejo-actions-probe`, `glas-harness`).
Isolated-namespace probe added latency without reducing risk given the safety
contract (Gitea canonical, no deletes). Remaining T03 acceptance items map to:
T05 (deploy), T06 (mail), T07 (packages), T08 (Actions), T09 (backup restore),
T10 (repo migration drill).
---
### T04 — Define Forgejo platform services
```task
id: RAIL-HO-WP-0005-T04
status: todo
priority: high
state_hub_task_id: "28b351fe-bfbe-4a8b-bbfa-1b148e69f8e0"
```
In `railiance-platform`, define production platform services for Forgejo.
Minimum scope:
- `forgejo-db` CloudNative PG cluster.
- Database credentials via SOPS-managed Secret or approved secret flow.
- Backup configuration for database base backups and WAL archiving.
- Object storage or persistent volume plan for repositories, attachments, LFS,
packages, Actions artifacts, and logs.
- Restore runbook for database and blob/package data.
**Partial (2026-07-04):** `forgejo-db` CNPG cluster healthy on railiance01
(`make forgejo-db-status` → Cluster in healthy state). SOPS secret path and
network policies in `railiance-platform`. Remaining: backup/WAL archiving to
approved target, blob/package storage restore drill (feeds T09).
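
The health status quoted above can be checked directly against the CNPG resource. The following is a minimal sketch, not the actual `make forgejo-db-status` target (its internals are not shown in this workplan). It assumes the cluster resource is named `forgejo-db` in the `databases` namespace, as in the target architecture, and that `kubectl` access is available; the function names are hypothetical.

```shell
# Sketch only: the real `make forgejo-db-status` target may work differently.

# Succeeds only when the reported phase is exactly the healthy-state string.
cnpg_is_healthy() {
    [ "$1" = "Cluster in healthy state" ]
}

# Reads the phase of the forgejo-db Cluster resource (needs cluster access).
# Namespace and resource name are assumptions taken from the architecture sketch.
cnpg_phase() {
    kubectl -n databases get clusters.postgresql.cnpg.io forgejo-db \
        -o jsonpath='{.status.phase}'
}

# Usage on a host with kubectl access:
#   cnpg_is_healthy "$(cnpg_phase)" && echo "forgejo-db healthy"
```

Keeping the string comparison separate from the `kubectl` call lets the gate be exercised without a cluster.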
**Done when:** platform dependencies can be deployed and restored without the
Forgejo app running.
---
### T05 — Define production Forgejo application deployment
```task
id: RAIL-HO-WP-0005-T05
status: progress
priority: high
state_hub_task_id: "11540ba4-d31c-4f64-836b-c6de69107aa4"
```
In `railiance-apps`, create the production Forgejo deployment.
Minimum scope:
- Forgejo Helm release or manifests in the S5 boundary.
- App configuration for database, SSH, HTTPS, mailer, packages, LFS, and
security settings.
- Initial admin/user bootstrap that is automated but does not commit secrets.
- Health/status targets in the Makefile.
- Migration-safe configuration for coexistence with Gitea during the cutover.
**Partial (2026-07-04):** `railiance-apps` deploy live — HTTPS smoke pass,
ingress + TLS, SSH NodePort `30022`, Actions enabled, `coulomb` org,
`railiance01-build-01` runner (ADR-004). Git push/pull via HTTPS and
`forgejo-remote` SSH proven. Remaining: SOPS hardening for all secrets,
SMTP (T06), operator user accounts beyond `forgejo_admin`.
**Done when:** Forgejo runs on railiance01 against production platform
services and can serve login, git clone/push, package registry, and admin
operations.
---
### T06 — Implement usable email recovery cycle
```task
id: RAIL-HO-WP-0005-T06
status: todo
priority: high
needs_human: true
state_hub_task_id: "417faa4d-eab8-4247-9485-4f80e5d5b7ff"
```
Configure and test mail delivery for account recovery.
Minimum scope:
- SMTP credentials stored through the approved secret path.
- Sender address and domain alignment documented.
- Password reset email works for a controlled non-admin account.
- Account recovery runbook covers lost password, lost MFA, disabled account,
and emergency admin access.
- Mail failure is observable through logs or a health check.
**Done when:** a user can complete password recovery without operator database
edits, and the operator has a documented emergency path.
---
### T07 — Enable and harden package registry base
```task
id: RAIL-HO-WP-0005-T07
status: todo
priority: high
state_hub_task_id: "9578f672-e2b8-43a3-8419-5f86f8871326"
```
Enable Forgejo packages for Railiance's near-term build and deployment needs.
Initial package types:
- Container registry for State Hub and future app images.
- Generic packages for release artifacts.
- Additional package types only after the inventory proves they are needed.
Acceptance:
- Authenticated push and pull works from operator workstation and railiance01.
- Container image pull works from k3s deployments.
- Retention and cleanup expectations are documented.
- Package data is included in backup and restore drills.
**Partial (2026-07-04):** OCI registry live (`/v2/` auth challenge). Tier-0/2
images built and pulled on railiance01: `forgejo-actions-probe`, `key-cape`
(`crictl pull forgejo.coulomb.social/coulomb/key-cape:latest` succeeded).
Remaining: `state-hub` image after tier-3 approval; document retention; include
packages in backup drill (T09).
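
The `/v2/` auth challenge mentioned above can be turned into a repeatable smoke check. This is a sketch under assumptions: it treats "status 401 plus a `WWW-Authenticate` header" as the challenge, and the function name `is_registry_challenge` is hypothetical. The exact header contents returned by this Forgejo instance are not verified here.

```shell
# Succeeds when an unauthenticated /v2/ response looks like a registry auth
# challenge: first line carries status 401 and a WWW-Authenticate header exists.
# $1 = raw response headers, e.g. from `curl -s -o /dev/null -D - <url>`.
is_registry_challenge() {
    printf '%s\n' "$1" | tr -d '\r' | awk '
        NR == 1 && $2 == "401" { challenged = 1 }
        tolower($0) ~ /^www-authenticate:/ { header = 1 }
        END { exit !(challenged && header) }'
}

# Usage against the live registry (needs network access):
#   hdrs=$(curl -s -o /dev/null -D - https://forgejo.coulomb.social/v2/)
#   is_registry_challenge "$hdrs" && echo "registry challenge OK"
```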
**Done when:** tier-2 gate is fully satisfied (✓) and tier-3 production images
follow the same pattern after explicit approval.
---
### T08 — Enable Forgejo Actions
```task
id: RAIL-HO-WP-0005-T08
status: progress
priority: high
state_hub_task_id: "f45f98c9-2f02-4224-bbfd-c2e1ec38581e"
```
Enable Forgejo Actions with a least-privilege runner model.
Minimum scope:
- Runner registration automated without committing runner tokens.
- Runner isolation model documented.
- Minimal workflows for lint/test/build on representative repositories.
- Workflow to build and publish a probe container image to Forgejo packages.
- Secret handling policy for Actions.
- Resource limits to avoid repeating previous single-node overload patterns.
**Partial (2026-07-04):** in-cluster runner live
(`railiance-apps/manifests/forgejo-runner.yaml`, ADR-004). Proven workflows: `forgejo-actions-probe`
(image-build), `glas-harness` (host+container CI smoke). Org secrets
`REGISTRY_USER`/`REGISTRY_TOKEN` set. Documented constraints: host runner is
non-root (static docker-cli, no `apk add`); `actions/checkout@v4` fails — use
`git clone` in job. Remaining: reusable workflow templates in
`railiance-enablement` (S4); resource limits review; confirm the runner holds no cluster-admin rights.
**Tier-2 update (2026-07-04):** tier-2 satisfied by `key-cape` (`container-build`,
archive checkout, static docker-cli). Remaining: publish reusable workflow
template in `railiance-enablement` (S4).
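
The `git clone` workaround noted above could look roughly like the job step body below. This is a sketch, not the workflow used in the pilots. It assumes the runner exports the GitHub-compatible variables `GITHUB_SERVER_URL`, `GITHUB_REPOSITORY`, and `GITHUB_SHA`, and that the repository can be cloned without credentials; a private repository would need a token, which is not shown. The function name `checkout_by_clone` is hypothetical.

```shell
# Clone-based checkout for a job step, as a stand-in for actions/checkout.
# Assumes GITHUB_SERVER_URL, GITHUB_REPOSITORY and GITHUB_SHA are set by the
# runner and that the repo is readable without credentials.
# $1 = destination directory (defaults to ./src).
checkout_by_clone() {
    dest=${1:-src}
    git clone --quiet "$GITHUB_SERVER_URL/$GITHUB_REPOSITORY.git" "$dest" &&
        git -C "$dest" checkout --quiet --detach "$GITHUB_SHA"
}
```

Checking out the exact commit, rather than the branch tip, keeps the build tied to the revision that triggered the run.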
**Done when:** tier-2 pilot repo runs Forgejo Actions end-to-end and publishes
a pullable image without privileged cluster-wide credentials. **Tier 2: done.**
---
### T09 — Implement Forgejo backup and restore automation
```task
id: RAIL-HO-WP-0005-T09
status: progress
priority: high
state_hub_task_id: "25892007-36ca-4bd9-8adf-84d505465d7d"
```
Create backup automation for all Forgejo state.
Must cover:
- PostgreSQL database.
- Git repositories.
- Attachments.
- LFS.
- Packages.
- Avatars and app data.
- Actions logs/artifacts if retained.
- App configuration required for restore.
Acceptance:
- Scheduled backups run without manual intervention.
- Backups are encrypted or stored in an approved protected target.
- Restore into an isolated namespace is drilled and documented.
- RPO/RTO expectations are recorded.
**Partial (2026-07-04):** isolated restore drill **passed**. Production
`forgejo dump` (~11.7 MiB) restored into `forgejo-restore-drill` namespace;
post-restore API checks: health 200, `coulomb/glas-harness` and
`coulomb/key-cape` on `main`, 3 org repos visible. Evidence:
`docs/forgejo-restore-drill-evidence.md`. Assets: `infra/forgejo-restore-drill/`,
`tools/forgejo-restore-drill.sh`. Remaining: scheduled CNPG/off-cluster backups,
encryption/approved target (T02/T04), automated dump schedule.
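
The post-restore checks described above (health 200, pilot repos visible via API) can be sketched as follows. This is not the content of `tools/forgejo-restore-drill.sh`, which is not reproduced in this workplan. The base URL argument is hypothetical (for example a port-forward into the drill namespace), the function names are made up for illustration, and the sketch assumes the org repo listing is readable without a token.

```shell
# Succeeds when the JSON on stdin contains a repo with the given full name.
# Grep-based on purpose (no jq dependency); a sketch, not a JSON parser.
repo_visible() {
    grep -Eq "\"full_name\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Runs the two drill checks against a restored instance.
# $1 = base URL of the restored instance (hypothetical, e.g. a port-forward).
post_restore_checks() {
    base=$1
    code=$(curl -s -o /dev/null -w '%{http_code}' "$base/api/healthz") || return 1
    [ "$code" = "200" ] || return 1
    repos=$(curl -fsS "$base/api/v1/orgs/coulomb/repos") || return 1
    for repo in coulomb/glas-harness coulomb/key-cape; do
        printf '%s' "$repos" | repo_visible "$repo" || return 1
    done
}

# Usage (needs a reachable restored instance):
#   post_restore_checks http://localhost:3000 && echo "restore drill checks passed"
```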
**Done when:** a fresh backup restores to a working isolated Forgejo instance
with repository, package, and user recovery checks passing **and** scheduled
backups run without manual intervention.
---
### T10 — Drill Gitea to Forgejo migration (staged ladder)
```task
id: RAIL-HO-WP-0005-T10
status: progress
priority: high
state_hub_task_id: "6befde73-00bc-4643-be0b-a7ce7944e75f"
```
Run staged migration drills from Gitea to Forgejo before production repos move.
**Tier 1 complete (2026-07-04):** `glas-harness` — git history preserved,
`origin` on Forgejo, `gitea` legacy remote retained, SSH+HTTPS push, CI smoke
green. Result matrix:
`the-custodian/docs/forgejo-repo-migration-pilot-glas-harness.md`.
Minimum checks (per tier):
- Git history and default branches preserved.
- Issues, labels, milestones, releases, wiki, and attachments handled per
inventory classification (N/A for tier-1 git-only repos).
- SSH/HTTPS clone and push paths work (`forgejo-remote` in `~/.ssh/config`).
- Existing local remotes can be transformed predictably (`origin`/`gitea` split).
- State Hub registered repo remotes can be updated safely (deferred for tier-1).
- Rollback plan is rehearsed (Gitea copy unchanged).
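
The `origin`/`gitea` remote split from the checklist above can be scripted so that it behaves the same on every checkout. A minimal sketch, assuming the checkout currently has a single `origin` pointing at Gitea; the function name `split_remotes` and the example URL are hypothetical, and `forgejo-remote` is the `~/.ssh/config` host alias named in the checklist.

```shell
# Re-point a checkout at Forgejo while keeping the old remote as `gitea`.
# $1 = checkout directory, $2 = new Forgejo URL for `origin`.
split_remotes() {
    dir=$1
    new_origin=$2
    if git -C "$dir" remote get-url gitea >/dev/null 2>&1; then
        # Already split: only make sure origin points at Forgejo.
        git -C "$dir" remote set-url origin "$new_origin"
    else
        git -C "$dir" remote rename origin gitea &&
            git -C "$dir" remote add origin "$new_origin"
    fi
}

# Usage (path and URL are illustrative):
#   split_remotes ~/src/glas-harness forgejo-remote:coulomb/glas-harness.git
```

Running it a second time leaves both remotes in place, which keeps the Gitea copy available for the rollback rehearsal.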
**Tier 2 complete (2026-07-04):** `key-cape` — multi-stage Dockerfile built and
pushed via archive-checkout workflow; `crictl pull` on railiance01 succeeded.
Evidence in `the-custodian/docs/forgejo-repo-migration-pilot-glas-harness.md`
(tier 2 section).
**Not ready:** `state-hub` (tier 3) until hub-core build context template and
sweep `remote_url` playbook exist.
**Done when:** tiers 0–2 pass with written result matrices and no unknown
critical migration gaps remain for production repos.
---
### T11 — Production cutover from Gitea to Forgejo
```task
id: RAIL-HO-WP-0005-T11
status: todo
priority: high
needs_human: true
state_hub_task_id: "b1b66687-ca33-4971-b312-743c8e059c5e"
```
Execute production migration only after T06, T07, T08, T09, and T10 tier 0–2
gates pass. `state-hub` and other Wave-1 production repos require explicit
operator approval per `CUST-WP-0054` drain sequence.
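
The gate above can be expressed as a small pre-flight check. This is a sketch under assumptions: it reads `TASK=STATUS` lines from stdin, treats T10 as a single status rather than per tier, and uses `done` as the passing value, which is an assumption because the task blocks in this workplan only show `todo`, `progress`, and `cancel`. The function name `t11_blockers` is hypothetical, and operator approval per `CUST-WP-0054` is not modelled.

```shell
# Prints the gate tasks that are not yet done; succeeds only if none remain.
# Reads "TASK=STATUS" lines on stdin. The status value "done" is an assumption.
t11_blockers() {
    awk -F= '
        BEGIN { n = split("T06 T07 T08 T09 T10", gates, " ") }
        { status[$1] = $2 }
        END {
            blocked = 0
            for (i = 1; i <= n; i++)
                if (status[gates[i]] != "done") { print gates[i]; blocked = 1 }
            exit blocked
        }'
}

# Usage with the statuses recorded in this workplan on 2026-07-04:
#   printf '%s\n' T06=todo T07=todo T08=progress T09=progress T10=progress | t11_blockers
```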
**Preferred cutover (staged per-repo):**
1. Per repo: Gitea backup snapshot (or org-wide before each wave).
2. Mirror git to Forgejo; switch workstation `origin` to `forgejo-remote`.
3. Port/verify Actions workflows on Forgejo runner.
4. Update State Hub `remote_url` and railiance01 sweep checkouts when promoted.
5. Mark Gitea repo read-only (org policy); do not delete.
6. Repeat until production set complete.
**Freeze-all fallback:** single window if staged drift is unacceptable — same
steps but all repos in one maintenance period.
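
Step 2 of the staged cutover (mirror git to Forgejo) could be sketched as below. This is an illustration, not the procedure used in the pilots. It assumes the target repository already exists on Forgejo and that credentials for both sides are configured. It pushes only branches and tags, since whether Forgejo accepts other ref namespaces from a mirror clone is not verified here. The function name `mirror_repo` and the example URLs are hypothetical.

```shell
# Step 2 sketch: copy all branches and tags from a Gitea repo to Forgejo.
# Pushes only refs/heads and refs/tags; other refs fetched by --mirror are
# left behind. The source side is only read from.
# $1 = source URL, $2 = destination URL.
mirror_repo() {
    src=$1
    dst=$2
    work=$(mktemp -d) || return 1
    git clone --quiet --mirror "$src" "$work/repo.git" &&
        git -C "$work/repo.git" push --quiet "$dst" \
            'refs/heads/*:refs/heads/*' 'refs/tags/*:refs/tags/*'
    rc=$?
    rm -rf "$work"
    return "$rc"
}

# Usage (URLs are illustrative):
#   mirror_repo gitea-remote:coulomb/key-cape.git forgejo-remote:coulomb/key-cape.git
```

Because the Gitea repository is never written to, the rollback path in the safety contract stays intact.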
**Done when:** all Railiance/Custodian repos use Forgejo as primary, Gitea is
read-only fallback, and rollback instructions are documented.
---
### T12 — Retire or archive legacy Gitea
```task
id: RAIL-HO-WP-0005-T12
status: todo
priority: medium
needs_human: true
state_hub_task_id: "a63147b0-31d5-4705-89ea-40c10faf779f"
```
Retire legacy Gitea only after a stabilization period and explicit approval.
Minimum scope:
- Confirm no active remotes, webhooks, packages, or dashboards depend on Gitea.
- Preserve final Gitea backup.
- Update runbooks and dashboards from Gitea to Forgejo.
- Remove or archive Gitea Helm release according to the rollback decision.
- Close stale State Hub references to `railiance-bootstrap` if confirmed as
an alias rather than a real repo.
**Done when:** Forgejo is the only active source forge and package base, with
legacy Gitea either archived or intentionally retained as documented fallback.
## Phasing and Dependencies
```
T01 inventory ──► T02 decisions ──┬──► T04 platform (forgejo-db ✓ partial)
├──► T05 app (live ✓ partial)
├──► T06 mail recovery
├──► T07 packages (OCI probe ✓ partial)
├──► T08 actions (runner ✓ partial)
└──► T09 backups
T05+T08 ──► T10 migration ladder ──► T11 production cutover ──► T12 Gitea retire
tier0 probe ✓
tier1 glas-harness ✓
tier2 key-cape ✓
tier3 production (gated)
T03 isolated probe: CANCELLED (superseded by T05 + in-production pilots)
```
**Current focus (2026-07-04):** T10 tiers 0–2 **complete**; T09 restore drill
**passed** (scheduled backups + backup target still open); T02 decisions (SMTP,
backup target) before tier-3 production repos.
Do not start the T11 `state-hub` cutover until T09 is complete and the
`CUST-WP-0054` Wave-1 gates are satisfied.
**Absorbed by `CUST-WP-0054-T04`:** forge + CI on railiance01; workstation
build retirement; staged repo promotion before State Hub primary move (T05).
## railiance-bootstrap Note
State Hub currently registers both `railiance-bootstrap` and
`railiance-cluster`, but they point to the same local path
(`/home/worsch/railiance-cluster`) and the same git fingerprint. The
`railiance-bootstrap` entry has no remote URL. The earlier restructure workplan
(`RAIL-HO-WP-0003-T03`) says `railiance-bootstrap` was renamed to
`railiance-cluster`.
Working assumption: `railiance-bootstrap` is a stale logical alias or leftover
repo goal, not a separate Gitea repository. This workplan should not create a
new Forgejo repository named `railiance-bootstrap` unless a concrete remaining
purpose is identified.
## References
- `RAIL-HO-WP-0004-production-readiness.md`
- `RAIL-HO-WP-0003-5repo-stack-restructure.md`
- `CUST-WP-0054-workstation-independence-and-fleet-realignment.md` (T04 forge+CI)
- `CUST-WP-0014-repo-sync-automation.md`
- `CUST-WP-0021-multi-host-repo-paths.md`
- `docs/adr/ADR-004-forgejo-in-cluster-actions-runner.md`
- `docs/forgejo-migration-inventory.md`
- `the-custodian/docs/forgejo-production-decisions.md`
- `the-custodian/docs/forgejo-repo-migration-pilot-glas-harness.md`
- `railiance-apps/docs/forgejo-on-railiance01.md`
- `railiance-forge/docs/forgejo-actions-runner-substrate.md`
- `ops/incidents/2026-03-25-gitea-pgpool-crashloop.md`
- `ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md`