| id | type | title | domain | repo | status | owner | topic_slug | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|
| RAIL-HO-WP-0005 | workplan | Forgejo Production Migration on railiance01 | financials | railiance-infra | active | railiance | railiance | 2026-05-03 | 2026-07-04 | 84e17675-0d15-4268-a8bd-540124d37018 |
Forgejo Production Migration on railiance01
Goal
Establish Forgejo as the production-grade source forge and package base for Railiance, then migrate all repositories and workflows currently relying on Gitea to the new Forgejo installation.
Forgejo will become the heart of Railiance infrastructure. The work must be fully automated, backup-backed, recovery-drilled, and suitable for long-lived operation on railiance01 before any production cutover happens.
Sequencing update (2026-07-04): Production Forgejo is live on railiance01
with Gitea still canonical per the safety contract. Repo cutover proceeds
staged per-repo using a migration ladder (disposable probes → non-production
pilots → image-capable pilots → production repos). state-hub is last. See
CUST-WP-0054-T04 and
the-custodian/docs/forgejo-repo-migration-pilot-glas-harness.md.
Placement in the Railiance Tooling Set
This workplan lives in railiance-infra because it is the cross-layer
production infrastructure coordination plan and belongs next to
RAIL-HO-WP-0004-production-readiness.md.
Implementation must respect the OAS repo boundaries:
| Concern | Repo | Layer |
|---|---|---|
| Server prerequisites, inventory, OS packages, SSH/system users | railiance-infra | S1 |
| k3s runtime prerequisites, namespaces, ingress class, cluster backup hooks | railiance-cluster | S2 |
| PostgreSQL, object storage, backup targets, registry storage dependencies | railiance-platform | S3 |
| Forgejo Actions runner templates, CI conventions, migration automation | railiance-enablement | S4 |
| Forgejo Helm release, app config, mail config, package registry, app backups | railiance-apps | S5 |
This file is the umbrella plan. If an implementation step requires files in a different repo, that repo should receive its own workplan or task before the change is made there.
Key Decisions to Confirm
- Public/private hostname for Forgejo. DECIDED 2026-07-03: forgejo.coulomb.social → railiance01 (92.205.62.239). DNS active; Traefik edge live; Forgejo workload deployed and serving HTTPS. Gitea remains canonical until migration drills pass. Record: the-custodian/docs/forgejo-production-decisions.md.
- Mail delivery path for password reset and account recovery (SMTP relay, sender domain, SPF/DKIM/DMARC expectations).
- Package registry scope: container images only at first, or also generic, npm, PyPI, Go, Maven, and Helm packages.
- Actions runner model. DECIDED 2026-07-03: in-cluster long-lived runner Deployment with DinD sidecar on railiance01 (ADR-004). Interim coulombcore host runner retired after cutover.
- Backup destination and retention target for database, repositories, attachments, LFS, Actions artifacts/logs, and package data.
- Cutover mode: freeze-all vs staged. LEANING staged per-repo (2026-07-04) based on glas-harness pilot; operator confirmation still needed. Freeze-all remains fallback for final production wave if drift risk is unacceptable.
Safety Contract
- Gitea remains the production source of truth until Forgejo restore and migration drills pass.
- No repository is deleted from Gitea during this workplan.
- A fresh Gitea backup must be taken before every migration drill and before final cutover.
- Forgejo backups must be restored into an isolated namespace before accepting production use.
- Password reset and email recovery must be verified with a real controlled account before onboarding users.
- Forgejo Actions may not receive broad cluster credentials by default; runner permissions must be least-privilege and repo-scoped where practical.
- Secrets stay in SOPS/age or Kubernetes Secrets managed by the appropriate repo. No plaintext SMTP passwords, admin tokens, runner tokens, or registry credentials in Git.
Probe and pilot strategy (revised 2026-07-04)
Original T03 planned a disposable isolated-namespace probe before any production install. That path was superseded: production Forgejo deployed on railiance01 under the safety contract (Gitea remains canonical; no Gitea deletes).
Integration evidence now comes from in-production probes and repo pilots:
| Tier | Repo | Purpose | Status |
|---|---|---|---|
| 0 | coulomb/forgejo-actions-probe | Runner scheduling, DinD, OCI image-build | done |
| 1 | coulomb/glas-harness | Non-production git+SSH+CI routing drill | done |
| 2 | coulomb/key-cape | Image-build workflow + registry pull on railiance01 | done |
| 3 | Production set (state-hub, issue-core, …) | Canonical remotes, sweep paths, deploy loops | gated |
Each tier must pass before the next. T03 (isolated probe namespace) is cancelled; acceptance criteria below are tracked across T05, T07, T08, and T10 instead.
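The "each tier must pass before the next" rule can be sketched as a small gate check. This is an illustration only: the `Tier` structure and function names are hypothetical, and the statuses are copied from the table above rather than read from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    level: int
    repo: str
    status: str  # "done" or "gated", as used in the tier table


# Statuses copied from the tier table; the structure itself is hypothetical.
LADDER = [
    Tier(0, "coulomb/forgejo-actions-probe", "done"),
    Tier(1, "coulomb/glas-harness", "done"),
    Tier(2, "coulomb/key-cape", "done"),
    Tier(3, "production set", "gated"),
]


def may_start(ladder, level):
    """A tier may start only if every lower tier is done."""
    return all(t.status == "done" for t in ladder if t.level < level)


def next_open_tier(ladder):
    """Return the lowest tier that is not done, or None if all are done."""
    for tier in sorted(ladder, key=lambda t: t.level):
        if tier.status != "done":
            return tier
    return None
```

With the statuses above, `may_start(LADDER, 3)` is true in the ladder sense only. Tier 3 still needs the explicit operator approval described elsewhere in this workplan, which this sketch does not model.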
Still to prove before T11:
- SMTP/password reset end-to-end (T06).
- Backup and restore in isolated namespace (T09) — drill passed 2026-07-04; scheduled automation pending.
- Issues/releases/wiki/LFS per inventory classification (T10 matrix).
- Operator SSH identity on Forgejo beyond interim forgejo_admin keys (T02/T10).
Target Architecture
operator / agents / developers
-> https://forgejo.coulomb.social (railiance01 Traefik ingress)
-> forgejo Service in forgejo namespace
-> Forgejo Deployment/StatefulSet
-> forgejo-db CloudNative PG Cluster in databases namespace
-> Valkey/cache if required
-> persistent storage for repositories, attachments, LFS, packages
-> Actions runner(s) with restricted execution scope
-> backup jobs to the approved backup target
Tasks
T01 — Inventory current Gitea functionality and migration requirements
id: RAIL-HO-WP-0005-T01
status: progress
priority: high
state_hub_task_id: "cf59d171-5629-45c9-9d44-8d6499827ffc"
Create a source-of-truth inventory of current Gitea usage.
First-pass inventory artifact: docs/forgejo-migration-inventory.md.
Minimum inventory:
- All repositories in the coulomb organization.
- Registered vs unregistered State Hub repos.
- Users, organizations, teams, deploy keys, SSH keys, access tokens.
- Issues, labels, milestones, releases, wiki, packages, LFS, attachments.
- Existing webhook usage and automation assumptions.
- Current Gitea package registry status and the missing [packages] config that is blocking container image publication.
Done when: the inventory identifies every feature that must work in Forgejo before cutover and classifies each migration item as automatic, manual, unsupported, or explicitly out of scope.
Gap (2026-07-04): first-pass inventory predates repos created after
2026-06-04 (e.g. glas-harness, forgejo-actions-probe). Refresh org repo
list and add a migration tier column (0–3) per repo before T11.
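One way to make the T01 "done when" condition checkable is to treat each inventory row as a record and flag rows that lack a classification or a migration tier. The sketch below assumes a hypothetical record layout; it does not reflect the actual format of docs/forgejo-migration-inventory.md, and the sample rows are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

# The four classes named in the T01 "Done when" sentence.
ALLOWED_CLASSES = {"automatic", "manual", "unsupported", "out-of-scope"}
ALLOWED_TIERS = {0, 1, 2, 3}


@dataclass
class InventoryItem:
    repo: str
    feature: str                          # e.g. "git", "issues", "wiki", "lfs"
    classification: Optional[str] = None
    tier: Optional[int] = None


def inventory_gaps(items):
    """Return readable problems that would keep T01 from being done."""
    problems = []
    for item in items:
        if item.classification not in ALLOWED_CLASSES:
            problems.append(f"{item.repo}/{item.feature}: unclassified")
        if item.tier not in ALLOWED_TIERS:
            problems.append(f"{item.repo}/{item.feature}: missing tier")
    return problems


# Illustrative rows only; the classifications are made up for the example.
sample = [
    InventoryItem("coulomb/glas-harness", "git", "automatic", 1),
    InventoryItem("coulomb/state-hub", "wiki"),
]
for line in inventory_gaps(sample):
    print(line)
```

An empty result from `inventory_gaps` would mean every listed item carries both a classification and a tier, which is the refresh asked for in the gap note above.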
T02 — Resolve Forgejo production design decisions
id: RAIL-HO-WP-0005-T02
status: progress
priority: high
needs_human: true
state_hub_task_id: "f88115bf-4f99-49ef-a415-0b23750141b3"
Decide the production choices listed in "Key Decisions to Confirm".
Partial (2026-07-04): hostname, exposure, deployment pattern, live deploy,
and in-cluster runner model decided (ADR-004). Cutover mode leaning staged
per-repo (glas-harness pilot). Remaining operator decisions: SMTP, package scope
beyond OCI, backup target, final cutover confirmation. See
the-custodian/docs/forgejo-production-decisions.md.
Expected output:
- A short decision record in this workplan or a dedicated ADR.
- Hostname and exposure model. ✓ hostname; exposure follows railiance01 Traefik
- SMTP provider and sender identity.
- Package registry scope.
- Actions runner isolation model.
- Backup target, retention, encryption, and restore cadence.
- Cutover strategy and rollback window.
Done when: implementation tasks are no longer blocked by open production choices.
T03 — Build forgejo-railiance-probe (isolated namespace)
id: RAIL-HO-WP-0005-T03
status: cancel
priority: high
state_hub_task_id: "b516018a-415e-4a58-8c62-07c14ece9353"
Cancelled 2026-07-04: superseded by production Forgejo on railiance01 (T05)
plus in-production integration probes (forgejo-actions-probe, glas-harness).
Isolated-namespace probe added latency without reducing risk given the safety
contract (Gitea canonical, no deletes). Remaining T03 acceptance items map to:
T05 (deploy), T06 (mail), T07 (packages), T08 (Actions), T09 (backup restore),
T10 (repo migration drill).
T04 — Define Forgejo platform services
id: RAIL-HO-WP-0005-T04
status: todo
priority: high
state_hub_task_id: "28b351fe-bfbe-4a8b-bbfa-1b148e69f8e0"
In railiance-platform, define production platform services for Forgejo.
Minimum scope:
- forgejo-db CloudNative PG cluster.
- Database credentials via SOPS-managed Secret or approved secret flow.
- Backup configuration for database base backups and WAL archiving.
- Object storage or persistent volume plan for repositories, attachments, LFS, packages, Actions artifacts, and logs.
- Restore runbook for database and blob/package data.
Partial (2026-07-04): forgejo-db CNPG cluster healthy on railiance01
(make forgejo-db-status → Cluster in healthy state). SOPS secret path and
network policies in railiance-platform. Remaining: backup/WAL archiving to
approved target, blob/package storage restore drill (feeds T09).
Done when: platform dependencies can be deployed and restored without the Forgejo app running.
T05 — Define production Forgejo application deployment
id: RAIL-HO-WP-0005-T05
status: progress
priority: high
state_hub_task_id: "11540ba4-d31c-4f64-836b-c6de69107aa4"
In railiance-apps, create the production Forgejo deployment.
Minimum scope:
- Forgejo Helm release or manifests in the S5 boundary.
- App configuration for database, SSH, HTTPS, mailer, packages, LFS, and security settings.
- Initial admin/user bootstrap that is automated but does not commit secrets.
- Health/status targets in the Makefile.
- Migration-safe configuration for coexistence with Gitea during the cutover.
Partial (2026-07-04): railiance-apps deploy live — HTTPS smoke pass,
ingress + TLS, SSH NodePort 30022, Actions enabled, coulomb org,
railiance01-build-01 runner (ADR-004). Git push/pull via HTTPS and
forgejo-remote SSH proven. Remaining: SOPS hardening for all secrets,
SMTP (T06), operator user accounts beyond forgejo_admin.
Done when: Forgejo runs on railiance01 against production platform services and can serve login, git clone/push, package registry, and admin operations.
T06 — Implement usable email recovery cycle
id: RAIL-HO-WP-0005-T06
status: todo
priority: high
needs_human: true
state_hub_task_id: "417faa4d-eab8-4247-9485-4f80e5d5b7ff"
Configure and test mail delivery for account recovery.
Minimum scope:
- SMTP credentials stored through the approved secret path.
- Sender address and domain alignment documented.
- Password reset email works for a controlled non-admin account.
- Account recovery runbook covers lost password, lost MFA, disabled account, and emergency admin access.
- Mail failure is observable through logs or a health check.
Done when: a user can complete password recovery without operator database edits, and the operator has a documented emergency path.
T07 — Enable and harden package registry base
id: RAIL-HO-WP-0005-T07
status: todo
priority: high
state_hub_task_id: "9578f672-e2b8-43a3-8419-5f86f8871326"
Enable Forgejo packages for Railiance's near-term build and deployment needs.
Initial package types:
- Container registry for State Hub and future app images.
- Generic packages for release artifacts.
- Additional package types only after the inventory proves they are needed.
Acceptance:
- Authenticated push and pull works from operator workstation and railiance01.
- Container image pull works from k3s deployments.
- Retention and cleanup expectations are documented.
- Package data is included in backup and restore drills.
Partial (2026-07-04): OCI registry live (/v2/ auth challenge). Tier-0/2
images built and pulled on railiance01: forgejo-actions-probe, key-cape
(crictl pull forgejo.coulomb.social/coulomb/key-cape:latest succeeded).
Remaining: state-hub image after tier-3 approval; document retention; include
packages in backup drill (T09).
Done when: tier-2 gate is fully satisfied (✓) and tier-3 production images follow the same pattern after explicit approval.
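The "/v2/ auth challenge" evidence can be reproduced with a short probe. The sketch assumes the usual OCI registry behaviour, where an unauthenticated `GET /v2/` answers 401 with a `WWW-Authenticate` header. The function names are hypothetical and this is not the check that was actually run.

```python
import urllib.error
import urllib.request


def is_auth_challenge(status, headers):
    """True when a GET /v2/ response looks like a registry auth challenge."""
    normalized = {key.lower(): value for key, value in headers.items()}
    return status == 401 and "www-authenticate" in normalized


def probe_registry(base_url, timeout=10):
    """Unauthenticated GET <base_url>/v2/; returns (status, headers)."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/", timeout=timeout) as resp:
            return resp.status, dict(resp.headers)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the challenge arrives on this path.
        return err.code, dict(err.headers)
```

A manual run would look like `is_auth_challenge(*probe_registry("https://forgejo.coulomb.social"))`. A `True` result only shows that the registry endpoint is enabled; authenticated push and pull still need the separate checks listed under Acceptance.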
T08 — Enable Forgejo Actions
id: RAIL-HO-WP-0005-T08
status: progress
priority: high
state_hub_task_id: "f45f98c9-2f02-4224-bbfd-c2e1ec38581e"
Enable Forgejo Actions with a least-privilege runner model.
Minimum scope:
- Runner registration automated without committing runner tokens.
- Runner isolation model documented.
- Minimal workflows for lint/test/build on representative repositories.
- Workflow to build and publish a probe container image to Forgejo packages.
- Secret handling policy for Actions.
- Resource limits to avoid repeating previous single-node overload patterns.
Partial (2026-07-04): in-cluster runner live (railiance-apps/manifests/forgejo-runner.yaml, ADR-004). Proven workflows: forgejo-actions-probe
(image-build), glas-harness (host+container CI smoke). Org secrets
REGISTRY_USER/REGISTRY_TOKEN set. Documented constraints: host runner is
non-root (static docker-cli, no apk add); actions/checkout@v4 fails — use
git clone in job. Remaining: reusable workflow templates in
railiance-enablement (S4); resource limits review; no cluster-admin on runner.
Partial (2026-07-04): tier-2 satisfied by key-cape (container-build,
archive checkout, static docker-cli). Remaining: publish reusable workflow
template in railiance-enablement (S4).
Done when: tier-2 pilot repo runs Forgejo Actions end-to-end and publishes a pullable image without privileged cluster-wide credentials. Tier 2: done.
T09 — Implement Forgejo backup and restore automation
id: RAIL-HO-WP-0005-T09
status: progress
priority: high
state_hub_task_id: "25892007-36ca-4bd9-8adf-84d505465d7d"
Create backup automation for all Forgejo state.
Must cover:
- PostgreSQL database.
- Git repositories.
- Attachments.
- LFS.
- Packages.
- Avatars and app data.
- Actions logs/artifacts if retained.
- App configuration required for restore.
Acceptance:
- Scheduled backups run without manual intervention.
- Backups are encrypted or stored in an approved protected target.
- Restore into an isolated namespace is drilled and documented.
- RPO/RTO expectations are recorded.
Partial (2026-07-04): isolated restore drill passed. Production
forgejo dump (~11.7 MiB) restored into forgejo-restore-drill namespace;
post-restore API checks: health 200, coulomb/glas-harness and
coulomb/key-cape on main, 3 org repos visible. Evidence:
docs/forgejo-restore-drill-evidence.md. Assets: infra/forgejo-restore-drill/,
tools/forgejo-restore-drill.sh. Remaining: scheduled CNPG/off-cluster backups,
encryption/approved target (T02/T04), automated dump schedule.
Done when: a fresh backup restores to a working isolated Forgejo instance with repository, package, and user recovery checks passing and scheduled backups run without manual intervention.
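The post-restore API checks (health 200, pilot repos on main, org repos visible) could be scripted roughly as follows. This is a sketch, not the contents of tools/forgejo-restore-drill.sh. The endpoint paths `/api/healthz`, `/api/v1/repos/{owner}/{repo}` and `/api/v1/orgs/{org}/repos` are assumed to be the ones exercised, and anonymous read access (or an added token header) is assumed for the drill instance.

```python
import json
import urllib.error
import urllib.request


def http_get(url, timeout=10):
    """Return (status, body_text) for a GET, including non-2xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", errors="replace")


def run_restore_checks(base_url, org, repos, get=http_get):
    """Run post-restore checks against a restored instance.

    Returns a dict: "health" and "repo:<name>" map to booleans,
    "org_repo_count" maps to the number of repos the org listing returned.
    """
    results = {}
    status, _ = get(f"{base_url}/api/healthz")
    results["health"] = status == 200
    for repo in repos:
        status, body = get(f"{base_url}/api/v1/repos/{org}/{repo}")
        on_main = status == 200 and json.loads(body).get("default_branch") == "main"
        results[f"repo:{repo}"] = on_main
    status, body = get(f"{base_url}/api/v1/orgs/{org}/repos")
    results["org_repo_count"] = len(json.loads(body)) if status == 200 else 0
    return results
```

The `get` parameter exists so the checks can be exercised against canned responses before pointing them at the forgejo-restore-drill namespace. The base URL of the restored instance is left to the caller because the source does not state it.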
T10 — Drill Gitea to Forgejo migration (staged ladder)
id: RAIL-HO-WP-0005-T10
status: progress
priority: high
state_hub_task_id: "6befde73-00bc-4643-be0b-a7ce7944e75f"
Run staged migration drills from Gitea to Forgejo before production repos move.
Tier 1 complete (2026-07-04): glas-harness — git history preserved,
origin on Forgejo, gitea legacy remote retained, SSH+HTTPS push, CI smoke
green. Result matrix:
the-custodian/docs/forgejo-repo-migration-pilot-glas-harness.md.
Minimum checks (per tier):
- Git history and default branches preserved.
- Issues, labels, milestones, releases, wiki, and attachments handled per inventory classification (N/A for tier-1 git-only repos).
- SSH/HTTPS clone and push paths work (forgejo-remote in ~/.ssh/config).
- Existing local remotes can be transformed predictably (origin/gitea split).
- State Hub registered repo remotes can be updated safely (deferred for tier-1).
- Rollback plan is rehearsed (Gitea copy unchanged).
Tier 2 complete (2026-07-04): key-cape — multi-stage Dockerfile built and
pushed via archive-checkout workflow; crictl pull on railiance01 succeeded.
Evidence in the-custodian/docs/forgejo-repo-migration-pilot-glas-harness.md
(tier 2 section).
Not ready: state-hub (tier 3) until hub-core build context template and
sweep remote_url playbook exist.
Done when: tiers 0–2 pass with written result matrices and no unknown critical migration gaps remain for production repos.
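The origin/gitea remote split from the per-tier checks can be expressed as a predictable command list. The sketch below assumes that `forgejo-remote` is an SSH host alias whose `~/.ssh/config` entry supplies the user and port, and that the scp-style URL form shown is what the pilots used. Both are assumptions, as is the function name.

```python
def remote_split_commands(org, repo, default_branch="main",
                          forgejo_host="forgejo-remote"):
    """Git commands that keep Gitea as 'gitea' and make Forgejo the new 'origin'."""
    # Assumed URL form: scp-style, resolved through the SSH host alias.
    forgejo_url = f"{forgejo_host}:{org}/{repo}.git"
    return [
        # Keep the legacy remote under a new name instead of deleting it.
        ["git", "remote", "rename", "origin", "gitea"],
        ["git", "remote", "add", "origin", forgejo_url],
        ["git", "fetch", "origin"],
        # The rename moved branch tracking to gitea/<branch>; point it back at origin.
        ["git", "branch", f"--set-upstream-to=origin/{default_branch}", default_branch],
    ]


for cmd in remote_split_commands("coulomb", "glas-harness"):
    print(" ".join(cmd))
```

Each entry is an argument list, so it could be passed to `subprocess.run(cmd, cwd=checkout, check=True)` once the dry-run output has been reviewed. Because the Gitea remote is only renamed, the rollback rehearsal reduces to renaming it back.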
T11 — Production cutover from Gitea to Forgejo
id: RAIL-HO-WP-0005-T11
status: todo
priority: high
needs_human: true
state_hub_task_id: "b1b66687-ca33-4971-b312-743c8e059c5e"
Execute production migration only after T06, T07, T08, T09, and T10 tier 0–2
gates pass. state-hub and other Wave-1 production repos require explicit
operator approval per CUST-WP-0054 drain sequence.
Preferred cutover (staged per-repo):
- Per repo: Gitea backup snapshot (or org-wide before each wave).
- Mirror git to Forgejo; switch workstation origin to forgejo-remote.
- Port/verify Actions workflows on Forgejo runner.
- Update State Hub remote_url and railiance01 sweep checkouts when promoted.
- Mark Gitea repo read-only (org policy); do not delete.
- Repeat until production set complete.
Freeze-all fallback: single window if staged drift is unacceptable — same steps but all repos in one maintenance period.
Done when: all Railiance/Custodian repos use Forgejo as primary, Gitea is read-only fallback, and rollback instructions are documented.
T12 — Retire or archive legacy Gitea
id: RAIL-HO-WP-0005-T12
status: todo
priority: medium
needs_human: true
state_hub_task_id: "a63147b0-31d5-4705-89ea-40c10faf779f"
Retire legacy Gitea only after a stabilization period and explicit approval.
Minimum scope:
- Confirm no active remotes, webhooks, packages, or dashboards depend on Gitea.
- Preserve final Gitea backup.
- Update runbooks and dashboards from Gitea to Forgejo.
- Remove or archive Gitea Helm release according to the rollback decision.
- Close stale State Hub references to railiance-bootstrap if confirmed as an alias rather than a real repo.
Done when: Forgejo is the only active source forge and package base, with legacy Gitea either archived or intentionally retained as documented fallback.
Phasing and Dependencies
T01 inventory ──► T02 decisions ──┬──► T04 platform (forgejo-db ✓ partial)
                                  ├──► T05 app (live ✓ partial)
                                  ├──► T06 mail recovery
                                  ├──► T07 packages (OCI probe ✓ partial)
                                  ├──► T08 actions (runner ✓ partial)
                                  └──► T09 backups
T05+T08 ──► T10 migration ladder ──► T11 production cutover ──► T12 Gitea retire
            tier0 probe ✓
            tier1 glas-harness ✓
            tier2 key-cape ✓
            tier3 production (gated)
T03 isolated probe: CANCELLED (superseded by T05 + in-production pilots)
Current focus (2026-07-04): T10 tiers 0–2 complete; T09 restore drill
passed (scheduled backups + backup target still open); T02 decisions (SMTP,
backup target) before tier-3 production repos.
Do not start T11 state-hub until T09 complete and CUST-WP-0054 Wave-1
gates satisfied.
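The phasing above can also be written as a dependency map, which makes it easy to list what still blocks a task. The sketch encodes the arrows from the diagram plus the T11 prerequisites named in the T11 task text. The map and the helper name are illustrative, not an existing tool.

```python
from graphlib import TopologicalSorter

# task -> set of tasks that must come first (T03 is cancelled and omitted).
DEPENDS_ON = {
    "T02": {"T01"},
    "T04": {"T02"},
    "T05": {"T02"},
    "T06": {"T02"},
    "T07": {"T02"},
    "T08": {"T02"},
    "T09": {"T02"},
    "T10": {"T05", "T08"},
    "T11": {"T06", "T07", "T08", "T09", "T10"},
    "T12": {"T11"},
}


def unmet(task, done):
    """Direct prerequisites of `task` that are not in the `done` set."""
    return sorted(DEPENDS_ON.get(task, set()) - set(done))


# One valid execution order; prerequisites always appear before dependents.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
print(unmet("T11", done={"T08", "T10"}))
```

The sketch treats each task as either done or not done. The workplan's actual gates are finer grained (for example "T10 tiers 0–2" and the CUST-WP-0054 Wave-1 gates), so this is a coarse aid rather than the gate itself.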
Absorbed by CUST-WP-0054-T04: forge + CI on railiance01; workstation
build retirement; staged repo promotion before State Hub primary move (T05).
railiance-bootstrap Note
State Hub currently registers both railiance-bootstrap and
railiance-cluster, but they point to the same local path
(/home/worsch/railiance-cluster) and the same git fingerprint. The
railiance-bootstrap entry has no remote URL. The earlier restructure workplan
(RAIL-HO-WP-0003-T03) says railiance-bootstrap was renamed to
railiance-cluster.
Working assumption: railiance-bootstrap is a stale logical alias or leftover
repo goal, not a separate Gitea repository. This workplan should not create a
new Forgejo repository named railiance-bootstrap unless a concrete remaining
purpose is identified.
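The alias suspicion rests on two entries sharing both a local path and a git fingerprint. A check of that kind can be sketched as below. The record fields and the fingerprint value are hypothetical, since the State Hub schema is not described here.

```python
from collections import defaultdict


def find_aliases(entries):
    """Group registry entries that share both local path and git fingerprint."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry["path"], entry["fingerprint"])].append(entry["name"])
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical records; "abc123" is a placeholder, not a real fingerprint.
entries = [
    {"name": "railiance-bootstrap", "path": "/home/worsch/railiance-cluster",
     "fingerprint": "abc123"},
    {"name": "railiance-cluster", "path": "/home/worsch/railiance-cluster",
     "fingerprint": "abc123"},
    {"name": "railiance-infra", "path": "/home/worsch/railiance-infra",
     "fingerprint": "def456"},
]
print(find_aliases(entries))
```

Any group this returns is a candidate alias to confirm by hand before closing the stale reference under T12.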
References
- RAIL-HO-WP-0004-production-readiness.md
- RAIL-HO-WP-0003-5repo-stack-restructure.md
- CUST-WP-0054-workstation-independence-and-fleet-realignment.md (T04 forge+CI)
- CUST-WP-0014-repo-sync-automation.md
- CUST-WP-0021-multi-host-repo-paths.md
- docs/adr/ADR-004-forgejo-in-cluster-actions-runner.md
- docs/forgejo-migration-inventory.md
- the-custodian/docs/forgejo-production-decisions.md
- the-custodian/docs/forgejo-repo-migration-pilot-glas-harness.md
- railiance-apps/docs/forgejo-on-railiance01.md
- railiance-forge/docs/forgejo-actions-runner-substrate.md
- ops/incidents/2026-03-25-gitea-pgpool-crashloop.md
- ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md