diff --git a/docs/adr/ADR-004-forgejo-in-cluster-actions-runner.md b/docs/adr/ADR-004-forgejo-in-cluster-actions-runner.md
new file mode 100644
index 0000000..1f56d2a
--- /dev/null
+++ b/docs/adr/ADR-004-forgejo-in-cluster-actions-runner.md
@@ -0,0 +1,104 @@
+# ADR-004 — Forgejo In-Cluster Actions Runner on railiance01
+
+**Status:** Accepted
+**Date:** 2026-07-03
+**Deciders:** Bernd Worsch (operator), custodian agents
+**Workplans:** `RAIL-HO-WP-0005-T02`, `CUST-WP-0054-T04`
+
+---
+
+## Context
+
+Forgejo production runs on **railiance01 k3s** (`railiance-apps`, S5). An interim
+**host runner** on coulombcore proved Actions scheduling (`coulomb/forgejo-actions-probe`)
+but:
+
+- coulombcore is a legacy machine slated for drain (CUST-WP-0054-T03).
+- Host runners require Docker or Podman on the OS — not installed, not desired on
+  coulombcore long term.
+- Forgejo upstream recommends **not** co-locating runners on the same machine as the
+  forge instance; in-cluster **separate pods** satisfy isolation while staying on the
+  production fleet node.
+- `RAIL-HO-WP-0005-T02` left the runner model undecided among host, in-cluster, and
+  ephemeral options.
+
+Goal: a **coherent Kubernetes-from-the-start** CI substrate — Forgejo app, database,
+ingress, and Actions runner all lifecycle-managed on railiance01.
+
+## Decision
+
+### Runner placement
+
+Deploy **one long-lived Forgejo Actions runner Deployment** in the `forgejo` namespace
+on railiance01:
+
+| Component | Implementation |
+| --- | --- |
+| Runner | `data.forgejo.org/forgejo/runner:6.3.1` |
+| Container runtime for jobs | `docker:dind` sidecar (privileged) |
+| State | PVC `forgejo-runner-data` (`.runner`, `config.yaml`, action cache) |
+| Registration scope | `coulomb` organization |
+| Runner name | `railiance01-build-01` |
+| Deploy surface | `railiance-apps/manifests/forgejo-runner.yaml` |
+| Operator targets | `make forgejo-runner-deploy`, `forgejo-runner-status` |
+
+### Label contract
+
+Preserve Gitea migration compatibility and semantic capability labels:
+
+```text
+self-hosted:host,linux:host,linux_amd64:host,container-build:host,registry-publish:host,railiance01:host,ubuntu-latest:docker://node:20-bookworm,docker:docker://node:20-bookworm
+```
+
+### Security boundaries
+
+- Runner pod receives **no** cluster-admin kubeconfig and **no** OpenBao tokens by default.
+- `registry-publish` jobs use **repo/org-scoped Forgejo secrets** only.
+- DinD sidecar runs **privileged** — accepted for single-node railiance01 with
+  dedicated `forgejo` namespace; revisit when a third node or multi-tenant runners appear.
+- Registration tokens live in Kubernetes Secret `forgejo-runner-registration` (SOPS
+  template committed; live value never in Git).
+
+### Retire interim host runner
+
+Stop and disable `forgejo-runner.service` on coulombcore after in-cluster runner is
+healthy. Do not register new host runners without an explicit ADR amendment.
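Review sketch: the label contract above can be sanity-checked before it is written into the runner configuration. This is a minimal Python sketch, assuming the comma-separated `name:backend` form shown in the contract, where the backend is either `host` or `docker://<image>`. The names `LABEL_CONTRACT` and `parse_runner_labels` are illustrative only and are not part of Forgejo or the runner.

```python
# Hypothetical helper: parse the ADR-004 label contract into a lookup table.
# Assumption: each entry is "name:host" or "name:docker://<image>".

LABEL_CONTRACT = (
    "self-hosted:host,linux:host,linux_amd64:host,container-build:host,"
    "registry-publish:host,railiance01:host,"
    "ubuntu-latest:docker://node:20-bookworm,docker:docker://node:20-bookworm"
)


def parse_runner_labels(spec):
    """Return {label_name: (backend, image_or_None)} for a label string."""
    parsed = {}
    for entry in spec.split(","):
        # Split on the first colon only; image references contain colons too.
        name, _, backend = entry.strip().partition(":")
        if not name:
            raise ValueError(f"empty label name in entry {entry!r}")
        if name in parsed:
            raise ValueError(f"duplicate label {name!r}")
        if backend == "host":
            parsed[name] = ("host", None)
        elif backend.startswith("docker://"):
            parsed[name] = ("docker", backend[len("docker://"):])
        else:
            raise ValueError(f"unsupported backend in entry {entry!r}")
    return parsed


labels = parse_runner_labels(LABEL_CONTRACT)
host_labels = sorted(n for n, (kind, _) in labels.items() if kind == "host")
print(host_labels)
print(labels["ubuntu-latest"])
```

A check like this could run in CI against the manifest so that a typo in one label (for example a missing `docker://` prefix) fails fast instead of surfacing as jobs that never get scheduled.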
+
+## Alternatives considered
+
+| Option | Outcome |
+| --- | --- |
+| Host runner + Docker on coulombcore | Rejected — legacy host, contradicts drain plan |
+| Host runner + Podman on haskelseed | Viable fallback; not chosen as primary |
+| Kaniko/Buildah without DinD | Deferred — higher workflow churn during Gitea migration |
+| Multiple ephemeral runner Jobs | Deferred — start with capacity=1 long-lived pod |
+
+## Consequences
+
+**Positive**
+
+- Single-machine production loop: forge + runner on railiance01, workstation not required.
+- Container image CI (`docker build` / `docker push`) works without OS-level Docker.
+- Runner upgrades roll with Git-managed manifests and `kubectl`/Makefile.
+
+**Negative / follow-on**
+
+- Privileged DinD increases blast radius within the node — monitor and restrict namespace RBAC.
+- SOPS-encrypted registration secret still requires operator age key.
+- `cluster-deploy` / `s5-release-check` labels remain **out of scope** until credential paths reviewed.
+
+## Ownership (OAS)
+
+| Concern | Repo | Layer |
+| --- | --- | --- |
+| ADR + umbrella sequencing | `railiance-infra` | S1 |
+| Runner manifests + Makefile | `railiance-apps` | S5 |
+| Label contract + runner evidence docs | `railiance-forge` | S5 forge substrate |
+| Reusable workflow templates | `railiance-enablement` | S4 |
+
+## References
+
+- `railiance-apps/docs/forgejo-on-railiance01.md`
+- `railiance-forge/docs/forgejo-actions-runner-substrate.md`
+- `the-custodian/docs/forgejo-production-decisions.md`
+- [Forgejo runner installation](https://forgejo.org/docs/v11.0/admin/actions/runner-installation/)
\ No newline at end of file
diff --git a/workplans/RAIL-HO-WP-0005-forgejo-production-migration.md b/workplans/RAIL-HO-WP-0005-forgejo-production-migration.md
index 4614e83..be28c70 100644
--- a/workplans/RAIL-HO-WP-0005-forgejo-production-migration.md
+++ b/workplans/RAIL-HO-WP-0005-forgejo-production-migration.md
@@ -46,14 +46,18 @@ change is made there.
 
 ## Key Decisions to Confirm
 
-1. Public/private hostname for Forgejo and whether Gitea remains reachable
-   during the transition.
+1. ~~Public/private hostname for Forgejo~~ **DECIDED 2026-07-03:**
+   `forgejo.coulomb.social` → railiance01 (`92.205.62.239`). DNS active;
+   Traefik edge live; Forgejo workload not deployed yet (404). Gitea remains
+   canonical until migration drills pass. Record:
+   `the-custodian/docs/forgejo-production-decisions.md`.
 2. Mail delivery path for password reset and account recovery (SMTP relay,
    sender domain, SPF/DKIM/DMARC expectations).
 3. Package registry scope: container images only at first, or also generic,
    npm, PyPI, Go, Maven, and Helm packages.
-4. Actions runner model: in-cluster ephemeral runners, long-lived runner pod,
-   or isolated host runner.
+4. ~~Actions runner model~~ **DECIDED 2026-07-03:** in-cluster long-lived runner
+   Deployment with DinD sidecar on railiance01 (`ADR-004`). Interim coulombcore
+   host runner retired after cutover.
 5. Backup destination and retention target for database, repositories,
    attachments, LFS, Actions artifacts/logs, and package data.
 6. Cutover mode: freeze-and-migrate all repos in one window, or staged
@@ -98,8 +102,7 @@ The probe is destroyed or explicitly archived after production Forgejo is live.
 
 ```
 operator / agents / developers
-  -> private HTTPS endpoint
-  -> railiance01 ingress
+  -> https://forgejo.coulomb.social (railiance01 Traefik ingress)
   -> forgejo Service in forgejo namespace
   -> Forgejo Deployment/StatefulSet
   -> forgejo-db CloudNative PG Cluster in databases namespace
@@ -144,7 +147,7 @@ manual, unsupported, or explicitly out of scope.
 
 ```task
 id: RAIL-HO-WP-0005-T02
-status: todo
+status: progress
 priority: high
 needs_human: true
 state_hub_task_id: "f88115bf-4f99-49ef-a415-0b23750141b3"
@@ -152,10 +155,14 @@ state_hub_task_id: "f88115bf-4f99-49ef-a415-0b23750141b3"
 
 Decide the production choices listed in "Key Decisions to Confirm".
 
+**Partial (2026-07-03):** hostname and in-cluster runner model decided (`ADR-004`).
+Remaining: SMTP, package scope, backup, cutover mode. See
+`the-custodian/docs/forgejo-production-decisions.md`.
+
 Expected output:
 
 - A short decision record in this workplan or a dedicated ADR.
-- Hostname and exposure model.
+- Hostname and exposure model. ✓ hostname; exposure follows railiance01 Traefik
 - SMTP provider and sender identity.
 - Package registry scope.
 - Actions runner isolation model.
@@ -229,7 +236,7 @@ Forgejo app running.
 
 ```task
 id: RAIL-HO-WP-0005-T05
-status: todo
+status: progress
 priority: high
 state_hub_task_id: "11540ba4-d31c-4f64-836b-c6de69107aa4"
 ```
@@ -245,6 +252,10 @@ Minimum scope:
 - Health/status targets in the Makefile.
 - Migration-safe configuration for coexistence with Gitea during the cutover.
 
+**Partial (2026-07-03):** `railiance-apps` deploy live — HTTPS smoke pass, Actions
+enabled, `coulomb` org + probe workflow success. Remaining: SOPS secrets,
+SMTP, Docker on runner host for image builds, migration drills.
+
 **Done when:** Forgejo runs on railiance01 against production platform
 services and can serve login, git clone/push, package registry, and admin
 operations.
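
Review sketch: the "HTTPS smoke pass" noted for T05 and the earlier 404 state recorded under Key Decisions (Traefik edge live, Forgejo workload not deployed) can be told apart with a small probe. The following Python is a sketch only. `classify_status`, `smoke_check`, and the state names are hypothetical, and it assumes the probe targets the site root, where a 404 means no Forgejo workload is answering behind the ingress.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Map an HTTP status from the site root to a coarse deployment state."""
    if 200 <= status < 400:
        # Something answered successfully; this does not prove it is Forgejo.
        return "serving"
    if status == 404:
        # Matches the workplan note: edge live, workload not deployed yet.
        return "edge-live-no-workload"
    return "unhealthy"


def smoke_check(url, timeout=10.0):
    """Probe `url` once and return a state name; never raises on HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the code is still informative.
        return classify_status(err.code)
    except OSError:
        # Covers DNS failures, refused connections, TLS errors, and timeouts.
        return "unreachable"


if __name__ == "__main__":
    print(smoke_check("https://forgejo.coulomb.social/"))
```

A status probe like this only covers the ingress path. The "Done when" criteria (login, git clone/push, package registry, admin operations) still need their own checks.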