docs: add ThreePhoenix architecture concept and workplan

RailianceThreePhoenix: 3-node HA Kubernetes cluster with embedded etcd, Longhorn distributed storage, PostgreSQL HA (repmgr + Pgpool-II), and Phoenix CronJob for weekly node rotation to prevent configuration drift.

ThreePhoenixWorkplan: 7-phase implementation plan from blank Ubuntu nodes to self-healing Gitea cluster with monitoring and alert silencing.

Also adds CLAUDE.md with Custodian State Hub session protocol.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 01:13:05 +01:00 · 2026-02-25 01:13:05 +01:00 · eb8a6902b6
commit eb8a6902b6
parent b7696e657f
3 changed files with 477 additions and 0 deletions
--- a/wiki/RailianceThreePhoenix.md
+++ b/wiki/RailianceThreePhoenix.md
@@ -0,0 +1,98 @@
+RailianceThreePhoenix
+
+*Three-machine failover load balancing*
+
+Architecture documentation for **RailianceThreePhoenix** service operations automation. 
+
+This document is intended as the source of truth for Railiance infrastructure. It enables deployment of future services (such as Zulip or Matrix) on a resilient load-balancing and failover pattern for running cloud services efficiently.
+
+Setting up and running Gitea on PostgreSQL in Kubernetes on Ubuntu serves as the practical use case and reference implementation for this DevopsPattern.
+
+# ThreePhoenix System Architecture
+
+**Version:** 1.0 | **Status:** Draft | **Type:** High-Availability Kubernetes Cluster
+
+### 1. Executive Summary
+
+The ThreePhoenix architecture is a **self-healing, 3-node Kubernetes cluster** designed for high availability and automated maintenance. It utilizes a "Phoenix Server" pattern where application components are regularly destroyed and recreated from scratch to prevent configuration drift, memory leaks, and state corruption.
+
+### 2. Physical & Infrastructure Layer
+
+* **Hardware:** 3x Ubuntu Server nodes (Physical or Virtual).
+* **Orchestration:** **K3s** (Lightweight Kubernetes).
+  * **Topology:** Multi-Master HA (Embedded etcd datastore).
+  * **Failure Tolerance:** Cluster survives the loss of any single node (N-1 redundancy).
+* **Storage (CSI):** **Longhorn** (Distributed Block Storage).
+  * **Replication:** Volume data is synchronously replicated across all 3 nodes.
+  * **Access Mode:** `ReadWriteMany` (RWX) enabled for shared application data (e.g., Gitea repositories).
+
+
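The single-node failure tolerance stated above follows from majority quorum in the embedded etcd datastore: a cluster of n voting members stays writable only while more than half of them are healthy. A minimal sketch of that arithmetic follows; the function names are illustrative and are not part of K3s or etcd.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """Number of members that can be lost while a majority remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"members={n} quorum={quorum(n)} tolerated_failures={failure_tolerance(n)}")
```

For three members the quorum is two, so exactly one node may fail. A two-node cluster tolerates no failures, which is why three is the smallest useful size for this pattern.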
+
+### 3. Application Stack (The Standard Unit)
+
+Every stateful service deployed to the cluster (e.g., Gitea) must adhere to this topology:
+
+| Layer | Component | Configuration Strategy |
+| --- | --- | --- |
+| **Ingress** | Nginx Ingress | **SSL Termination** via Cert-Manager (Let's Encrypt). No ports exposed directly. |
+| **Traffic** | ClusterIP | Internal-only communication. |
+| **Routing** | Pgpool-II | **Load Balancing:** Reads (SELECT) distributed across the 3 database nodes. Writes (INSERT, UPDATE, DELETE) sent to the Primary. |
+| **Compute** | Stateless App | **ReplicaCount: 3**. Pod anti-affinity ensures one pod per physical node. |
+| **Database** | PostgreSQL HA | **Repmgr Cluster:** 1 Primary, 2 Standbys. Asynchronous replication. |
+| **Data** | Persistent Volume | **Longhorn StorageClass.** ReclaimPolicy: Retain (for safety) or Delete (if relying on Phoenix). |
+
+### 4. The "Phoenix" Automation Engine
+
+A centralized **CronJob** (`phoenix-maintenance`) manages the lifecycle of stateful workloads.
+
+* **Schedule:** Weekly (Sunday 03:00 UTC).
+* **Cycle:** 3-Week Rotation.
+  * **Week 1:** Destroy & Re-clone Standby Node B.
+  * **Week 2:** Destroy & Re-clone Standby Node C.
+  * **Week 3:** **Switchover Event.** Promote Standby B to Primary -> Destroy old Primary Node A.
+* **Objective:** No database pod lives longer than 21 days.
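
One way the weekly job could decide which step of the 3-week rotation to run is to count whole weeks since a fixed cycle start. The sketch below assumes a hypothetical anchor date (`CYCLE_START`, a Sunday) and a hypothetical helper name (`phoenix_action`); neither comes from the source. The node letters are the role labels used in the rotation above. In standard cron syntax, Sunday 03:00 is written `0 3 * * 0`.

```python
from datetime import date

# Hypothetical anchor: the Sunday on which Week 1 of a cycle runs.
CYCLE_START = date(2026, 3, 1)

# The three steps of the rotation, in order, as described in the document.
ACTIONS = (
    "Destroy and re-clone Standby Node B",
    "Destroy and re-clone Standby Node C",
    "Switchover: promote Standby B to Primary, then destroy old Primary Node A",
)


def phoenix_action(run_date: date, cycle_start: date = CYCLE_START) -> str:
    """Return the rotation step that applies on `run_date`."""
    weeks_elapsed = (run_date - cycle_start).days // 7
    return ACTIONS[weeks_elapsed % len(ACTIONS)]


if __name__ == "__main__":
    for day in (date(2026, 3, 1), date(2026, 3, 8), date(2026, 3, 15), date(2026, 3, 22)):
        print(day.isoformat(), "->", phoenix_action(day))
```

Deriving the step from the calendar keeps the job stateless: a missed or failed run does not shift later weeks, at the cost of skipping that week's step until the next cycle.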
+
+---
+
+### Appendix A: Acceptance Criteria (The Audit Checklist)
+
+Use this checklist for your monthly/quarterly "Health Check." If any item fails, the system is deteriorating.
+
+#### I. Infrastructure Integrity
+
+* [ ] **Node Health:** All 3 nodes report `Ready` status in `kubectl get nodes`.
+* [ ] **Distribution:** `kubectl get pods -o wide` confirms Gitea pods are running on 3 *different* physical nodes (Anti-Affinity is working).
+* [ ] **Storage Sync:** Longhorn UI shows all volumes have "Healthy" status with **3 replicas**. No "Degraded" volumes allowed.
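
The distribution check can be automated against the JSON form of the pod listing (`kubectl get pods -o json`), where each item carries `metadata.name` and `spec.nodeName`. This is a minimal sketch; the helper name `spread_ok` and the assumption that Gitea pod names contain the string `gitea` are illustrative.

```python
def spread_ok(pod_list: dict, expected: int = 3, name_filter: str = "gitea") -> bool:
    """True if exactly `expected` matching pods run on `expected` distinct nodes."""
    nodes = [
        item["spec"]["nodeName"]
        for item in pod_list.get("items", [])
        if name_filter in item["metadata"]["name"] and "nodeName" in item.get("spec", {})
    ]
    # One pod per node means no node name appears twice.
    return len(nodes) == expected and len(set(nodes)) == expected


if __name__ == "__main__":
    sample = {
        "items": [
            {"metadata": {"name": "gitea-a"}, "spec": {"nodeName": "node1"}},
            {"metadata": {"name": "gitea-b"}, "spec": {"nodeName": "node2"}},
            {"metadata": {"name": "gitea-c"}, "spec": {"nodeName": "node3"}},
        ]
    }
    print(spread_ok(sample))
```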
+
+#### II. Database & Persistence
+
+* [ ] **Cluster State:** `kubectl exec <primary-pod> -- repmgr cluster show` lists exactly **1 Primary** and **2 Standbys**.
+* [ ] **Replication Lag:** Lag is `< 1 second` for all standbys (visible in Grafana or Pgpool status).
+* [ ] **Load Balancing:** Pgpool logs confirm `SELECT` queries are being routed to Standby nodes (verifies Read-Scaling is active).
+* [ ] **Backup Validation:** A backup file exists in the external S3 bucket/location with a timestamp `< 24 hours` old. **Crucial:** File size is consistent with previous days.
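
The backup check combines two conditions: the newest file is less than 24 hours old, and its size is in line with earlier backups. The sketch below makes "consistent" concrete by comparing against the median of previous sizes with a 20% tolerance. That threshold, the function name `backup_is_healthy`, and its parameters are assumptions for illustration; the source does not define them.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def backup_is_healthy(
    latest_time: datetime,
    latest_size: int,
    previous_sizes: list,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
    tolerance: float = 0.2,  # assumed threshold, not from the source
) -> bool:
    """True if the latest backup is fresh and its size matches recent history."""
    fresh = now - latest_time < max_age
    if not previous_sizes:
        # No history yet: only freshness can be judged.
        return fresh
    baseline = median(previous_sizes)
    consistent = abs(latest_size - baseline) <= tolerance * baseline
    return fresh and consistent


if __name__ == "__main__":
    now = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
    print(backup_is_healthy(now - timedelta(hours=6), 1050, [1000, 1010, 990], now))
```

A sudden drop in size is the case this guards against: a backup job can succeed and still write a nearly empty dump.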
+
+#### III. Security & Network
+
+* [ ] **SSL Validity:** `git.yourdomain.com` certificate expires in `> 30 days`.
+* [ ] **Port Scan:** Running `nmap` against the public IP reveals **ONLY** ports 80 (HTTP) and 443 (HTTPS). Database ports (5432) must be `Closed`/`Filtered`.
+* [ ] **Ingress Check:** Accessing the application via HTTP automatically redirects to HTTPS (301 Redirect).
+
+#### IV. Phoenix Mechanics
+
+* [ ] **Job History:** `kubectl get jobs` shows the last `phoenix-maintenance` job has status `Completed` (not `Failed`).
+* [ ] **Pod Age:** No `postgresql` pod has an "Age" greater than **22 days**. (If one is 170 days old, the automation is broken).
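
The pod age check can be scripted against `kubectl get pods -o json`, where each item exposes `metadata.creationTimestamp` as an RFC 3339 UTC timestamp. This is a minimal sketch: the helper name `stale_pods` is hypothetical, and it assumes database pod names contain `postgresql`, as the checklist item suggests.

```python
from datetime import datetime, timedelta, timezone


def stale_pods(
    pod_list: dict,
    now: datetime,
    max_age: timedelta = timedelta(days=22),
    name_filter: str = "postgresql",
) -> list:
    """Return names of matching pods older than `max_age`."""
    stale = []
    for item in pod_list.get("items", []):
        meta = item["metadata"]
        if name_filter not in meta["name"]:
            continue
        # Kubernetes reports creation time like "2026-02-20T03:00:00Z".
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > max_age:
            stale.append(meta["name"])
    return stale


if __name__ == "__main__":
    sample = {
        "items": [
            {"metadata": {"name": "postgresql-0", "creationTimestamp": "2026-02-20T03:00:00Z"}},
            {"metadata": {"name": "postgresql-1", "creationTimestamp": "2025-09-12T03:00:00Z"}},
        ]
    }
    print(stale_pods(sample, datetime(2026, 3, 1, tzinfo=timezone.utc)))
```

An empty result means the rotation is keeping up; any name in the list points at a pod the Phoenix job has not recycled.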
+
+#### V. Disaster Recovery Drill (Quarterly)
+
+* [ ] **The "Kill" Test:** Manually delete a Gitea Pod.
+  * *Pass Criteria:* Site remains accessible (via other 2 pods). New pod spawns and joins within 2 minutes.
+* [ ] **The "Restore" Test:** Restore the database backup to a *test* namespace.
+  * *Pass Criteria:* You can log in and see the latest repositories.
+