docs: add ThreePhoenix architecture concept and workplan
RailianceThreePhoenix: 3-node HA Kubernetes cluster with embedded etcd, Longhorn distributed storage, PostgreSQL HA (repmgr + Pgpool-II), and Phoenix CronJob for weekly node rotation to prevent configuration drift. ThreePhoenixWorkplan: 7-phase implementation plan from blank Ubuntu nodes to self-healing Gitea cluster with monitoring and alert silencing. Also adds CLAUDE.md with Custodian State Hub session protocol. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent
b7696e657f
commit
eb8a6902b6
3 changed files with 477 additions and 0 deletions
98
wiki/RailianceThreePhoenix.md
Normal file
@@ -0,0 +1,98 @@
RailianceThreePhoenix

*Three-machine failover and load balancing*

Architecture documentation for **RailianceThreePhoenix** service operations automation.

This document is designed to be the "source of truth" for Railiance infrastructure, enabling deployment of future services (like Zulip, Matrix, ...) using a resilient load-balancing and failover pattern to run cloud services efficiently.

Setting up and running Gitea on PostgreSQL in Kubernetes on Ubuntu will serve as the practical use case and reference implementation for this DevopsPattern.

# ThreePhoenix System Architecture

**Version:** 1.0 | **Status:** Draft | **Type:** High-Availability Kubernetes Cluster

### 1. Executive Summary

The ThreePhoenix architecture is a **self-healing, 3-node Kubernetes cluster** designed for high availability and automated maintenance. It utilizes a "Phoenix Server" pattern where application components are regularly destroyed and recreated from scratch to prevent configuration drift, memory leaks, and state corruption.

### 2. Physical & Infrastructure Layer

* **Hardware:** 3x Ubuntu Server nodes (Physical or Virtual).
* **Orchestration:** **K3s** (Lightweight Kubernetes).
    * **Topology:** Multi-Master HA (Embedded etcd datastore).
    * **Failure Tolerance:** Cluster survives the loss of any single node (N-1 redundancy).
* **Storage (CSI):** **Longhorn** (Distributed Block Storage).
    * **Replication:** Volume data is synchronously replicated across all 3 nodes.
    * **Access Mode:** `ReadWriteMany` (RWX) enabled for shared application data (e.g., Gitea repositories).

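The single-node failure tolerance stated above follows from majority quorum in etcd: a cluster of `n` members needs `n // 2 + 1` of them to agree. The following is a minimal sketch of that arithmetic; the function names are illustrative and not part of K3s or etcd.

```python
def etcd_quorum(members: int) -> int:
    """Smallest number of members that still forms a majority."""
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - etcd_quorum(members)


# Compare a few cluster sizes; 3 is the size used by ThreePhoenix.
for n in (1, 2, 3, 5):
    print(f"members={n} quorum={etcd_quorum(n)} tolerated_failures={failure_tolerance(n)}")
```

With 3 members the quorum is 2, so exactly one node may fail. A 2-node cluster tolerates no failures, which is why three nodes is the smallest useful HA size.
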
### 3. Application Stack (The Standard Unit)

Every stateful service deployed to the cluster (e.g., Gitea) must adhere to this topology:

| Layer | Component | Configuration Strategy |
| --- | --- | --- |
| **Ingress** | Nginx Ingress | **SSL Termination** via Cert-Manager (Let's Encrypt). No ports exposed directly. |
| **Traffic** | ClusterIP | Internal-only communication. |
| **Routing** | Pgpool-II | **Load Balancing:** Reads (SELECT) distributed across all 3 nodes. Writes (INSERT, UPDATE, DELETE) sent to the Primary. |
| **Compute** | Stateless App | **ReplicaCount: 3**. Pod anti-affinity ensures one pod per physical node. |
| **Database** | PostgreSQL HA | **Repmgr Cluster:** 1 Primary, 2 Standbys. Asynchronous replication. |
| **Data** | Persistent Volume | **Longhorn StorageClass.** ReclaimPolicy: Retain (for safety) or Delete (if relying on Phoenix). |

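The "one pod per physical node" rule in the Compute row can be expressed as a required pod anti-affinity term keyed on the node hostname. The sketch below builds that fragment as a Python dictionary and prints it as JSON, which `kubectl` also accepts. The `app` label key and the `gitea` value are assumptions for illustration; the actual labels depend on how the workload is deployed.

```python
import json


def anti_affinity(app_label: str) -> dict:
    """Build an `affinity` fragment for a pod template spec.

    Assumes pods carry the label `app=<app_label>` (hypothetical label key).
    """
    return {
        "podAntiAffinity": {
            # "required" makes this a hard rule: a pod is not scheduled onto
            # a node that already runs a pod with the same label.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


print(json.dumps(anti_affinity("gitea"), indent=2))
```

Because the rule is hard, a fourth replica would stay `Pending` on a 3-node cluster. A soft rule (`preferredDuringSchedulingIgnoredDuringExecution`) trades that strictness for schedulability.
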
### 4. The "Phoenix" Automation Engine

A centralized **CronJob** (`phoenix-maintenance`) manages the lifecycle of stateful workloads.

* **Schedule:** Weekly (Sunday 03:00 UTC).
* **Cycle:** 3-Week Rotation.
    * **Week 1:** Destroy & Re-clone Standby Node B.
    * **Week 2:** Destroy & Re-clone Standby Node C.
    * **Week 3:** **Switchover Event.** Promote Standby B to Primary -> Destroy old Primary Node A.
* **Objective:** No database pod lives longer than 21 days.

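One way the weekly job could decide which step of the 3-week rotation to run is to count whole weeks since a fixed anchor Sunday and take the result modulo 3. This is a sketch under assumptions: `CYCLE_START` is a hypothetical anchor date, and the source does not specify how `phoenix-maintenance` tracks its position in the cycle.

```python
from datetime import date

# Cron expression for "Sunday 03:00". It is only UTC if the CronJob's
# time zone is UTC (assumption).
CRON_SCHEDULE = "0 3 * * 0"

# Hypothetical anchor: a Sunday on which "Week 1" of the first cycle ran.
CYCLE_START = date(2025, 1, 5)

ACTIONS = (
    "Week 1: destroy and re-clone Standby Node B",
    "Week 2: destroy and re-clone Standby Node C",
    "Week 3: switchover, promote Standby B, destroy old Primary Node A",
)


def phoenix_action(run_date: date) -> str:
    """Return the rotation step that applies on the given run date."""
    weeks_elapsed = (run_date - CYCLE_START).days // 7
    return ACTIONS[weeks_elapsed % 3]


for d in (date(2025, 1, 5), date(2025, 1, 12), date(2025, 1, 19), date(2025, 1, 26)):
    print(d.isoformat(), "->", phoenix_action(d))
```

Counting from a fixed anchor rather than using the ISO week number keeps the cycle continuous across year boundaries, since a year does not contain a multiple of three weeks.
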
---

### Appendix A: Acceptance Criteria (The Audit Checklist)

Use this checklist for your monthly/quarterly "Health Check." If any item fails, the system is deteriorating.

#### I. Infrastructure Integrity

* [ ] **Node Health:** All 3 nodes report `Ready` status in `kubectl get nodes`.
* [ ] **Distribution:** `kubectl get pods -o wide` confirms Gitea pods are running on 3 *different* physical nodes (Anti-Affinity is working).
* [ ] **Storage Sync:** Longhorn UI shows all volumes have "Healthy" status with **3 replicas**. No "Degraded" volumes allowed.

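The Node Health item can be automated by parsing the output of `kubectl get nodes -o json`. The sketch below assumes that JSON has already been loaded into a dictionary; the sample data and node names are made up for illustration.

```python
def not_ready_nodes(node_list: dict) -> list[str]:
    """Return the names of nodes whose `Ready` condition is not "True"."""
    bad = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            bad.append(node["metadata"]["name"])
    return bad


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
        {"metadata": {"name": "node-c"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    ]
}

print(not_ready_nodes(sample))
```

An empty list means the check passes. A node with a missing or `Unknown` Ready condition is treated as failing, which errs on the side of flagging problems.
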
#### II. Database & Persistence

* [ ] **Cluster State:** `kubectl exec <primary-pod> -- repmgr cluster show` lists exactly **1 Primary** and **2 Standbys**.
* [ ] **Replication Lag:** Lag is `< 1 second` for all standbys (visible in Grafana or Pgpool status).
* [ ] **Load Balancing:** Pgpool logs confirm `SELECT` queries are being routed to Standby nodes (verifies Read-Scaling is active).
* [ ] **Backup Validation:** A backup file exists in the external S3 bucket/location with a timestamp `< 24 hours` old. **Crucial:** File size is consistent with previous days.

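The Backup Validation item combines two conditions: freshness and a plausible file size. A minimal sketch of that logic follows. The 20% tolerance and the use of the median as the baseline are assumptions, since the checklist only says the size should be "consistent with previous days".

```python
from statistics import median


def backup_looks_healthy(age_hours: float, size_bytes: int,
                         previous_sizes: list[int],
                         tolerance: float = 0.2) -> bool:
    """Check that the newest backup is fresh and similar in size to earlier ones.

    `tolerance` is a hypothetical threshold: the allowed relative deviation
    from the median of the previous backup sizes.
    """
    if age_hours >= 24:
        return False  # older than the 24-hour limit in the checklist
    if not previous_sizes:
        return True  # nothing to compare against yet
    baseline = median(previous_sizes)
    return abs(size_bytes - baseline) <= tolerance * baseline


history = [1_000_000, 1_020_000, 1_050_000]
print(backup_looks_healthy(3, 1_060_000, history))   # fresh, similar size
print(backup_looks_healthy(3, 40_000, history))      # fresh, suspiciously small
print(backup_looks_healthy(30, 1_060_000, history))  # too old
```

A sudden drop in size is the case the checklist calls out as crucial: a truncated or empty dump still produces a file with a recent timestamp.
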
#### III. Security & Network

* [ ] **SSL Validity:** `git.yourdomain.com` certificate expires in `> 30 days`.
* [ ] **Port Scan:** Running `nmap` against the public IP reveals **ONLY** ports 80 (HTTP) and 443 (HTTPS). Database ports (5432) must be `Closed`/`Filtered`.
* [ ] **Ingress Check:** Accessing the application via HTTP automatically redirects to HTTPS (301 or 308 redirect; the community ingress-nginx controller issues 308 by default).

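The SSL Validity item reduces to comparing the certificate's `notAfter` field with the current time. The sketch below uses `ssl.cert_time_to_seconds` from the Python standard library to parse the date string in the format returned by `SSLSocket.getpeercert()`. Fetching the certificate from the host is left out, and the example date is illustrative.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate `notAfter` string.

    `not_after` uses the format "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


def certificate_ok(not_after: str, now: datetime, min_days: int = 30) -> bool:
    """Apply the checklist rule: more than 30 days of validity left."""
    return days_until_expiry(not_after, now) > min_days


now = datetime(2018, 1, 1, tzinfo=timezone.utc)
print(round(days_until_expiry("Jan  5 09:34:43 2018 GMT", now), 2))
print(certificate_ok("Jan  5 09:34:43 2018 GMT", now))
```

Passing `now` explicitly keeps the function deterministic for testing; a real check would call it with `datetime.now(timezone.utc)`.
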
#### IV. Phoenix Mechanics

* [ ] **Job History:** `kubectl get jobs` shows the last `phoenix-maintenance` job as complete (`COMPLETIONS` of `1/1`), not `Failed`.
* [ ] **Pod Age:** No `postgresql` pod has an "Age" greater than **22 days**. (If one is 170 days old, the automation is broken).

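The Pod Age item can be checked from the `creationTimestamp` field in `kubectl get pods -o json` output. The following sketch assumes that output is already loaded as a dictionary; the pod names and timestamps in the sample are invented.

```python
from datetime import datetime, timedelta, timezone


def stale_pods(pod_list: dict, now: datetime, max_age_days: int = 22) -> list[str]:
    """Return names of pods older than `max_age_days`."""
    limit = timedelta(days=max_age_days)
    stale = []
    for pod in pod_list.get("items", []):
        # Kubernetes timestamps look like "2025-01-05T03:00:00Z" (UTC).
        created = datetime.strptime(
            pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > limit:
            stale.append(pod["metadata"]["name"])
    return stale


# Hypothetical sample: one pod 10 days old, one 170 days old.
now = datetime(2025, 6, 30, 3, 0, tzinfo=timezone.utc)
sample_pods = {
    "items": [
        {"metadata": {"name": "postgresql-0",
                      "creationTimestamp": "2025-06-20T03:00:00Z"}},
        {"metadata": {"name": "postgresql-1",
                      "creationTimestamp": "2025-01-11T03:00:00Z"}},
    ]
}

print(stale_pods(sample_pods, now))
```

The 22-day limit leaves one day of slack over the 21-day objective, so a rotation that runs slightly late does not trigger a false alarm.
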
#### V. Disaster Recovery Drill (Quarterly)

* [ ] **The "Kill" Test:** Manually delete a Gitea Pod.
    * *Pass Criteria:* Site remains accessible (via other 2 pods). New pod spawns and joins within 2 minutes.
* [ ] **The "Restore" Test:** Restore the database backup to a *test* namespace.
    * *Pass Criteria:* You can log in and see the latest repositories.