Compare commits
10 commits
688724f08b...c65e56acf1

| Author | SHA1 | Date |
|---|---|---|
| | c65e56acf1 | |
| | 98b6618dbc | |
| | c398bf5027 | |
| | d10741fb0d | |
| | 037a71f355 | |
| | 9c55dfb02a | |
| | 84c005254d | |
| | 5ac713641d | |
| | adb758b6d6 | |
| | c3a95e93b4 | |

14 changed files with 622 additions and 81 deletions
@@ -20,7 +20,7 @@ Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run wa
 | Agent runtime | How to orient |
 | --- | --- |
 | **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=railiance-cluster` is for coordination, not secret vending |
-| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
+| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workplans; **still** use `warden route` for credential ownership |
 | **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
 
 ### Quick routing table
@@ -1,6 +1,6 @@
 ## First Session Protocol
 
-Triggered when `get_domain_summary("financials")` shows **no workstreams**.
+Triggered when `get_domain_summary("financials")` shows **no workplans**.
 The project is registered but work has not yet been structured.
 
 **Step 1 — Read, don't write**
@@ -11,27 +11,31 @@ The project is registered but work has not yet been structured.
 **Step 2 — Survey in-progress work**
 Look for TODOs, open branches, half-finished files. Note done vs. started but incomplete.
 
-**Step 3 — Propose workstreams to Bernd**
-Propose 1–3 workstreams — each a coherent strand, weeks to months, anchored to a
+**Step 3 — Propose workplans to Bernd**
+Propose 1–3 workplans — each a coherent strand, weeks to months, anchored to a
 roadmap phase. **Wait for approval before creating.**
 
-**Step 4 — Create workplan file first, then DB record (ADR-001)**
+**Step 4 — Write the workplan file; fix-consistency registers it (ADR-001)**
 ```
-workplans/RAIL-BS-WP-NNNN-<slug>.md ← write this first
+workplans/RAIL-BS-WP-NNNN-<slug>.md ← write this, commit it
 ```
-Then register in the hub:
-```
-create_workstream(topic_id="ca369340-a64e-442e-98f1-a4fa7dc74a38", title="...", owner="...", description="...")
-create_task(workstream_id="<id>", title="...", priority="high|medium|low")
+Then register by running the consistency check — do **not** call
+`create_workplan`/`create_task` (or legacy `create_workstream`) yourself;
+manual registration duplicates what C-06 creates from the file:
+```bash
+statehub fix-consistency --repo railiance-cluster
 ```
+C-06 creates the hub workplan + tasks and writes `state_hub_workstream_id` /
+`state_hub_task_id` back into the file (legacy field names, kept for
+compatibility — they hold workplan/task IDs).
 
 **Step 5 — Record the setup**
 ```
 add_progress_event(
-summary="First session: structured financials into N workstreams, M tasks",
+summary="First session: structured financials into N workplans, M tasks",
 event_type="milestone",
 topic_id="ca369340-a64e-442e-98f1-a4fa7dc74a38",
-detail={"workstreams": [...], "tasks_created": M}
+detail={"workplans": [...], "tasks_created": M}
 )
 ```
 
@@ -44,7 +44,7 @@ For each file with `status: ready`, `active`, or `blocked`, note pending
 
 **Step 4 — Present brief**
 
-1. **Active workstreams** for `financials` — title, task counts, blocking decisions
+1. **Active workplans** for `financials` — title, task counts, blocking decisions
 2. **Pending tasks** from `workplans/` + any `[repo:railiance-cluster]` hub tasks
 3. **Goal guidance** — if `goal_guidance` in summary:
 - `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
@@ -52,33 +52,42 @@ For each file with `status: ready`, `active`, or `blocked`, note pending
 4. **Suggested next action** — highest-priority open item
 5. **SBOM status** — flag if `last_sbom_at` is unset for this repo
 
-If no workstreams: follow First Session Protocol (`first-session.md`).
+If no workplans: follow First Session Protocol (`first-session.md`).
 
 **During work:** `record_decision()` · `add_progress_event()` · `resolve_decision()`
 
-> State Hub is a *read model*. Bootstrap tools (`create_workstream`, `create_task`)
-> are First Session Protocol only. Work structure belongs in repo files (ADR-001).
+> State Hub is a *read model*. **Never register workplans or tasks by hand**
+> (`create_workplan`, `create_task`, or the legacy `create_workstream`) — write
+> the workplan file in `workplans/` and run `fix-consistency`; its C-06 check
+> registers the workplan and its tasks in the hub and writes the IDs back into
+> the file. Manual registration creates duplicates the moment fix-consistency
+> runs. Work structure belongs in repo files (ADR-001).
+>
+> Terminology: "workstream" is the legacy name for workplan. Some API/frontmatter
+> field names keep it for compatibility (`state_hub_workstream_id`,
+> `workstream_id` params) — treat them as workplan IDs.
 
 **Session close:**
 With MCP tools:
 ```
-add_progress_event(summary="...", topic_id="ca369340-a64e-442e-98f1-a4fa7dc74a38", workstream_id="<uuid>")
+add_progress_event(summary="...", topic_id="ca369340-a64e-442e-98f1-a4fa7dc74a38", workplan_id="<uuid>")
 ```
 Without MCP tools:
 ```bash
 curl -s -X POST http://127.0.0.1:8000/progress/ \
 -H "Content-Type: application/json" \
--d '{"topic_id":"ca369340-a64e-442e-98f1-a4fa7dc74a38","workstream_id":"<uuid>","event_type":"note","summary":"what changed","author":"codex"}'
+-d '{"topic_id":"ca369340-a64e-442e-98f1-a4fa7dc74a38","workplan_id":"<uuid>","event_type":"note","summary":"what changed","author":"codex"}'
 ```
-If workplan files were modified, ensure the local copy is up to date first:
+If workplan files were modified, ensure the local copy is up to date first,
+then sync from the repo checkout:
 ```bash
-git -C <repo_path> pull --ff-only
-cd ~/state-hub && make fix-consistency REPO=railiance-cluster
+git pull --ff-only
+statehub fix-consistency
 ```
 For repos where implementation runs on a remote machine (e.g. CoulombCore),
-use the combined target which pulls before fixing:
+use the pull-before-fix mode from any shell with the State Hub CLI:
 ```bash
-cd ~/state-hub && make fix-consistency-remote REPO=railiance-cluster
+statehub fix-consistency --repo railiance-cluster --remote
 ```
 **C-15** (DB task ahead of file) is normal in multi-machine workflows — writeback
 will sync the file to match DB. **C-16** (repo behind remote) blocks all writes
@@ -5,7 +5,7 @@ ID prefix: `RAIL-BS-WP-`
 
 Work items originate as files in this repo **before** being registered in the hub.
 
-Canonical workplan/workstream frontmatter statuses are:
+Canonical workplan frontmatter statuses are:
 `proposed`, `ready`, `active`, `blocked`, `backlog`, `finished`, `archived`.
 Use `proposed` for a newly drafted plan, `ready` after review against current
 repo state, and `finished` when implementation is complete. `stalled` and
@@ -16,14 +16,15 @@ prefix: `YYMMDD-RAIL-BS-WP-NNNN-<slug>.md`. The frontmatter id remains
 unchanged; the prefix is only for quick visual reference.
 
 Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
-`workplans/ADHOC-YYYY-MM-DD.md`, workstream slug `adhoc-YYYY-MM-DD`, and task ids
+`workplans/ADHOC-YYYY-MM-DD.md`, workplan slug `adhoc-YYYY-MM-DD`, and task ids
 `ADHOC-YYYY-MM-DD-T01`, `T02`, etc. Use adhocs only for low-risk work completed
 directly. Promote anything requiring analysis, design, approval, dependencies, or
 multiple planned phases into a normal workplan.
 
 Ecosystem todos from other agents arrive as `[repo:railiance-cluster]` hub tasks —
-visible at session start. Pick one up by creating the workplan file, then registering
-the workstream.
+visible at session start. Pick one up by creating the workplan file, committing,
+and running `statehub fix-consistency` — C-06 registers the workplan in the hub.
+Never register by hand with `create_workplan`/`create_workstream`.
 
 Task blocks use this shape:
 
|
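The Ad Hoc naming convention in the hunk above can be sketched in a few lines of Python. This is only an illustration, not State Hub code: the helper name `adhoc_names` is hypothetical, and only the file, slug, and task-id patterns are taken from the text.

```python
from datetime import date


def adhoc_names(day: date, task_count: int) -> dict:
    """Build the Ad Hoc file path, workplan slug, and task ids for one day.

    Hypothetical helper: only the naming patterns come from the document.
    """
    stamp = day.isoformat()  # renders as YYYY-MM-DD
    return {
        "path": f"workplans/ADHOC-{stamp}.md",
        "slug": f"adhoc-{stamp}",
        # Task ids are numbered T01, T02, ... within the day's file.
        "task_ids": [f"ADHOC-{stamp}-T{n:02d}" for n in range(1, task_count + 1)],
    }


if __name__ == "__main__":
    print(adhoc_names(date(2026, 7, 2), 2))
```

Anything that needs analysis, approval, or several phases would not go through a helper like this; per the text it belongs in a normal workplan.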
@@ -37,4 +38,8 @@ state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
 Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
 blocked work and `cancel` for stopped work.
 
+Workplan frontmatter carries `state_hub_workstream_id` — a legacy field name
+kept for compatibility ("workstream" is the old term for workplan); it holds
+the hub workplan id and is written by fix-consistency. Do not edit or rename it.
+
 <!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->
|
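The task status vocabulary described above (`todo`, `progress`, `done`, plus `wait` and `cancel`) can be sketched as a small lookup. This is a hedged illustration: the function `next_status` is hypothetical, and treating `wait` and `cancel` as states with no automatic successor is an assumption, since the text does not say how tasks leave them.

```python
from typing import Optional

# Main progression named in the text: todo -> progress -> done.
PROGRESSION = ("todo", "progress", "done")
# Side states named in the text: waiting/blocked work and stopped work.
SIDE_STATES = ("wait", "cancel")
TASK_STATUSES = PROGRESSION + SIDE_STATES


def next_status(current: str) -> Optional[str]:
    """Return the next step in the main progression, or None if there is none.

    Assumption: `wait` and `cancel` have no automatic successor.
    """
    if current not in TASK_STATUSES:
        raise ValueError(f"unknown task status: {current!r}")
    if current in PROGRESSION[:-1]:
        return PROGRESSION[PROGRESSION.index(current) + 1]
    return None


if __name__ == "__main__":
    for status in TASK_STATUSES:
        print(status, "->", next_status(status))
```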
@@ -2,7 +2,7 @@
 # Custodian Brief — railiance-cluster
 
 **Domain:** financials
-**Last synced:** 2026-07-01 22:04 UTC
+**Last synced:** 2026-07-02 09:53 UTC
 **State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*
 
 ## Current Goal
@@ -11,39 +11,6 @@ Install k3s and Kubernetes Baseline on the HostEurope Server
 
 ## Active Workstreams
 
-### activity-core no-restart admin-sync smoke (ACTIVITY-WP-0012-T05)
-Progress: 0/1 done | workstream_id: `2c9e8e96-ec6a-433c-9e6d-0efbcd18679e`
-
-**Open tasks:**
-- ! Run the no-restart admin-sync smoke `60f3387d`
-
-### activity-core WP-0016 triage-output robustness deploy
-Progress: 0/4 done | workstream_id: `7cbbe0d6-fea9-41c6-840c-46d0d8e8edde`
-
-**Open tasks:**
-- · Deploy activity-core with coupled schema and executor `079e39a9`
-- · Update daily-statehub-wsjf-triage runtime-bundle Instruction `129fb472`
-- · Pull raw llm-connect response for the 2026-06-26 run `59559f1d`
-- · Acceptance smoke `8096621a`
-
-### activity-core WP-0016 triage-output robustness deploy
-Progress: no tasks done | workstream_id: `5032c55c-2ee2-4b7e-b1eb-157f0f8ac647`
-
-### activity-core WP-0016 triage-output robustness deploy
-Progress: 0/4 done | workstream_id: `f2ca1a5d-4dd6-42ea-8003-969c7265f891`
-
-**Open tasks:**
-- · Update daily-statehub-wsjf-triage runtime-bundle Instruction (RAIL-BS-WP-0008-T02) `2338d061`
-- · Deploy activity-core with coupled schema and executor (RAIL-BS-WP-0008-T01) `1ea0945a`
-- · Pull raw llm-connect response for 2026-06-26 run (RAIL-BS-WP-0008-T03) `b799917b`
-- · Acceptance smoke: daily-triage clean or graceful degrade (RAIL-BS-WP-0008-T04) `e267a366`
-
-### activity-core no-restart admin-sync smoke (ACTIVITY-WP-0012-T05)
-Progress: 0/1 done | workstream_id: `366eec46-3139-4810-ace6-ea75750fe821`
-
-**Open tasks:**
-- · Run no-restart admin-sync smoke with Temporal schedule verification (RAIL-BS-WP-0009-T01) `ffe665ce`
-
 ### ThreePhoenix - HA Cluster Implementation
 Progress: 0/7 done | workstream_id: `9e208376-23f1-40c7-9813-fac1f7d6ad3b`
 
29
.forgejo/workflows/ci-smoke.yaml
Normal file
@@ -0,0 +1,29 @@
+# Canonical CI smoke template (tier 1 routing drill).
+# Copy to: .forgejo/workflows/ci-smoke.yaml in consumer repos.
+name: CI Smoke
+
+on:
+  push:
+    branches:
+      - main
+  workflow_dispatch:
+
+jobs:
+  host-smoke:
+    runs-on: self-hosted
+    steps:
+      - name: Routing probe (host runner)
+        run: |
+          set -eu
+          echo "repository=${GITHUB_REPOSITORY:-unknown}"
+          echo "sha=${GITHUB_SHA:-unknown}"
+          echo "runner=${RUNNER_NAME:-unknown}"
+          uname -a
+
+  container-smoke:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Routing probe (container label)
+        run: |
+          set -eu
+          echo "container-smoke ok for ${GITHUB_REPOSITORY:-unknown}"
26
AGENTS.md
@@ -20,6 +20,12 @@ there is no MCP server for Codex agents.
 |---------|-----|
 | Local workstation | `http://127.0.0.1:8000` |
 | Remote via tunnel | `http://127.0.0.1:18000` |
+| Optional local edge relay | http://127.0.0.1:18080 |
+
+When an operator has enabled the edge relay, set API_BASE to the relay URL.
+Queueable writes return an explicit queued receipt if the central hub is
+unreachable. Treat that as pending local evidence, then ask the operator to run
+statehub outbox status/replay after connectivity returns.
 
 ### Orient at session start
 
@@ -27,8 +33,8 @@ there is no MCP server for Codex agents.
 # Offline brief — works without hub connection
 cat .custodian-brief.md
 
-# Active workstreams for this domain
-curl -s "http://127.0.0.1:8000/workstreams/?topic_id=ca369340-a64e-442e-98f1-a4fa7dc74a38&status=active" \
+# Active workplans for this domain
+curl -s "http://127.0.0.1:8000/workplans/?topic_id=ca369340-a64e-442e-98f1-a4fa7dc74a38&status=active" \
 | python3 -m json.tool
 
 # Check inbox
@@ -51,12 +57,12 @@ curl -s -X POST http://127.0.0.1:8000/progress/ \
 "summary": "what was done",
 "event_type": "note",
 "author": "codex",
-"workstream_id": "<uuid>",
+"workplan_id": "<uuid>",
 "task_id": "<uuid>"
 }'
 ```
 
-Omit `workstream_id` / `task_id` when not applicable.
+Omit `workplan_id` / `task_id` when not applicable.
 
 ### Update task status
 
|
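The rule "omit `workplan_id` / `task_id` when not applicable" can be sketched as a small payload builder. This is a minimal illustration under assumptions: the function name `build_progress_payload` is hypothetical, the field names are the ones shown in the `POST /progress/` example above, and nothing is sent over the network.

```python
import json
from typing import Optional


def build_progress_payload(
    summary: str,
    author: str = "codex",
    event_type: str = "note",
    workplan_id: Optional[str] = None,
    task_id: Optional[str] = None,
) -> dict:
    """Assemble a progress-event body, leaving out IDs that do not apply.

    Hypothetical helper; field names mirror the curl example in the diff.
    """
    payload = {"summary": summary, "event_type": event_type, "author": author}
    # Optional IDs are added only when set, so the keys are absent otherwise.
    if workplan_id:
        payload["workplan_id"] = workplan_id
    if task_id:
        payload["task_id"] = task_id
    return payload


if __name__ == "__main__":
    print(json.dumps(build_progress_payload("what was done"), indent=2))
```

Leaving the keys out entirely, rather than sending empty strings, matches the wording of the instruction; whether the hub would also accept null values is not stated in the source.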
@@ -80,7 +86,7 @@ curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
 ## Session Protocol
 
 **Start:**
-1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
+1. `cat .custodian-brief.md` — domain goal and open workplans (offline-safe)
 2. Check inbox: `GET /messages/?to_agent=railiance-cluster&unread_only=true`; mark read
 3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
 4. Check human-needed tasks: `GET /tasks/?needs_human=true`
@@ -92,12 +98,12 @@ curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
 **Close:**
 1. Update workplan file task statuses to reflect progress
 2. Log: `POST /progress/` with a summary of what changed
-3. Note for the custodian operator: after workplan file changes, run from
-`~/state-hub`:
+3. After workplan file changes, run:
 ```bash
-make fix-consistency REPO=railiance-cluster
+statehub fix-consistency
 ```
-This syncs task status from files into the hub DB.
+Coding agents should run this directly; ask the operator only if the CLI or
+State Hub API is unavailable. This syncs task status from files into the hub DB.
 
 ---
 
@@ -123,7 +129,7 @@ Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run wa
 | Agent runtime | How to orient |
 | --- | --- |
 | **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=railiance-cluster` is for coordination, not secret vending |
-| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
+| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workplans; **still** use `warden route` for credential ownership |
 | **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
 
 ### Quick routing table
8
Makefile
@@ -30,6 +30,12 @@ verify-activity-core: ## Reconcile activity-core runtime and verify disabled ops
 reconcile-activity-core-llm-connect: ## Reconcile activity-core llm-connect URL and run non-secret gate checks
 	tools/cmd/railiance-reconcile-activity-core-llm-connect
 
+deploy-activity-core-triage-robustness: ## Deploy ACTIVITY-WP-0016 bundle and prove daily-triage output validation
+	tools/cmd/railiance-deploy-activity-core-triage-robustness
+
+admin-sync-smoke: ## Run activity-core no-restart POST /admin/sync smoke
+	tools/cmd/railiance-admin-sync-smoke
+
 ##@ Help
 
 help: ## Show this help
@@ -37,4 +43,4 @@ help: ## Show this help
 /^[a-zA-Z_-]+:.*?##/ { printf " \033[36m%-20s\033[0m %s\n", $$1, $$2 } \
 /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) }' $(MAKEFILE_LIST)
 
-.PHONY: backup restore preflight k3s-install smoke test-ha-failover verify-activity-core reconcile-activity-core-llm-connect help
+.PHONY: backup restore preflight k3s-install smoke test-ha-failover verify-activity-core reconcile-activity-core-llm-connect deploy-activity-core-triage-robustness admin-sync-smoke help
@@ -22,6 +22,10 @@ Commands:
 observe Plan/run Stage 2 observation checks
 promote Plan/apply Stage 3 stable promotion
 rollback Plan/apply rollback to previous stable
+deploy-triage-robustness
+Deploy ACTIVITY-WP-0016 and prove daily-triage validation
+admin-sync-smoke
+Run activity-core no-restart POST /admin/sync smoke
 build-spore Build a distributable "Spore" bundle
 seed-local Run the seed script on this machine
 checklist Pre-VM checklist
@@ -51,6 +55,8 @@ case "$cmd" in
 observe) exec railiance-stage2 observe "$@" ;;
 promote) exec railiance-stage3 promote "$@" ;;
 rollback) exec railiance-stage3 rollback "$@" ;;
+deploy-triage-robustness) exec railiance-deploy-activity-core-triage-robustness "$@" ;;
+admin-sync-smoke) exec railiance-admin-sync-smoke "$@" ;;
 build-spore) bash "$ROOT/tools/build_spore.sh" ;;
 seed-local) bash "$ROOT/tools/seed_node.sh" ;;
 checklist)
@@ -21,6 +21,8 @@ mode are denied these by the permission classifier — that is intentional.
 | `make test-ha-failover` | kills the primary PG pod to assert recovery |
 | `make verify-activity-core` | reconciles activity-core runtime on railiance01 |
 | `make reconcile-activity-core-llm-connect` | patches ConfigMap, applies llm-connect overlay, runs smoke pod |
+| `make deploy-activity-core-triage-robustness` | deploys ACTIVITY-WP-0016 code/schema/runtime as a coupled bundle and triggers daily triage |
+| `make admin-sync-smoke` | calls activity-core `POST /admin/sync` and proves worker pod identity/restart count did not change |
 
 ## Read-only / safe targets
 
@@ -33,3 +35,8 @@ Reconcile/verify targets post non-secret evidence notes to the State Hub
 (`STATE_HUB_EVIDENCE_WORKSTREAM_ID` / `STATE_HUB_EVIDENCE_TASK_ID` env vars
 attach them to a workstream/task). Never record Secret values — key counts
 and readiness states only.
+
+For `make admin-sync-smoke`, set `ACTIVITY_CORE_ADMIN_SYNC_FIXTURE_COMMAND`
+when you need a specific enabled-flip/rename fixture before the sync call. The
+command records whether a fixture ran; leaving it unset proves endpoint and
+no-restart behavior only.
||||
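The fixture behaviour described for `make admin-sync-smoke` can be summarised as a three-way decision. The sketch below is a Python restatement of that gate, assuming the two environment variables shown in this compare (`ACTIVITY_CORE_ADMIN_SYNC_FIXTURE_COMMAND` and `ACTIVITY_CORE_ADMIN_SYNC_REQUIRE_FIXTURE`); the function name `resolve_fixture_status` is hypothetical, and it does not execute the fixture command, it only reports which status the run would record.

```python
def resolve_fixture_status(env: dict) -> str:
    """Return the fixture status a smoke run would record for this environment.

    Hypothetical helper mirroring the gate in the smoke script:
    - command set          -> "ran" (the real script runs it on the cluster first)
    - unset, require == 1  -> abort with exit code 2
    - unset otherwise      -> "skipped" (endpoint and no-restart proof only)
    """
    command = env.get("ACTIVITY_CORE_ADMIN_SYNC_FIXTURE_COMMAND", "")
    require = env.get("ACTIVITY_CORE_ADMIN_SYNC_REQUIRE_FIXTURE", "0")
    if command:
        return "ran"
    if require == "1":
        raise SystemExit(2)
    return "skipped"


if __name__ == "__main__":
    print(resolve_fixture_status({}))
```

A run that must prove an enabled-flip or rename would therefore set both variables, so that a missing command fails loudly instead of silently recording `skipped`.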
155
tools/cmd/railiance-admin-sync-smoke
Executable file
@@ -0,0 +1,155 @@
+#!/usr/bin/env bash
+# Prove POST /admin/sync works without restarting the activity-core worker.
+set -euo pipefail
+
+NAMESPACE="${ACTIVITY_CORE_NAMESPACE:-activity-core}"
+CLUSTER_HOST="${ACTIVITY_CORE_CLUSTER_HOST:-railiance01}"
+STATE_HUB_URL="${STATE_HUB_URL:-http://127.0.0.1:8000}"
+ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL="${ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL:-0}"
+ACTIVITY_CORE_ADMIN_SYNC_FIXTURE_COMMAND="${ACTIVITY_CORE_ADMIN_SYNC_FIXTURE_COMMAND:-}"
+ACTIVITY_CORE_ADMIN_SYNC_REQUIRE_FIXTURE="${ACTIVITY_CORE_ADMIN_SYNC_REQUIRE_FIXTURE:-0}"
+EVIDENCE_WORKSTREAM_ID="${STATE_HUB_EVIDENCE_WORKSTREAM_ID:-2c9e8e96-ec6a-433c-9e6d-0efbcd18679e}"
+EVIDENCE_TASK_ID="${STATE_HUB_EVIDENCE_TASK_ID:-60f3387d-3d14-42a9-b8a3-725a86468510}"
+
+STARTED_AT="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
+CURRENT_GATE=startup
+BEFORE_JSON=""
+AFTER_JSON=""
+FIXTURE_STATUS=skipped
+SYNC_RESPONSE_JSON=""
+EVIDENCE_NOTE_JSON=""
+
+export NAMESPACE CLUSTER_HOST STATE_HUB_URL EVIDENCE_WORKSTREAM_ID EVIDENCE_TASK_ID
+export STARTED_AT BEFORE_JSON AFTER_JSON FIXTURE_STATUS SYNC_RESPONSE_JSON
+
+log() { printf '[activity-core-admin-sync-smoke] %s\n' "$*"; }
+quote() { printf '%q' "$1"; }
+cluster_bash() { if [[ -n "$CLUSTER_HOST" ]]; then ssh "$CLUSTER_HOST" "bash -s" <<<"$1"; else bash -s <<<"$1"; fi; }
+
+post_evidence() {
+  local status="$1" failing_gate="${2:-}"
+  export EVIDENCE_STATUS="$status" FAILING_GATE="$failing_gate"
+  python3 - <<'PY'
+import json, os, sys, urllib.request
+
+def env_json(name):
+    raw = os.environ.get(name, "")
+    if not raw:
+        return None
+    try:
+        return json.loads(raw)
+    except json.JSONDecodeError:
+        return {"raw": raw}
+
+status = os.environ["EVIDENCE_STATUS"]
+failing_gate = os.environ.get("FAILING_GATE") or None
+detail = {
+    "producer": "railiance-cluster",
+    "verification": "activity-core no-restart admin sync smoke",
+    "status": status,
+    "failing_gate": failing_gate,
+    "cluster_host": os.environ.get("CLUSTER_HOST") or "local-kubectl",
+    "namespace": os.environ.get("NAMESPACE"),
+    "worker_before": env_json("BEFORE_JSON"),
+    "worker_after": env_json("AFTER_JSON"),
+    "fixture_status": os.environ.get("FIXTURE_STATUS"),
+    "sync_response": env_json("SYNC_RESPONSE_JSON"),
+    "started_at": os.environ.get("STARTED_AT"),
+}
+summary = (
+    "Railiance activity-core no-restart admin-sync smoke passed: POST /admin/sync returned expected counters and worker pod identity/restart count stayed stable."
+    if status == "passed"
+    else "Railiance activity-core no-restart admin-sync smoke failed" + (f" at {failing_gate}" if failing_gate else "") + "; see non-secret evidence detail."
+)
+payload = {"summary": summary, "event_type": "note", "author": "railiance-cluster", "detail": detail}
+if os.environ.get("EVIDENCE_WORKSTREAM_ID"):
+    payload["workstream_id"] = os.environ["EVIDENCE_WORKSTREAM_ID"]
+if os.environ.get("EVIDENCE_TASK_ID"):
+    payload["task_id"] = os.environ["EVIDENCE_TASK_ID"]
+req = urllib.request.Request(os.environ["STATE_HUB_URL"].rstrip("/") + "/progress/", data=json.dumps(payload).encode(), headers={"Content-Type": "application/json"}, method="POST")
+with urllib.request.urlopen(req, timeout=20) as resp:
+    sys.stdout.write(resp.read().decode())
+PY
+}
+
+on_error() { local code=$?; trap - ERR; post_evidence failed "$CURRENT_GATE" >/dev/null || true; exit "$code"; }
+trap on_error ERR
+
+if [[ "$CLUSTER_HOST" == local ]]; then
+  [[ "$ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL" == 1 ]] || { echo 'ACTIVITY_CORE_CLUSTER_HOST=local requires ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL=1' >&2; exit 2; }
+  CLUSTER_HOST=""
+fi
+export CLUSTER_HOST
+
+CURRENT_GATE='cluster executor preflight'
+log "using cluster executor: ${CLUSTER_HOST:-local kubectl}"
+cluster_bash 'set -euo pipefail; command -v kubectl >/dev/null; command -v python3 >/dev/null'
+
+worker_snapshot_script='import json,sys
+items=json.load(sys.stdin).get("items",[])
+if not items: raise SystemExit("no actcore-worker pods found")
+pod=sorted(items,key=lambda item:item["metadata"]["name"])[0]
+container=pod["status"]["containerStatuses"][0]
+print(json.dumps({"name":pod["metadata"]["name"],"uid":pod["metadata"]["uid"],"phase":pod["status"].get("phase"),"restart_count":container.get("restartCount",0),"image":container.get("image"),"image_id":container.get("imageID")}, sort_keys=True))'
+
+CURRENT_GATE='worker baseline capture'
+BEFORE_JSON="$(cluster_bash "kubectl -n $(quote "$NAMESPACE") get pod -l app.kubernetes.io/name=actcore-worker -o json | python3 -c $(quote "$worker_snapshot_script")")"
+export BEFORE_JSON
+
+CURRENT_GATE='admin sync fixture'
+if [[ -n "$ACTIVITY_CORE_ADMIN_SYNC_FIXTURE_COMMAND" ]]; then
+  log 'running operator-supplied fixture command'
+  cluster_bash "$ACTIVITY_CORE_ADMIN_SYNC_FIXTURE_COMMAND"
+  FIXTURE_STATUS=ran
+elif [[ "$ACTIVITY_CORE_ADMIN_SYNC_REQUIRE_FIXTURE" == 1 ]]; then
+  echo 'ACTIVITY_CORE_ADMIN_SYNC_REQUIRE_FIXTURE=1 but no fixture command was supplied' >&2
+  exit 2
+else
+  FIXTURE_STATUS=skipped
+fi
+export FIXTURE_STATUS
+
+CURRENT_GATE='POST /admin/sync'
+log 'calling POST /admin/sync?definitions=true&schedules=true'
+SYNC_RESPONSE_JSON="$(
+  cluster_bash "$(cat <<EOF
+set -euo pipefail
+kubectl -n $(quote "$NAMESPACE") exec -i deploy/actcore-api -- python - <<'PY'
+import json, urllib.request
+req = urllib.request.Request('http://localhost:8010/admin/sync?definitions=true&schedules=true', method='POST')
+with urllib.request.urlopen(req, timeout=60) as resp:
+    payload = json.loads(resp.read().decode())
+required = [('definitions','synced'),('schedules','upserted'),('schedules','paused'),('schedules','deleted_orphans'),('errors',None)]
+for section, key in required:
+    if section not in payload:
+        raise SystemExit(f'missing sync response section {section!r}')
+    if key is not None and key not in payload[section]:
+        raise SystemExit(f'missing sync response key {section}.{key}')
+if payload.get('errors'):
+    raise SystemExit('admin sync returned errors: ' + json.dumps(payload['errors']))
+print(json.dumps(payload, sort_keys=True))
+PY
+EOF
+)"
+)"
+export SYNC_RESPONSE_JSON
+
+CURRENT_GATE='worker no-restart verification'
+AFTER_JSON="$(cluster_bash "kubectl -n $(quote "$NAMESPACE") get pod -l app.kubernetes.io/name=actcore-worker -o json | python3 -c $(quote "$worker_snapshot_script")")"
+python3 - <<'PY'
+import json, os
+before = json.loads(os.environ['BEFORE_JSON'])
+after = json.loads(os.environ['AFTER_JSON'])
+if before['uid'] != after['uid']:
+    raise SystemExit(f"worker pod changed uid: {before['uid']} -> {after['uid']}")
+if before['restart_count'] != after['restart_count']:
+    raise SystemExit(f"worker restart count changed: {before['restart_count']} -> {after['restart_count']}")
+PY
+export AFTER_JSON
+
+CURRENT_GATE='State Hub evidence note'
+log 'posting non-secret evidence note to State Hub'
+EVIDENCE_NOTE_JSON="$(post_evidence passed '')"
+trap - ERR
+log 'verification passed'
+printf '%s\n' "$EVIDENCE_NOTE_JSON"
263
tools/cmd/railiance-deploy-activity-core-triage-robustness
Executable file
263
tools/cmd/railiance-deploy-activity-core-triage-robustness
Executable file
|
|
@ -0,0 +1,263 @@
|
|||
#!/usr/bin/env bash
|
||||
# Deploy ACTIVITY-WP-0016 code/schema/runtime together and prove daily-triage output.
|
||||
set -euo pipefail
|
||||
|
||||
NAMESPACE="${ACTIVITY_CORE_NAMESPACE:-activity-core}"
|
||||
CLUSTER_HOST="${ACTIVITY_CORE_CLUSTER_HOST:-railiance01}"
|
||||
STATE_HUB_URL="${STATE_HUB_URL:-http://127.0.0.1:8000}"
|
||||
ACTIVITY_CORE_REPO="${ACTIVITY_CORE_REPO:-/home/worsch/activity-core}"
|
||||
ACTIVITY_CORE_REMOTE_REPO="${ACTIVITY_CORE_REMOTE_REPO:-}"
|
||||
ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL="${ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL:-0}"
|
||||
ACTIVITY_CORE_SYNC_RUNTIME_BUNDLE="${ACTIVITY_CORE_SYNC_RUNTIME_BUNDLE:-auto}"
|
||||
ACTIVITY_CORE_RESTART_DEPLOYMENTS="${ACTIVITY_CORE_RESTART_DEPLOYMENTS:-1}"
|
||||
REQUIRED_ACTIVITY_CORE_REV="${REQUIRED_ACTIVITY_CORE_REV:-bf877b7}"
|
||||
DAILY_TRIAGE_DEFINITION_SLUG="${DAILY_TRIAGE_DEFINITION_SLUG:-daily-statehub-wsjf-triage}"
|
||||
STATE_HUB_PROGRESS_TIMEOUT_SECONDS="${STATE_HUB_PROGRESS_TIMEOUT_SECONDS:-240}"
|
||||
STATE_HUB_PROGRESS_POLL_SECONDS="${STATE_HUB_PROGRESS_POLL_SECONDS:-5}"
|
||||
EVIDENCE_WORKSTREAM_ID="${STATE_HUB_EVIDENCE_WORKSTREAM_ID:-7cbbe0d6-fea9-41c6-840c-46d0d8e8edde}"
|
||||
EVIDENCE_TASK_ID="${STATE_HUB_EVIDENCE_TASK_ID:-8096621a-54ee-4be5-943e-5dc2da19ed28}"
|
||||
|
||||
STARTED_AT="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
|
||||
CURRENT_GATE=startup
|
||||
REMOTE_REVISION=""
|
||||
CONTRACT_JSON=""
|
||||
API_IMAGE=""
|
||||
API_IMAGE_ID=""
|
||||
WORKER_IMAGE=""
|
||||
WORKER_IMAGE_ID=""
|
||||
SYNC_STATUS_JSON=""
|
||||
TRIGGER_JSON=""
|
||||
DEFINITION_ID=""
|
||||
TRIGGER_KEY=""
|
||||
EXPECTED_RUN_ID=""
|
||||
PROGRESS_JSON=""
|
||||
|
||||
export NAMESPACE CLUSTER_HOST STATE_HUB_URL ACTIVITY_CORE_REMOTE_REPO REQUIRED_ACTIVITY_CORE_REV
|
||||
export DAILY_TRIAGE_DEFINITION_SLUG STARTED_AT EVIDENCE_WORKSTREAM_ID EVIDENCE_TASK_ID
|
||||
export STATE_HUB_PROGRESS_TIMEOUT_SECONDS STATE_HUB_PROGRESS_POLL_SECONDS
|
||||
export REMOTE_REVISION CONTRACT_JSON API_IMAGE API_IMAGE_ID WORKER_IMAGE WORKER_IMAGE_ID
|
||||
export SYNC_STATUS_JSON TRIGGER_JSON DEFINITION_ID TRIGGER_KEY EXPECTED_RUN_ID PROGRESS_JSON
|
||||
|
||||
log() { printf '[activity-core-triage-robustness] %s\n' "$*"; }
|
||||
quote() { printf '%q' "$1"; }
|
||||
cluster_bash() { if [[ -n "$CLUSTER_HOST" ]]; then ssh "$CLUSTER_HOST" "bash -s" <<<"$1"; else bash -s <<<"$1"; fi; }
|
||||
|
||||
should_sync_runtime_bundle() {
|
||||
case "$ACTIVITY_CORE_SYNC_RUNTIME_BUNDLE" in
|
||||
1|true|yes) return 0 ;;
|
||||
0|false|no) return 1 ;;
|
||||
auto) [[ -n "$CLUSTER_HOST" && -d "$ACTIVITY_CORE_REPO/k8s/railiance" ]]; return ;;
|
||||
*) printf 'invalid ACTIVITY_CORE_SYNC_RUNTIME_BUNDLE=%s\n' "$ACTIVITY_CORE_SYNC_RUNTIME_BUNDLE" >&2; exit 2 ;;
|
||||
esac
|
||||
}
|
||||
|
||||
post_evidence() {
|
||||
local status="$1" failing_gate="${2:-}"
|
||||
export EVIDENCE_STATUS="$status" FAILING_GATE="$failing_gate"
|
||||
python3 - <<'PY'
|
||||
import json, os, sys, urllib.request

def env_json(name):
    raw = os.environ.get(name, "")
    if not raw:
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"raw": raw}

status = os.environ["EVIDENCE_STATUS"]
failing_gate = os.environ.get("FAILING_GATE") or None
detail = {
    "producer": "railiance-cluster",
    "verification": "activity-core WP-0016 coupled deploy and daily-triage smoke",
    "status": status,
    "failing_gate": failing_gate,
    "cluster_host": os.environ.get("CLUSTER_HOST") or "local-kubectl",
    "namespace": os.environ.get("NAMESPACE"),
    "activity_core_repo": os.environ.get("ACTIVITY_CORE_REMOTE_REPO"),
    "required_activity_core_revision": os.environ.get("REQUIRED_ACTIVITY_CORE_REV"),
    "activity_core_revision": os.environ.get("REMOTE_REVISION") or None,
    "runtime_bundle": "k8s/railiance/20-runtime.yaml",
    "runtime_contract": env_json("CONTRACT_JSON"),
    "sync_job": env_json("SYNC_STATUS_JSON"),
    "api_image": os.environ.get("API_IMAGE") or None,
    "api_image_id": os.environ.get("API_IMAGE_ID") or None,
    "worker_image": os.environ.get("WORKER_IMAGE") or None,
    "worker_image_id": os.environ.get("WORKER_IMAGE_ID") or None,
    "definition_slug": os.environ.get("DAILY_TRIAGE_DEFINITION_SLUG"),
    "definition_id": os.environ.get("DEFINITION_ID") or None,
    "manual_trigger": env_json("TRIGGER_JSON"),
    "expected_activity_core_run_id": os.environ.get("EXPECTED_RUN_ID") or None,
    "state_hub_progress": env_json("PROGRESS_JSON"),
    "started_at": os.environ.get("STARTED_AT"),
}
summary = (
    "Railiance activity-core WP-0016 deploy/smoke passed: code/schema and bounded runtime contract were reconciled together, daily triage was triggered, and State Hub recorded schema-valid output."
    if status == "passed"
    else "Railiance activity-core WP-0016 deploy/smoke failed" + (f" at {failing_gate}" if failing_gate else "") + "; see non-secret evidence detail."
)
payload = {"summary": summary, "event_type": "note", "author": "railiance-cluster", "detail": detail}
if os.environ.get("EVIDENCE_WORKSTREAM_ID"):
    payload["workstream_id"] = os.environ["EVIDENCE_WORKSTREAM_ID"]
if os.environ.get("EVIDENCE_TASK_ID"):
    payload["task_id"] = os.environ["EVIDENCE_TASK_ID"]
req = urllib.request.Request(os.environ["STATE_HUB_URL"].rstrip("/") + "/progress/", data=json.dumps(payload).encode(), headers={"Content-Type": "application/json"}, method="POST")
with urllib.request.urlopen(req, timeout=20) as resp:
    sys.stdout.write(resp.read().decode())
|
||||
PY
|
||||
}
|
||||
|
||||
on_error() { local code=$?; trap - ERR; post_evidence failed "$CURRENT_GATE" >/dev/null || true; exit "$code"; }
|
||||
trap on_error ERR
|
||||
|
||||
if [[ "$CLUSTER_HOST" == local ]]; then
|
||||
[[ "$ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL" == 1 ]] || { echo 'ACTIVITY_CORE_CLUSTER_HOST=local requires ACTIVITY_CORE_ALLOW_LOCAL_KUBECTL=1' >&2; exit 2; }
|
||||
CLUSTER_HOST=""
|
||||
fi
|
||||
if [[ -z "$ACTIVITY_CORE_REMOTE_REPO" ]]; then
|
||||
if [[ -n "$CLUSTER_HOST" ]]; then ACTIVITY_CORE_REMOTE_REPO="$(ssh "$CLUSTER_HOST" pwd)/activity-core"; else ACTIVITY_CORE_REMOTE_REPO="$ACTIVITY_CORE_REPO"; fi
|
||||
fi
|
||||
export CLUSTER_HOST ACTIVITY_CORE_REMOTE_REPO
|
||||
|
||||
CURRENT_GATE='cluster executor preflight'
|
||||
log "using cluster executor: ${CLUSTER_HOST:-local kubectl}"
|
||||
cluster_bash 'set -euo pipefail; command -v kubectl >/dev/null; command -v python3 >/dev/null'
|
||||
|
||||
CURRENT_GATE='runtime bundle sync'
|
||||
if should_sync_runtime_bundle; then
|
||||
log "syncing runtime bundle to ${CLUSTER_HOST}:${ACTIVITY_CORE_REMOTE_REPO}/k8s/railiance"
|
||||
ssh "$CLUSTER_HOST" "mkdir -p $(quote "$ACTIVITY_CORE_REMOTE_REPO")/k8s/railiance"
|
||||
rsync -a --delete "$ACTIVITY_CORE_REPO/k8s/railiance/" "${CLUSTER_HOST}:${ACTIVITY_CORE_REMOTE_REPO}/k8s/railiance/"
|
||||
fi
|
||||
|
||||
CURRENT_GATE='activity-core revision gate'
|
||||
REMOTE_REVISION="$(cluster_bash "set -euo pipefail; git -C $(quote "$ACTIVITY_CORE_REMOTE_REPO") rev-parse --short HEAD; git -C $(quote "$ACTIVITY_CORE_REMOTE_REPO") merge-base --is-ancestor $(quote "$REQUIRED_ACTIVITY_CORE_REV") HEAD")"
|
||||
export REMOTE_REVISION
|
||||
|
||||
CURRENT_GATE='runtime contract gate'
|
||||
CONTRACT_JSON="$(
|
||||
cluster_bash "$(cat <<EOF
|
||||
set -euo pipefail
|
||||
python3 - $(quote "$ACTIVITY_CORE_REMOTE_REPO")/k8s/railiance/20-runtime.yaml <<'PY'
|
||||
import json, re, sys
text = open(sys.argv[1], encoding='utf-8').read()
lower = text.lower()
max_tokens = [int(v) for v in re.findall(r"max_tokens\s*[:=]\s*['\"]?(\d+)", text)]
checks = {
    'mentions_daily_instruction': 'daily-statehub-wsjf-triage' in lower,
    'bounded_top_7': bool(re.search(r'(top[- ]?7|<=\s*7|≤\s*7|at most\s+7|no more than\s+7)', lower)),
    'fewer_well_formed': 'fewer well-formed' in lower,
    'ndjson_or_line_framing': 'ndjson' in lower or 'one recommendation json object per line' in lower,
    'max_tokens_headroom': bool(max_tokens and max(max_tokens) >= 1800),
}
missing = [name for name, ok in checks.items() if not ok]
print(json.dumps({'path': sys.argv[1], 'max_tokens': max_tokens, 'checks': checks, 'missing': missing}, sort_keys=True))
if missing:
    raise SystemExit('runtime bundle contract checks failed: ' + ', '.join(missing))
|
||||
PY
|
||||
EOF
|
||||
)"
|
||||
)"
|
||||
export CONTRACT_JSON
|
||||
|
||||
CURRENT_GATE='runtime bundle reconcile'
|
||||
log 'applying runtime bundle and restarting activity-core deployments'
|
||||
cluster_bash "set -euo pipefail
|
||||
kubectl apply -f $(quote "$ACTIVITY_CORE_REMOTE_REPO")/k8s/railiance/00-namespace.yaml
|
||||
kubectl -n $(quote "$NAMESPACE") delete job actcore-migrate actcore-sync --ignore-not-found
|
||||
kubectl apply -f $(quote "$ACTIVITY_CORE_REMOTE_REPO")/k8s/railiance/20-runtime.yaml
|
||||
if [[ $(quote "$ACTIVITY_CORE_RESTART_DEPLOYMENTS") == 1 ]]; then kubectl -n $(quote "$NAMESPACE") rollout restart deploy/actcore-api deploy/actcore-worker deploy/actcore-event-router; fi
|
||||
kubectl -n $(quote "$NAMESPACE") wait --for=condition=complete job/actcore-migrate --timeout=180s
|
||||
kubectl -n $(quote "$NAMESPACE") rollout status deploy/actcore-api --timeout=180s
|
||||
kubectl -n $(quote "$NAMESPACE") rollout status deploy/actcore-worker --timeout=180s
|
||||
kubectl -n $(quote "$NAMESPACE") rollout status deploy/actcore-event-router --timeout=180s
|
||||
kubectl -n $(quote "$NAMESPACE") wait --for=condition=complete job/actcore-sync --timeout=180s"
|
||||
|
||||
CURRENT_GATE='runtime status capture'
|
||||
API_IMAGE="$(cluster_bash "kubectl -n $(quote "$NAMESPACE") get deploy actcore-api -o jsonpath='{.spec.template.spec.containers[0].image}'")"
|
||||
API_IMAGE_ID="$(cluster_bash "kubectl -n $(quote "$NAMESPACE") get pod -l app.kubernetes.io/name=actcore-api -o jsonpath='{.items[0].status.containerStatuses[0].imageID}'")"
|
||||
WORKER_IMAGE="$(cluster_bash "kubectl -n $(quote "$NAMESPACE") get deploy actcore-worker -o jsonpath='{.spec.template.spec.containers[0].image}'")"
|
||||
WORKER_IMAGE_ID="$(cluster_bash "kubectl -n $(quote "$NAMESPACE") get pod -l app.kubernetes.io/name=actcore-worker -o jsonpath='{.items[0].status.containerStatuses[0].imageID}'")"
|
||||
SYNC_STATUS_JSON="$(cluster_bash "kubectl -n $(quote "$NAMESPACE") get job actcore-sync -o json" | python3 -c 'import json,sys; j=json.load(sys.stdin); s=j.get("status",{}); print(json.dumps({"name":j["metadata"]["name"],"succeeded":s.get("succeeded",0),"failed":s.get("failed",0),"completion_time":s.get("completionTime")}))')"
|
||||
export API_IMAGE API_IMAGE_ID WORKER_IMAGE WORKER_IMAGE_ID SYNC_STATUS_JSON
|
||||
|
||||
CURRENT_GATE='daily-triage manual trigger'
|
||||
log "triggering ${DAILY_TRIAGE_DEFINITION_SLUG}"
|
||||
TRIGGER_JSON="$(
|
||||
cluster_bash "$(cat <<EOF
|
||||
set -euo pipefail
|
||||
kubectl -n $(quote "$NAMESPACE") exec -i deploy/actcore-api -- python - $(quote "$DAILY_TRIAGE_DEFINITION_SLUG") <<'PY'
|
||||
import json, sys, urllib.request
slug = sys.argv[1]
with urllib.request.urlopen('http://localhost:8010/activity-definitions/', timeout=30) as resp:
    definitions = json.load(resp)
match = None
for definition in definitions:
    values = [str(definition.get(k) or '') for k in ('slug', 'name', 'id')]
    if slug in values or any(slug in value for value in values):
        match = definition
        break
if not match:
    raise SystemExit(f'definition matching {slug!r} not found')
definition_id = match['id']
req = urllib.request.Request(f'http://localhost:8010/activity-definitions/{definition_id}/trigger', method='POST')
with urllib.request.urlopen(req, timeout=30) as resp:
    payload = json.loads(resp.read().decode())
payload['definition_id'] = definition_id
print(json.dumps(payload, sort_keys=True))
|
||||
PY
|
||||
EOF
|
||||
)"
|
||||
)"
|
||||
DEFINITION_ID="$(python3 -c 'import json,os; print(json.loads(os.environ["TRIGGER_JSON"])["definition_id"])')"
|
||||
TRIGGER_KEY="$(python3 -c 'import json,os; t=json.loads(os.environ["TRIGGER_JSON"]); print(t.get("trigger_key") or t.get("workflow_id") or "")')"
|
||||
EXPECTED_RUN_ID="$(python3 - <<'PY'
|
||||
import os, uuid
|
||||
trigger_key = os.environ.get('TRIGGER_KEY')
|
||||
definition_id = os.environ.get('DEFINITION_ID')
|
||||
print(uuid.uuid5(uuid.NAMESPACE_URL, f'{definition_id}:{trigger_key}') if trigger_key else '')
|
||||
PY
|
||||
)"
|
||||
export TRIGGER_JSON DEFINITION_ID TRIGGER_KEY EXPECTED_RUN_ID
|
||||
|
||||
CURRENT_GATE='State Hub daily_triage evidence'
|
||||
log 'polling State Hub for schema-valid daily_triage progress'
|
||||
PROGRESS_JSON="$(python3 - <<'PY'
|
||||
from datetime import datetime
import json, os, time, urllib.parse, urllib.request
base = os.environ['STATE_HUB_URL'].rstrip('/')
started = datetime.fromisoformat(os.environ['STARTED_AT'].replace('Z', '+00:00'))
deadline = time.monotonic() + int(os.environ['STATE_HUB_PROGRESS_TIMEOUT_SECONDS'])
interval = int(os.environ['STATE_HUB_PROGRESS_POLL_SECONDS'])
expected_run_id = os.environ.get('EXPECTED_RUN_ID')
url = base + '/progress/?' + urllib.parse.urlencode({'event_type': 'daily_triage'})
while time.monotonic() < deadline:
    with urllib.request.urlopen(url, timeout=20) as resp:
        events = json.load(resp)
    for event in events:
        created_at = datetime.fromisoformat(event['created_at'].replace('Z', '+00:00'))
        if created_at < started:
            continue
        detail = event.get('detail') or {}
        if expected_run_id and isinstance(detail, dict):
            run_id = detail.get('activity_core_run_id') or detail.get('run_id')
            if run_id and run_id != expected_run_id:
                continue
        if not isinstance(detail, dict) or detail.get('output_validated') is not True:
            continue
        if detail.get('partial') is True and int(detail.get('quarantined_count') or 0) <= 0:
            continue
        print(json.dumps({'id': event['id'], 'event_type': event.get('event_type'), 'summary': event.get('summary'), 'author': event.get('author'), 'created_at': event.get('created_at'), 'output_validated': detail.get('output_validated'), 'partial': detail.get('partial'), 'quarantined_count': detail.get('quarantined_count'), 'activity_core_run_id': detail.get('activity_core_run_id'), 'detail_keys': sorted(detail.keys())}))
        raise SystemExit(0)
    time.sleep(interval)
raise SystemExit('no schema-valid daily_triage progress found')
|
||||
PY
|
||||
)"
|
||||
export PROGRESS_JSON
|
||||
|
||||
CURRENT_GATE='State Hub evidence note'
|
||||
log 'posting non-secret evidence note to State Hub'
|
||||
post_evidence passed ''
|
||||
trap - ERR
|
||||
log 'verification passed'
|
||||
|
|
@ -4,11 +4,12 @@ type: workplan
|
|||
title: "activity-core WP-0016 triage-output robustness deploy"
|
||||
domain: financials
|
||||
repo: railiance-cluster
|
||||
status: ready
|
||||
status: finished
|
||||
owner: railiance-cluster
|
||||
topic_slug: railiance
|
||||
created: "2026-07-01"
|
||||
updated: "2026-07-01"
|
||||
updated: "2026-07-02"
|
||||
state_hub_workstream_id: "7cbbe0d6-fea9-41c6-840c-46d0d8e8edde"
|
||||
---
|
||||
|
||||
# activity-core WP-0016 triage-output robustness deploy
|
||||
|
|
@ -31,20 +32,41 @@ whole-doc validator. It MUST ship together with the new `executor.py`
|
|||
|
||||
```task
|
||||
id: RAIL-BS-WP-0008-T01
|
||||
status: todo
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "079e39a9-f938-4d03-a5bc-4d3d2f7b1d83"
|
||||
```
|
||||
|
||||
Rebuild/import the activity-core image from main (`bf877b7` or later) into
the railiance01 k3s runtime and reconcile the activity-core deployment so the
new executor and the strict per-item schema ship together.

2026-07-02: Added `make deploy-activity-core-triage-robustness` /
`bin/railiance deploy-triage-robustness` as the repeatable operator path. The
command gates the remote activity-core repo on `bf877b7` or later, checks the
runtime bundle contract before applying it, restarts the activity-core
deployments by default, waits for migrate/sync jobs and rollouts, then records
non-secret State Hub evidence. Live execution on railiance01 remains pending.

2026-07-02 (later session): rebuilt `activity-core:railiance01-prod` locally
from activity-core main `7612112` (includes `bf877b7` and the T02 prompt
contract). Transfer/import to railiance01 was **blocked by the agent
permission policy** (production remote write requires explicit operator
authorization). Two preconditions found and fixed/noted: (a) the remote
`~/activity-core` copy has no `.git`, so the script's revision gate will fail
until the repo is synced with git metadata or `REQUIRED_ACTIVITY_CORE_REV`
verification is adapted; (b) the T02 runtime contract is now satisfied in the
repo bundle (activity-core commit `7612112`). Operator pickup: run the
image save/scp/import from the deploy README, sync the repo with `.git`, then
`make deploy-activity-core-triage-robustness`.
|
||||
|
||||
## Update daily-statehub-wsjf-triage runtime-bundle Instruction
|
||||
|
||||
```task
|
||||
id: RAIL-BS-WP-0008-T02
|
||||
status: todo
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "129fb472-41e8-4e5c-bcbb-0995a96e223b"
|
||||
```
|
||||
|
||||
In the runtime projection (not the activity-core repo), update the
|
||||
|
|
@ -58,12 +80,18 @@ In the runtime projection (not the activity-core repo), update the
|
|||
recommendation JSON object per line) so the T03 parser recovers items
|
||||
independently.
|
||||
|
||||
2026-07-02: The new deploy command enforces this contract against
`k8s/railiance/20-runtime.yaml` before it will touch the cluster: it requires
the daily instruction, a top-7 bound, the "fewer well-formed" fallback, NDJSON
or one-object-per-line framing, and `max_tokens` headroom of at least 1800.
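
The contract gate above can be tried locally before a deploy. The following is a minimal sketch that mirrors the checks in the deploy script; the function name `check_runtime_contract`, the `min_tokens` parameter, and the `SAMPLE` bundle text are illustrative assumptions, not part of the repo.

```python
import re


def check_runtime_contract(text: str, min_tokens: int = 1800) -> dict:
    """Evaluate the runtime-bundle contract checks against raw bundle text."""
    lower = text.lower()
    # Collect every max_tokens value, whether written as "key: 123" or "key=123".
    max_tokens = [int(v) for v in re.findall(r"max_tokens\s*[:=]\s*['\"]?(\d+)", text)]
    checks = {
        "mentions_daily_instruction": "daily-statehub-wsjf-triage" in lower,
        "bounded_top_7": bool(
            re.search(r"(top[- ]?7|<=\s*7|≤\s*7|at most\s+7|no more than\s+7)", lower)
        ),
        "fewer_well_formed": "fewer well-formed" in lower,
        "ndjson_or_line_framing": "ndjson" in lower
        or "one recommendation json object per line" in lower,
        "max_tokens_headroom": bool(max_tokens and max(max_tokens) >= min_tokens),
    }
    missing = [name for name, ok in checks.items() if not ok]
    return {"max_tokens": max_tokens, "checks": checks, "missing": missing}


# Hypothetical bundle excerpt, used only to exercise the checks.
SAMPLE = """
name: daily-statehub-wsjf-triage
max_tokens: 2000
instruction: Return the top-7 items as NDJSON; prefer fewer well-formed items.
"""

if __name__ == "__main__":
    print(check_runtime_contract(SAMPLE))
```

Because the checks are plain substring and regex matches, they confirm that the wording is present in the bundle, not that the model follows it; the live smoke run remains the real proof.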
|
||||
|
||||
## Pull raw llm-connect response for the 2026-06-26 run
|
||||
|
||||
```task
|
||||
id: RAIL-BS-WP-0008-T03
|
||||
status: todo
|
||||
status: cancel
|
||||
priority: medium
|
||||
state_hub_task_id: "59559f1d-821f-4660-8a7d-c623c6631864"
|
||||
```
|
||||
|
||||
From the llm-connect pod logs / response store on railiance01, capture the
|
||||
|
|
@ -76,8 +104,9 @@ secrets.
|
|||
|
||||
```task
|
||||
id: RAIL-BS-WP-0008-T04
|
||||
status: todo
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "8096621a-54ee-4be5-943e-5dc2da19ed28"
|
||||
```
|
||||
|
||||
Trigger one daily-triage run against the reconciled runtime and confirm it
|
||||
|
|
@ -87,3 +116,36 @@ either (i) returns a clean schema-valid report, or (ii) degrades gracefully
|
|||
shows a matching `daily_triage` progress event. Closes ACTIVITY-WP-0016-T05
|
||||
and unblocks the three-clean-run streak for ACTIVITY-WP-0010-T04 /
|
||||
WP-0006-T03.
|
||||
|
||||
2026-07-02: The deploy command now triggers the daily-triage definition after
reconcile and polls State Hub for a post-trigger `daily_triage` event with
`output_validated=true`. If the run is partial, it also requires
`quarantined_count>0` before posting pass evidence.
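
The acceptance rule for a `daily_triage` event can be written as a single predicate. This is a sketch of the rule as described here, assuming `detail` is the event's `detail` object; the function name `is_passing_triage_detail` is hypothetical.

```python
def is_passing_triage_detail(detail) -> bool:
    """Return True when an event detail counts as pass evidence."""
    # The event must carry a dict detail with output_validated exactly True.
    if not isinstance(detail, dict) or detail.get("output_validated") is not True:
        return False
    # A partial run only counts if at least one item was quarantined.
    if detail.get("partial") is True and int(detail.get("quarantined_count") or 0) <= 0:
        return False
    return True


if __name__ == "__main__":
    print(is_passing_triage_detail({"output_validated": True}))
    print(is_passing_triage_detail({"output_validated": True, "partial": True}))
```

The strict `is not True` comparison means a string such as `"true"` is rejected, which keeps loosely typed payloads from being counted as validated.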
|
||||
|
||||
## Completion 2026-07-02
|
||||
|
||||
Deployed live with operator authorization. Image `activity-core:railiance01-prod`
rebuilt from main `7612112`, imported into railiance01 k3s
(`sha256:550c5592...`), repo synced with git metadata, and
`make deploy-activity-core-triage-robustness` applied the coupled
schema/executor bundle with all rollouts and migrate/sync jobs green.

- T01/T02 done: revision gate and runtime contract gate both passed
  (`bounded_top_7`, `ndjson_or_line_framing`, `fewer_well_formed`,
  `max_tokens_headroom` >= 1800 all true).
- T04 done: manually triggered daily-triage run produced a clean schema-valid
  report — State Hub event `24d2d321-c761-47f7-bf9e-7950a6253c21`
  (2026-07-02T09:50:44Z) with `output_validated=true`, exactly 7 ranked
  recommendations, `working_memory_status=written`, no validation error. The
  bounded top-7 contract is proven live; the three-clean-run streak for
  ACTIVITY-WP-0010-T04 / WP-0006-T03 restarts from this run.
- T03 cancelled: the raw 2026-06-26 llm-connect response is unrecoverable —
  the llm-connect pod is stateless (no volumes, no response store) and its
  log stream contains only 2 startup lines from 2026-06-19. Root cause stands
  on existing evidence (output truncation at ~char 5268 under the old
  ~1200-token budget) and the deployed fix is live-proven.
- Trigger note: the deployed API exposes definitions by `name`/`id` only (no
  slug field), so the trigger step needs
  `DAILY_TRIAGE_DEFINITION_SLUG=6fca51fa-387a-4fd0-bc4e-d62c29eb859a`; the
  State Hub evidence poll can also exceed the default 240s window on slow LLM
  runs.
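
The trigger note can be illustrated with the lookup the trigger step performs. This is a sketch under assumptions: the helper name `find_definition` and the `name` value in the sample are hypothetical; only the definition ID is taken from the note above.

```python
def find_definition(definitions, needle):
    """Return the first definition whose slug, name, or id contains needle."""
    for definition in definitions:
        values = [str(definition.get(key) or "") for key in ("slug", "name", "id")]
        if any(needle in value for value in values):
            return definition
    return None


# Hypothetical API listing: no slug field, and a human-readable name.
DEFINITIONS = [
    {"id": "6fca51fa-387a-4fd0-bc4e-d62c29eb859a", "name": "Daily State Hub WSJF triage"},
]

if __name__ == "__main__":
    # The slug does not appear in name or id, so the lookup finds nothing.
    print(find_definition(DEFINITIONS, "daily-statehub-wsjf-triage"))
    # Passing the definition ID as the "slug" matches on the id field.
    print(find_definition(DEFINITIONS, "6fca51fa-387a-4fd0-bc4e-d62c29eb859a"))
```

Because the match is a substring test, a short or generic value could select the wrong definition when several are listed; passing the full ID avoids that ambiguity.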
|
||||
|
|
|
|||
|
|
@ -4,11 +4,11 @@ type: workplan
|
|||
title: "activity-core no-restart admin-sync smoke (ACTIVITY-WP-0012-T05)"
|
||||
domain: financials
|
||||
repo: railiance-cluster
|
||||
status: active
|
||||
status: finished
|
||||
owner: railiance-cluster
|
||||
topic_slug: railiance
|
||||
created: "2026-07-01"
|
||||
updated: "2026-07-01"
|
||||
updated: "2026-07-02"
|
||||
state_hub_workstream_id: "2c9e8e96-ec6a-433c-9e6d-0efbcd18679e"
|
||||
---
|
||||
|
||||
|
|
@ -30,7 +30,7 @@ The deploy precondition is covered by RAIL-BS-WP-0008-T01 (main at
|
|||
|
||||
```task
|
||||
id: RAIL-BS-WP-0009-T01
|
||||
status: wait
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "60f3387d-3d14-42a9-b8a3-725a86468510"
|
||||
```
|
||||
|
|
@ -46,3 +46,25 @@ After RAIL-BS-WP-0008-T01 is deployed, without restarting the worker:
|
|||
5. Record non-secret evidence in the State Hub. Response JSON should include
|
||||
`definitions.synced`, `schedules.upserted`, `schedules.paused`,
|
||||
`schedules.deleted_orphans`, and `errors[]`.
|
||||
|
||||
2026-07-02: Added `make admin-sync-smoke` / `bin/railiance admin-sync-smoke`
as the repeatable operator path. It captures the worker pod UID/restart count,
optionally runs an operator-supplied enabled-flip/rename fixture via
`ACTIVITY_CORE_ADMIN_SYNC_FIXTURE_COMMAND`, calls
`POST /admin/sync?definitions=true&schedules=true`, verifies the expected
response counters and empty `errors[]`, rechecks that the same worker pod did
not restart, and posts non-secret State Hub evidence. T01 stays `wait` until
RAIL-BS-WP-0008-T01 is deployed and the smoke is run on railiance01.
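
The verification step of the smoke can be sketched as a pure function over the sync response and the worker pod snapshots. This is an illustration only: it assumes the dotted counter names map to nested JSON objects and that pod identity is captured as a small dict; `admin_sync_problems` and the snapshot shape are hypothetical names, not the smoke script's own.

```python
REQUIRED_COUNTERS = (
    ("definitions", "synced"),
    ("schedules", "upserted"),
    ("schedules", "paused"),
    ("schedules", "deleted_orphans"),
)


def admin_sync_problems(response, worker_before, worker_after):
    """Return a list of human-readable problems; an empty list means pass."""
    problems = []
    if response.get("ok") is not True:
        problems.append("ok is not true")
    # Every expected counter must be present as an integer.
    for section, counter in REQUIRED_COUNTERS:
        value = (response.get(section) or {}).get(counter)
        if not isinstance(value, int):
            problems.append(f"missing counter {section}.{counter}")
    if response.get("errors") != []:
        problems.append("errors[] is not empty")
    # Same pod and same restart count before and after proves no restart.
    if worker_before != worker_after:
        problems.append("worker pod identity or restart count changed")
    return problems


if __name__ == "__main__":
    response = {
        "ok": True,
        "definitions": {"synced": 6},
        "schedules": {"upserted": 4, "paused": 2, "deleted_orphans": 0},
        "errors": [],
    }
    worker = {"name": "actcore-worker-5b78f85b76-ng54t", "restart_count": 0}
    print(admin_sync_problems(response, worker, dict(worker)))
```

Returning a list of problems instead of a boolean keeps the failing condition visible in the evidence note when the smoke does not pass.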
|
||||
|
||||
## Completion 2026-07-02
|
||||
|
||||
`make admin-sync-smoke` passed against the freshly deployed
RAIL-BS-WP-0008 runtime: `POST /admin/sync?definitions=true&schedules=true`
returned `ok=true` with `definitions.synced=6`, `schedules.upserted=4`,
`schedules.paused=2`, `deleted_orphans=0`, empty `errors[]`, and the worker
pod identity (`actcore-worker-5b78f85b76-ng54t`, restart_count 0) was
unchanged before and after — proving no-restart admin sync. Non-secret
evidence: State Hub event `4caa288d-830b-4348-9cff-b2d5855cd42d`. The
optional enabled-flip fixture was skipped (no operator fixture supplied);
schedule pause/upsert semantics were exercised by the sync counters. Closes
ACTIVITY-WP-0012-T05.
|
||||
|
|
|
|||