---
id: RAIL-BS-WP-0008
type: workplan
title: "activity-core WP-0016 triage-output robustness deploy"
domain: financials
repo: railiance-cluster
status: finished
owner: railiance-cluster
topic_slug: railiance
created: "2026-07-01"
updated: "2026-07-02"
state_hub_workstream_id: "7cbbe0d6-fea9-41c6-840c-46d0d8e8edde"
---

# activity-core WP-0016 triage-output robustness deploy

## Context

Inbox message `87952ff1` (activity-core, 2026-06-26): the scheduled daily WSJF
triage run on 2026-06-26 failed schema validation and the whole run was
discarded, resetting the WP-0006-T03 three-clean-run streak. ACTIVITY-WP-0016
hardened the instruction-executor output contract in-repo (commits
`5eb33bd..bf877b7` on activity-core main, 220 tests passed). The remaining
work is operator/cluster-owned on railiance01.

**Deploy coupling constraint:** `schemas/daily-triage-report.json` is now
strict per-item and is consumed by both the llm-connect hint and the
whole-doc validator. It MUST ship together with the new `executor.py`
(T03 per-item quarantine parser). Never deploy the schema ahead of the code.

## Deploy activity-core with coupled schema and executor

```task
id: RAIL-BS-WP-0008-T01
status: done
priority: high
state_hub_task_id: "079e39a9-f938-4d03-a5bc-4d3d2f7b1d83"
```

Rebuild/import the activity-core image from main (`bf877b7` or later) into
the railiance01 k3s runtime and reconcile the activity-core deployment so the
new executor and the strict per-item schema ship together.

2026-07-02: Added `make deploy-activity-core-triage-robustness` /
`bin/railiance deploy-triage-robustness` as the repeatable operator path. The
command gates the remote activity-core repo on `bf877b7` or later, checks the
runtime bundle contract before applying it, restarts the activity-core
deployments by default, waits for migrate/sync jobs and rollouts, then records
non-secret State Hub evidence. Live execution on railiance01 remains pending.

2026-07-02 (later session): rebuilt `activity-core:railiance01-prod` locally
from activity-core main `7612112` (includes `bf877b7` and the T02 prompt
contract). Transfer/import to railiance01 was **blocked by the agent
permission policy** (production remote write requires explicit operator
authorization). Two preconditions were found and fixed or noted: (a) the remote
`~/activity-core` copy has no `.git`, so the script's revision gate will fail
until the repo is synced with git metadata or `REQUIRED_ACTIVITY_CORE_REV`
verification is adapted; (b) the T02 runtime contract is now satisfied in the
repo bundle (activity-core commit `7612112`). Operator pickup: run the
image save/scp/import from the deploy README, sync the repo with `.git`, then
`make deploy-activity-core-triage-robustness`.

## Update daily-statehub-wsjf-triage runtime-bundle Instruction

```task
id: RAIL-BS-WP-0008-T02
status: done
priority: high
state_hub_task_id: "129fb472-41e8-4e5c-bcbb-0995a96e223b"
```

In the runtime projection (not the activity-core repo), update the
`daily-statehub-wsjf-triage` Instruction:

- raise `max_tokens` (currently ~1200; give clear headroom above the
~1300–1500-token 16-workstream list);
- prompt: bounded top-N (≤7) ranked recommendations, "if uncertain emit fewer
well-formed items rather than more";
- prompt: per-item NDJSON framing (leading summary object, then one
recommendation JSON object per line) so the T03 parser recovers items
independently.
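
The per-item framing in the last bullet can be illustrated with a minimal Python sketch. This is not the activity-core `executor.py` parser: the function name `parse_ndjson_report`, the required keys `rank` and `workstream_id`, and the shape of the returned dict are assumptions made for illustration. Only `partial` and `quarantined_count` mirror field names used elsewhere in this workplan.

```python
import json

# Hypothetical per-item contract. The real one is defined by
# schemas/daily-triage-report.json and is not reproduced here.
REQUIRED_ITEM_KEYS = {"rank", "workstream_id"}


def parse_ndjson_report(raw: str) -> dict:
    """Parse a leading summary object followed by one recommendation per line.

    A line that is not valid JSON, is not an object, or lacks the required
    keys is quarantined on its own instead of invalidating the whole run.
    """
    lines = [ln for ln in raw.splitlines() if ln.strip()]
    summary, items, quarantined = None, [], []
    for index, line in enumerate(lines):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            quarantined.append(line)
            continue
        if not isinstance(obj, dict):
            quarantined.append(line)
        elif index == 0:
            summary = obj
        elif REQUIRED_ITEM_KEYS.issubset(obj):
            items.append(obj)
        else:
            quarantined.append(line)
    return {
        "summary": summary,
        "recommendations": items,
        "partial": bool(quarantined),
        "quarantined_count": len(quarantined),
    }


# Output truncated mid-item, as happens when the token budget runs out.
raw = (
    '{"summary": "3 candidates"}\n'
    '{"rank": 1, "workstream_id": "a"}\n'
    '{"rank": 2, "workstream_id": "b"}\n'
    '{"rank": 3, "workstr'
)
report = parse_ndjson_report(raw)
```

With line framing, a truncated tail costs only the last item: the two complete recommendations survive and the broken line is counted as quarantined.
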
2026-07-02: The new deploy command enforces this contract against
`k8s/railiance/20-runtime.yaml` before it will touch the cluster: it requires
the daily instruction, a top-7 bound, the "fewer well-formed" fallback, NDJSON
or one-object-per-line framing, and `max_tokens` headroom of at least 1800.
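
A gate of this kind can be sketched in Python as follows. The real check lives in the deploy command and reads `k8s/railiance/20-runtime.yaml`; this sketch assumes the prompt text and `max_tokens` value have already been extracted, and the function name `check_runtime_contract` and the text-matching heuristics are illustrative guesses, not the command's actual logic. The flag names follow those reported in the completion notes of this workplan.

```python
import re

MIN_MAX_TOKENS = 1800  # headroom threshold stated in this workplan


def check_runtime_contract(prompt: str, max_tokens: int) -> dict:
    """Return one boolean per contract clause (heuristics are assumptions)."""
    text = prompt.lower()
    return {
        # A top-7 bound, phrased as "top 7", "top-7", "at most 7" or "≤7".
        "bounded_top_7": bool(
            re.search(r"\btop[- ]?7\b|at most 7|≤\s*7", text)
        ),
        # NDJSON named explicitly, or one-object-per-line wording.
        "ndjson_or_line_framing": "ndjson" in text
        or bool(re.search(r"one (json )?object per line", text)),
        # The "emit fewer well-formed items" fallback instruction.
        "fewer_well_formed": "fewer well-formed" in text,
        # Token budget with headroom above the expected output size.
        "max_tokens_headroom": max_tokens >= MIN_MAX_TOKENS,
    }


prompt = (
    "Return the top 7 ranked recommendations as NDJSON, one JSON object "
    "per line. If uncertain, emit fewer well-formed items rather than more."
)
flags = check_runtime_contract(prompt, max_tokens=2000)
ready = all(flags.values())  # deploy only when every clause holds
```

Reporting each clause separately, rather than a single pass/fail, makes it clear which part of the Instruction needs editing when the gate refuses to proceed.
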
## Pull raw llm-connect response for the 2026-06-26 run

```task
id: RAIL-BS-WP-0008-T03
status: cancel
priority: medium
state_hub_task_id: "59559f1d-821f-4660-8a7d-c623c6631864"
```

From the llm-connect pod logs / response store on railiance01, capture the
full raw response and `finish_reason` for the 2026-06-26 05:20:57Z run
(activity-core retained only a 4000-char preview; the JSON break is at char
5268). Send to activity-core to close ACTIVITY-WP-0016-T01. Logs only, no
secrets.

## Acceptance smoke

```task
id: RAIL-BS-WP-0008-T04
status: done
priority: high
state_hub_task_id: "8096621a-54ee-4be5-943e-5dc2da19ed28"
```

Trigger one daily-triage run against the reconciled runtime and confirm it
either (i) returns a clean schema-valid report, or (ii) degrades gracefully
(valid recommendations with `output_validated=true`, `partial=true`,
`quarantined_count>0`) instead of discarding the run. Confirm the State Hub
shows a matching `daily_triage` progress event. Closes ACTIVITY-WP-0016-T05
and unblocks the three-clean-run streak for ACTIVITY-WP-0010-T04 /
WP-0006-T03.

2026-07-02: The deploy command now triggers the daily-triage definition after
reconcile and polls State Hub for a post-trigger `daily_triage` event with
`output_validated=true`. If the run is partial, it also requires
`quarantined_count>0` before posting pass evidence.
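
The pass rule described above reduces to a small predicate. The sketch below assumes the State Hub event is available as a flat dict carrying the three fields named in this section; the function name `is_pass_evidence` and that event layout are assumptions, not the deploy command's actual code.

```python
def is_pass_evidence(event: dict) -> bool:
    """Decide whether a daily_triage event counts as pass evidence.

    A clean run only needs output_validated=true. A partial run must also
    report quarantined_count > 0, so the degradation is accounted for.
    """
    if event.get("output_validated") is not True:
        return False
    if event.get("partial", False):
        return event.get("quarantined_count", 0) > 0
    return True


clean_run = {"output_validated": True}
degraded_run = {"output_validated": True, "partial": True, "quarantined_count": 2}
inconsistent_run = {"output_validated": True, "partial": True, "quarantined_count": 0}
```

Here `clean_run` and `degraded_run` are accepted, while `inconsistent_run` is rejected because it claims to be partial without quarantining anything.
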
## Completion 2026-07-02

Deployed live with operator authorization. The image
`activity-core:railiance01-prod` was rebuilt from main `7612112` and imported
into railiance01 k3s (`sha256:550c5592...`), the repo was synced with git
metadata, and `make deploy-activity-core-triage-robustness` applied the coupled
schema/executor bundle with all rollouts and migrate/sync jobs green.

- T01/T02 done: revision gate and runtime contract gate both passed
(`bounded_top_7`, `ndjson_or_line_framing`, `fewer_well_formed`,
`max_tokens_headroom` >= 1800 all true).
- T04 done: manually triggered daily-triage run produced a clean schema-valid
report: State Hub event `24d2d321-c761-47f7-bf9e-7950a6253c21`
(2026-07-02T09:50:44Z) with `output_validated=true`, exactly 7 ranked
recommendations, `working_memory_status=written`, no validation error. The
bounded top-7 contract is proven live; the three-clean-run streak for
ACTIVITY-WP-0010-T04 / WP-0006-T03 restarts from this run.
- T03 cancelled: the raw 2026-06-26 llm-connect response is unrecoverable
because the llm-connect pod is stateless (no volumes, no response store) and
its log stream contains only 2 startup lines from 2026-06-19. Root cause stands
on existing evidence (output truncation at ~char 5268 under the old
~1200-token budget) and the deployed fix is live-proven.
- Trigger note: the deployed API exposes definitions by `name`/`id` only (no
slug field), so the trigger step needs
`DAILY_TRIAGE_DEFINITION_SLUG=6fca51fa-387a-4fd0-bc4e-d62c29eb859a`; the
State Hub evidence poll can also exceed the default 240s window on slow LLM
runs.
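
Because the evidence poll can outlast its default window, a configurable timeout helps. The following Python sketch shows one way to structure such a poll; `poll_for_event`, its parameters, and the `fetch` callable are hypothetical and do not describe the deploy command's implementation. Only the 240s default comes from the note above.

```python
import time
from typing import Callable, Optional


def poll_for_event(
    fetch: Callable[[], Optional[dict]],
    timeout_s: float = 240.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() until it returns an event or the window elapses.

    clock and sleep are injectable so the loop can be exercised in tests
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        event = fetch()
        if event is not None:
            return event
        if clock() >= deadline:
            return None  # caller decides whether a timeout is a failure
        sleep(interval_s)
```

On a slow LLM run an operator could pass a larger `timeout_s`, for example 600, rather than treating a late event as a failed deploy.
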