Test Capsule: The Per-Test Proof Contract That Ends "Trust Me, We Ran It"

1. The scanner-report epistemology problem

Every scanner produces a report. Most reports say things like "443 tests passed, 12 failed, 3 flagged for review." The problem with that report is that it is unverifiable from the outside. Three months later, when an auditor or a customer or your own incident-response team asks "did test ECOM-RED-052 run on host X at 03:14 UTC on May 12?", the report cannot answer. The scanner has aged out the per-test artifacts, or never wrote them down, or wrote them down in a format nobody can replay. The verdict survives; the evidence does not.

We built the Proof Capsule framework to solve this for findings. A Proof Capsule packages a single finding's PoC, expected output, replay script, and signed metadata into one artifact that can be re-executed against the customer's environment months later and produce the same result. It is the contract that turns a finding into a fact.

The Test Capsule is the same contract applied one level deeper, to every test, not just every finding. PASS results matter as much as FAIL results, because a PASS today is the baseline against which a regression tomorrow gets measured. The Test Capsule schema is what makes "we ran this test against this asset and it returned this result" an auditable claim instead of a verbal assurance.

2. Schema

The on-disk shape is small, on purpose. Anything bigger gets dropped on the floor by storage rotation policy.

# test_capsule.yaml — schema v1
capsule_id: TC-ECOM-RED-052-20260512T031400Z-7a2cfbd8
test_id: ECOM-RED-052
test_version: 3.1.0
catalogue_revision: 6280

target:
  asset: storefront.example-customer.internal
  asset_fingerprint_sha256: 5f8a...
  scan_run_id: SR-20260512-031200
  engagement: eng_7a2cfbd8df87       # opaque code per OPSEC policy

execution:
  started_at: 2026-05-12T03:14:00.412Z
  completed_at: 2026-05-12T03:14:01.879Z
  duration_ms: 1467
  scanner_handler: core.scanners.ecom.red.bola_idor:probe
  scanner_pid: 28194
  scanner_image: celvex-dev@sha256:9b1634...

inputs:
  endpoint: /api/v1/orders/{order_id}
  method: GET
  payload_class: idor-numeric-enum
  payload_sample: order_id=10047 (true) vs order_id=10048 (cross-tenant)
  auth_context: customer-a-session
  rate_limit_window_ms: 800

observations:
  request_count: 2
  response_status_codes: [200, 200]
  response_size_bytes: [3812, 3817]
  decision_signal: cross_tenant_record_returned
  decision_rule: "response_2.tenant_id != session.tenant_id AND status_2 == 200"

verdict:
  result: FAIL
  severity: high
  cwe: CWE-639
  finding_id: CELVEX-2026-05-12-ECOM-052-eng7a2cfbd8

evidence_refs:
  - s3://celvex-proof-capsules/2026/05/12/SR-20260512-031200/TC-ECOM-RED-052/request_1.har
  - s3://celvex-proof-capsules/2026/05/12/SR-20260512-031200/TC-ECOM-RED-052/request_2.har
  - s3://celvex-proof-capsules/2026/05/12/SR-20260512-031200/TC-ECOM-RED-052/decision_trace.jsonl

signature:
  algorithm: ed25519_local
  key_id: celvex-scanner-pool-2026Q2
  signature: base64(...)
  signed_at: 2026-05-12T03:14:01.901Z

The fields fall into six groups: identity (what test ran and which version of it), target (which asset, which scan run), execution (when, by what process, how long), observations (the raw signals the test extracted), verdict (the decision and its provenance), and signature (who signed the result and with what key).

Notice what is absent. There is no marketing copy. There is no remediation advice. There is no severity narrative. Those belong in the Proof Capsule, which is the customer-facing artifact. The Test Capsule is the evidence under the Proof Capsule: terse, machine-readable, and signed.
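
Because the capsule is machine-readable, a consumer can sanity-check its shape before trusting it. The following is a minimal sketch, not the production validator: it assumes the YAML has already been loaded into a Python dict, and the names check_capsule and REQUIRED_GROUPS are hypothetical. It checks that the top-level groups from the example are present and that duration_ms agrees with the two timestamps.

```python
from datetime import datetime

# Top-level groups as they appear in the schema v1 example (assumed mandatory here).
REQUIRED_GROUPS = ("target", "execution", "inputs", "observations",
                   "verdict", "evidence_refs", "signature")


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp that ends in 'Z' (UTC)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def check_capsule(capsule: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks sane."""
    problems = [f"missing group: {g}" for g in REQUIRED_GROUPS if g not in capsule]
    execution = capsule.get("execution", {})
    try:
        started = parse_ts(execution["started_at"])
        completed = parse_ts(execution["completed_at"])
    except (KeyError, ValueError):
        problems.append("execution timestamps missing or unparseable")
        return problems
    # Recompute the duration from the timestamps and compare with the stored value.
    elapsed_ms = round((completed - started).total_seconds() * 1000)
    if elapsed_ms != execution.get("duration_ms"):
        problems.append(f"duration_ms mismatch: computed {elapsed_ms}")
    return problems
```

Run against the example above, the two timestamps differ by 1.467 seconds, which matches the recorded duration_ms of 1467.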

3. The decision-rule field is load-bearing

The single most important field is decision_rule. It is the small predicate that turned the observations into the verdict. In our ECOM-RED-052 example the rule is response_2.tenant_id != session.tenant_id AND status_2 == 200. That is exactly what a human re-running the test six months from now will check. The rule is what tells the auditor whether the test detected the real bug or got lucky.
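
To make the predicate concrete, here is one way the ECOM-RED-052 rule could be expressed in Python. This is an illustrative sketch only: the Response type, its tenant_id attribute, and the function names are assumptions made for the example, not the scanner's actual handler code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    status: int
    tenant_id: str  # hypothetical: tenant that owns the record in the response body


def cross_tenant_record_returned(session_tenant_id: str, response_2: Response) -> bool:
    """response_2.tenant_id != session.tenant_id AND status_2 == 200"""
    return response_2.tenant_id != session_tenant_id and response_2.status == 200


def verdict_for(session_tenant_id: str, response_2: Response) -> str:
    """Map the decision signal to a verdict string."""
    if cross_tenant_record_returned(session_tenant_id, response_2):
        return "FAIL"
    return "PASS"
```

The point of keeping the rule this small is that anyone replaying the stored request and response can evaluate it by hand and arrive at the same verdict.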

A test that reports a verdict without a decision rule is reporting a vibe. We refuse to ship those. Our no_scanner_stubs pre-commit guard catches new handlers that lack an explicit decision rule before they land on main; the guard is fast (~2 seconds, AST-only) and runs on every commit that touches core/scanners/ or core/wave3_scanning/. It rejects any handler whose body collapses to a return PASS or return FAIL with no observable signal extraction in between.

4. Storage and rotation

Capsules are heavier than verdicts. A scan run that touches 600 assets across 200 tests produces roughly 120,000 capsules. Storing them all forever is not free.

Our policy is tiered across hot, warm, and cold storage.

Eviction from hot to warm is a fixed daily job. Eviction from warm to cold is a fixed weekly job. Both jobs verify the signature on the capsule before moving it; a capsule whose signature has been tampered with gets quarantined to a separate forensics/ prefix and a Sev-1 alert fires. We have not had a quarantine event in production. We have had three in chaos-engineering drills, all of which the pipeline caught.
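
The verify-before-move step can be sketched as follows. This is an assumption-laden illustration: the evict function and its signature are hypothetical, the verifier is injected so the example stays within the standard library, and the HMAC-based verifier is a stand-in for demonstration only, not the Ed25519 scheme named in the schema.

```python
import hashlib
import hmac
from typing import Callable

Verifier = Callable[[bytes, bytes], bool]  # (payload, signature) -> valid?


def evict(capsules: dict[str, tuple[bytes, bytes]], verify: Verifier,
          dest_prefix: str) -> dict[str, str]:
    """Map each capsule id to its destination key, verifying signatures first."""
    placement = {}
    for capsule_id, (payload, signature) in capsules.items():
        if verify(payload, signature):
            placement[capsule_id] = f"{dest_prefix}/{capsule_id}"
        else:
            # Signature does not verify: route to forensics/ instead of the next tier.
            # Raising the Sev-1 alert is left to the caller in this sketch.
            placement[capsule_id] = f"forensics/{capsule_id}"
    return placement


def hmac_verifier(key: bytes) -> Verifier:
    """Stand-in verifier for this sketch only; NOT Ed25519."""
    def verify(payload: bytes, signature: bytes) -> bool:
        expected = hmac.new(key, payload, hashlib.sha256).digest()
        return hmac.compare_digest(expected, signature)
    return verify
```

In a real deployment the injected verifier would wrap an Ed25519 public-key check for the key_id recorded in the capsule; the routing logic around it would not need to change.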

5. Why per-test, not per-finding

This is the question we get asked most often, and the answer is regression detection.

Most scanners only persist evidence for FAIL results. The reasoning is intuitive: a PASS has no finding, so why store the evidence? The answer is that a PASS today is the baseline a regression has to break. If the scanner says ECOM-RED-052 = PASS on May 12 and ECOM-RED-052 = FAIL on May 19, the evidence the team needs is the PASS capsule from May 12. That is the artifact that tells the team whether the test changed, the asset changed, the network changed, or the threat changed. Without the PASS capsule, the team is guessing.
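
A first-pass triage of a PASS-to-FAIL transition can be done by diffing the two capsules field by field. The sketch below assumes both capsules are loaded as dicts in the schema v1 shape; the function name and the grouping of fields into "test", "asset", and "scanner" are hypothetical choices for illustration. It only reports which recorded dimensions differ and leaves the interpretation to a human.

```python
def changed_dimensions(baseline: dict, current: dict) -> list[str]:
    """List which recorded dimensions differ between two capsules of one test."""
    probes = {
        # Did the test logic or catalogue change between runs?
        "test": lambda c: (c["test_version"], c["catalogue_revision"]),
        # Did the asset itself change?
        "asset": lambda c: c["target"]["asset_fingerprint_sha256"],
        # Did the scanner build change?
        "scanner": lambda c: c["execution"]["scanner_image"],
    }
    return [name for name, probe in probes.items()
            if probe(baseline) != probe(current)]
```

If the list comes back empty, the test, the asset, and the scanner are all recorded as unchanged, which points the investigation toward the network path or a change in what the application is doing.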

We learned this the hard way in early 2026. A customer-side WAF rule change started silently flipping a CSRF test from FAIL to PASS: the test was getting blocked at the WAF before it reached the application. Without the per-test capsules, we would have read the PASS as good news and moved on. With them, the network-layer trace inside the capsule showed a 403 Forbidden at the edge, the decision rule downgraded the verdict to INCONCLUSIVE, and the customer's team rolled back the WAF rule and re-ran the scan against the now-reachable application. The application was still vulnerable. Without the capsule, we would have missed it for the entire WAF policy lifetime.
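
The downgrade logic in that story can be illustrated with a small decision function. This is a sketch under stated assumptions, not the actual CSRF handler: it assumes the capsule's network trace exposes the status code returned at the edge (edge_status, or None when no edge device answered) and a boolean recording whether the application accepted the forged request. Both parameter names are hypothetical.

```python
from typing import Optional


def csrf_verdict(edge_status: Optional[int], app_accepted_forged_request: bool) -> str:
    """Return PASS, FAIL, or INCONCLUSIVE for a CSRF probe."""
    # A 403 at the edge means the probe never reached the application,
    # so the evidence supports neither PASS nor FAIL.
    if edge_status == 403:
        return "INCONCLUSIVE"
    return "FAIL" if app_accepted_forged_request else "PASS"
```

The design choice worth noting is the third verdict value: a rule that can only say PASS or FAIL has no way to record that the probe was blocked before it could test anything.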

6. The audit-readiness consequence

A Test Capsule store with a 7-year retention policy converts the scanner from an operational tool into an evidentiary one. Customers in regulated industries (financial services, healthcare, public sector) have asked us for exactly this for a year. The PCI auditor does not want to know that you have a scanner. They want to know what the scanner saw, when, on which assets, with which rule logic, and they want a signed artifact that proves it. The Test Capsule is that artifact.

The same capsule satisfies internal change-management. When a developer ships a fix and the engagement re-runs the affected tests, the new capsules link back to the old ones via test_id and asset_fingerprint_sha256. The dashboard shows the verdict transition, the signature lineage, and the duration delta. The fix is verified by evidence, not by ticket comments.

We ship every customer their Test Capsules. We always have. There is no premium tier that withholds them. Per-test signed evidence is the contract we want every scanner vendor to offer; until then, it is the contract you should ask yours to start writing.

Verifiable security.