A Coverage Gap Is a Finding: Why "Manual Proof On-Demand" Buckets Are Where Real Risk Hides

1. The bucket where bugs go to wait

Open any commercial vulnerability scanner's documentation and look for a phrase such as "manual proof", "out-of-band check", "requires interactive testing", or "expert mode". You will find a section explaining that certain checks cannot be automated and are therefore available only on a manual-trigger basis. The section is usually short, framed as a feature, and presented as the responsible choice. The reality is different. That bucket is the largest single source of false negatives in modern scanning, and most teams using the scanner never realise it exists.

We had the same bucket. Six months ago it held 3,429 tests. The bucket was justified internally with the same arguments commercial scanners use: these tests are noisy, these tests need credentials, these tests are slow, these tests need a human to interpret. Every argument was real. Every argument was also a coverage gap dressed up as a process control.

The customer-side reality of the bucket is brutal. A customer pays for continuous scanning of their attack surface. The scanner runs daily, weekly, or per-deploy. The customer reads the dashboard, sees no critical findings, and concludes the surface is clean. The bucket sits unread on the analyst's screen because nobody has the operational cycles to dispatch 3,429 manual tests per asset per week. The bugs in the bucket do not vanish. They are visible to anyone with the attack-surface knowledge to look for them. They are invisible to the customer who is paying us to look first.

2. The policy change

The founder's directive in late 2025 was a single sentence: every test that can run must run on every continuous-monitoring scan. There is no "manual proof, on-demand" bucket. The only tests that skip are the ones that are physically unable to execute against the customer's exposed surface: those targeting internal-only services, credentialled paths the customer has not authorised, or segments the scanner cannot route to. Each skip ships with an explicit skip_reason field. Coverage is measured against the entire catalogue, not against the subset the engineer felt comfortable automating.
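One way to make "no silent skips" enforceable is to reject any outcome record that neither executed nor carries a reason. The sketch below is an illustration under assumptions: ScanOutcome, SkipReason, and skip_summary are hypothetical names, and only the customer_no_credential_grant value appears elsewhere in this article.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SkipReason(Enum):
    # Hypothetical values mirroring the three cases named in the policy.
    INTERNAL_ONLY_SERVICE = "internal_only_service"
    CUSTOMER_NO_CREDENTIAL_GRANT = "customer_no_credential_grant"
    UNROUTABLE_SEGMENT = "unroutable_segment"


@dataclass(frozen=True)
class ScanOutcome:
    test_id: str
    executed: bool
    skip_reason: Optional[SkipReason] = None

    def __post_init__(self) -> None:
        # Exactly one must hold: the test ran, or it has an explicit reason.
        if self.executed == (self.skip_reason is not None):
            raise ValueError(
                f"{self.test_id}: a test must either execute or carry a skip_reason"
            )


def skip_summary(outcomes: list[ScanOutcome]) -> Counter:
    """Count skipped tests per reason, e.g. for a dashboard coverage view."""
    return Counter(
        o.skip_reason.value for o in outcomes if o.skip_reason is not None
    )
```

With a record like this, a skip without a reason fails at construction time instead of disappearing from the coverage numbers.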

The policy looks obvious on paper. The engineering work to honour it took six months, twenty-two engineers across four squads, and the worst kind of regression triage: tests that had been silently broken for years because nobody was running them. The remainder of this article is what we learned during that work.

3. The four classes of "cannot automate" we had to break

We catalogued every test in the manual bucket against the actual reason it was sitting there. Four classes accounted for ninety-six percent of the bucket.

Class 1, "Needs credentials I don't have"

These were tests that required authenticated execution: admin panels, internal APIs, post-login workflows. The original argument was that we did not have the credentials, so the test could not run. The unstated assumption was that the customer would never give us the credentials.

The unstated assumption was wrong. Customers routinely provision us with scoped credentials for internal scanning when we ask. The reason we did not have them was that we did not ask. We added a credential-discovery step to onboarding (the customer fills in a single form with the in-scope authentication contexts they want us to test) and refactored ninety percent of the bucket out within four weeks.

The tests that still cannot get credentials, usually because the credentialled surface is in a security boundary the customer is unwilling to extend, now report skip_reason: customer_no_credential_grant with an explicit follow-up ticket back to the customer's account team. The customer sees the gap in the dashboard. They choose to close it or accept it. The decision is made with evidence in front of it.
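The run-or-skip decision for authenticated tests can be sketched as a simple planning step. Everything here is an assumption for illustration (AuthTest, required_context, the context names, and the follow_up marker); only the skip_reason value comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthTest:
    test_id: str
    required_context: str  # hypothetical label, e.g. "admin_panel"


def plan_authenticated_tests(
    tests: list[AuthTest], granted_contexts: set[str]
) -> list[dict]:
    """Run every test whose context the customer granted; flag the rest."""
    plan = []
    for test in tests:
        if test.required_context in granted_contexts:
            plan.append({"test_id": test.test_id, "action": "run"})
        else:
            plan.append(
                {
                    "test_id": test.test_id,
                    "action": "skip",
                    "skip_reason": "customer_no_credential_grant",
                    # Hypothetical marker for the follow-up to the account team.
                    "follow_up": "account_team_ticket",
                }
            )
    return plan
```

The point of the sketch is that the skip branch produces a visible record rather than dropping the test from the plan.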

Class 2, "Generates noise the customer's SOC will complain about"

These were tests that produced fuzzing-class traffic: dirbusters, parameter sprayers, payload brute-forcers. The original argument was that the customer's SOC would alert on the scan and complain. The unstated assumption was that the alerts were unactionable noise.

The unstated assumption was wrong. The customer's SOC alerting on our scanner is itself a finding: either the SOC is alerting on benign traffic patterns we should have allowlisted, or the SOC has missed the original onboarding-allowlist email and is treating our IP space as hostile. Both are operational defects that customer security teams have asked us to surface explicitly. We added a scan_acknowledged_by_soc field to every engagement and surface SOC-side false positives back to the customer's IR lead within four hours of the scan. The noise stopped being a problem the day we started measuring it.
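A rough sketch of how the SOC follow-up could be derived from an engagement record. Apart from scan_acknowledged_by_soc and the four-hour window, every name is hypothetical, and the mapping from acknowledgement state to likely cause is an assumption made for the example rather than a rule stated above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

SURFACE_WITHIN = timedelta(hours=4)  # window named in the text


@dataclass(frozen=True)
class Engagement:
    engagement_id: str
    scan_finished_at: datetime
    scan_acknowledged_by_soc: bool
    soc_alerted_on_scan: bool  # hypothetical: did the SOC raise alerts on us?


def soc_followup(engagement: Engagement) -> Optional[dict]:
    """Return a follow-up item for the IR lead, or None if the SOC stayed quiet."""
    if not engagement.soc_alerted_on_scan:
        return None
    # Assumed heuristic: an SOC that acknowledged the scan but still alerted
    # points at an allowlisting gap; one that never acknowledged it probably
    # missed the onboarding allowlist notice.
    if engagement.scan_acknowledged_by_soc:
        cause = "benign_pattern_not_allowlisted"
    else:
        cause = "onboarding_allowlist_missed"
    return {
        "engagement_id": engagement.engagement_id,
        "likely_cause": cause,
        "notify_ir_lead_by": engagement.scan_finished_at + SURFACE_WITHIN,
    }
```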

Class 3, "Takes too long to fit in a scan window"

These were slow tests: long-poll timing oracles, multi-stage chain probes, cache-population probes that need to wait for an eviction interval. The original argument was that the scan-window budget could not accommodate them. The unstated assumption was that the scan window was a fixed resource.

The unstated assumption was wrong. We re-architected the worker pool to support long-running tests on a separate queue, parallel to the fast-test queue. A slow test now runs to completion in its own worker without blocking the fast-test sweep. The slow-test results land in the same dashboard. The customer never sees the queue distinction.
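The two-queue idea can be illustrated with two independent thread pools whose results merge into one mapping. This is a simplified sketch, not the worker-pool design itself: ScanTest, the slow flag, and the pool sizes are assumptions.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScanTest:
    test_id: str
    slow: bool  # hypothetical flag declared by the test author
    run: Callable[[], str]


def run_scan(
    tests: list[ScanTest], fast_workers: int = 8, slow_workers: int = 2
) -> dict[str, str]:
    """Run fast and slow tests on separate pools; return one merged result set."""
    with ThreadPoolExecutor(max_workers=fast_workers) as fast_pool, \
            ThreadPoolExecutor(max_workers=slow_workers) as slow_pool:
        futures: dict[str, Future] = {}
        for test in tests:
            # A slow test occupies a slow-pool worker only, so the fast sweep
            # never waits behind it.
            pool = slow_pool if test.slow else fast_pool
            futures[test.test_id] = pool.submit(test.run)
        # The caller sees a single mapping, with no trace of the queue split.
        return {test_id: f.result() for test_id, f in futures.items()}
```

A production scheduler would more likely use separate processes or a distributed queue; threads are used here only to keep the example self-contained.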

The hard engineering was the bounded-budget logic: a slow test cannot be allowed to run forever. We adopted a per-test budget contract: every test declares its expected duration and a hard ceiling. The scheduler enforces the ceiling and records partial: true, reason: timeout, last_observation: <whatever the test had collected up to the cutoff> when the ceiling fires. The partial result is a finding too: the customer learns that the test could not complete, and the test owner gets a backlog ticket to investigate why.
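The budget contract can be sketched with a test modelled as an iterator of observations and a scheduler loop that checks the clock between steps. Budget and run_with_budget are hypothetical names; the partial, reason, and last_observation keys follow the record described above. The check is cooperative, so a single step that blocks indefinitely would still need a process-level kill, which this sketch does not cover.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Iterator, Optional


@dataclass(frozen=True)
class Budget:
    expected_seconds: float
    ceiling_seconds: float

    def __post_init__(self) -> None:
        if self.ceiling_seconds < self.expected_seconds:
            raise ValueError("ceiling must not be below the expected duration")


def run_with_budget(
    steps: Iterator[Any],
    budget: Budget,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Drive a test that yields observations; stop once the ceiling is hit."""
    started = clock()
    last_observation: Optional[Any] = None
    for observation in steps:
        last_observation = observation
        # Cooperative check: fires only between observations.
        if clock() - started >= budget.ceiling_seconds:
            return {
                "partial": True,
                "reason": "timeout",
                "last_observation": last_observation,
            }
    return {"partial": False, "reason": None, "last_observation": last_observation}
```

Injecting the clock keeps the ceiling logic testable without real waiting.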

A hard-won internal rule applies here: we never disable a test because it timed out on a few customers. The right response is to investigate, instrument, and improve, not to remove. The same test may still produce findings for other customers.

Class 4, "Requires human interpretation of the result"

These were tests that produced ambiguous outputs: heuristics that needed a human to look at a payload-trigger response and decide whether it counted. The original argument was that automation would either ship false positives or miss real findings. The unstated assumption was that the human's interpretation was reliable.

The unstated assumption was wrong. We sampled the historical manual-interpretation outcomes against the underlying evidence and found a 32 percent inter-analyst disagreement rate on the same payload-response pair. The humans were not consistent enough to be the ground truth. We migrated these tests to an AI triage step: the same payload-response pair is evaluated against a Test Capsule decision rule with an explicit confidence score. Below-threshold results are routed to a human reviewer who works from a structured rubric, and the measured reviewer agreement rate has held above 91 percent across the last quarter.
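Two pieces of this step lend themselves to a short sketch: routing on a confidence threshold, and measuring agreement between two reviewers on the same items. The names and the 0.85 threshold are assumptions for illustration; the article does not state the actual threshold or how the Test Capsule rule is evaluated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageDecision:
    verdict: str       # e.g. "finding" or "no_finding"
    confidence: float  # 0.0 to 1.0, produced by the automated triage step


def route(decision: TriageDecision, threshold: float = 0.85) -> str:
    """Accept confident verdicts automatically; send the rest to a human."""
    if not 0.0 <= decision.confidence <= 1.0:
        raise ValueError("confidence must be within [0, 1]")
    return "auto" if decision.confidence >= threshold else "human_review"


def agreement_rate(verdict_pairs: list[tuple[str, str]]) -> float:
    """Share of items on which two reviewers reached the same verdict."""
    if not verdict_pairs:
        raise ValueError("need at least one reviewed item")
    agreed = sum(1 for first, second in verdict_pairs if first == second)
    return agreed / len(verdict_pairs)
```

The disagreement rate is simply one minus the agreement rate, so the same function covers both the historical sample and the ongoing rubric measurement.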

The rubric is published. Every reviewer follows it. The reviewer's decision is logged into the Test Capsule. The decision is auditable. The decision is also reversible: a customer who disagrees with a finding can request a second-reviewer audit through their portal, and we re-run the rubric with a different reviewer pair.

4. The numbers after six months

The opening bucket size was 3,429 tests. The closing bucket size, as of the most recent catalogue revision, is 412. The remaining 412 fall into the legitimate INTERNAL-INACCESSIBLE category: they cannot run because the customer's surface does not expose the relevant endpoint, full stop. Each ships with an explicit skip_reason. Each is visible in the dashboard's coverage view as a gap the customer can close if they want to.

The catalogue itself grew from 5,305 tests to 6,280 over the same period as new attack research landed. The coverage rate (tests running against eligible assets divided by tests applicable to the asset's fingerprint) climbed from 64.7 percent to 96.9 percent. The number of findings climbed too, which was the entire point: the bugs were always there.
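For a single asset, the coverage rate defined above reduces to a set calculation. The function name and the example test identifiers are hypothetical; how per-asset rates are aggregated into the headline figure is not specified here, so the sketch stops at one asset.

```python
def coverage_rate(applicable: set[str], executed: set[str]) -> float:
    """Executed tests divided by tests applicable to the asset's fingerprint."""
    if not applicable:
        raise ValueError("no applicable tests for this asset fingerprint")
    # Only applicable tests count; stray executions do not inflate coverage.
    return len(applicable & executed) / len(applicable)


# Illustrative identifiers only: four tests apply, three of them ran.
applicable_tests = {"sqli-01", "xss-07", "ssrf-03", "timing-12"}
executed_tests = {"sqli-01", "xss-07", "ssrf-03"}
rate = coverage_rate(applicable_tests, executed_tests)
```

Intersecting with the applicable set keeps the denominator honest: a test that ran but does not match the fingerprint cannot raise the number.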

Customer reaction was overwhelmingly positive once we explained the change. A few asked us to throttle the new coverage to avoid alert fatigue; we honoured the request by surfacing the findings inside a "newly-covered tests" bucket on the dashboard for two scan cycles before promoting them into the main queue. After two cycles, the bucket emptied and the findings landed in the regular triage flow.
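The two-cycle holding period can be sketched as a small state step applied once per scan cycle. StagedFinding and advance_cycle are hypothetical names, and counting a cycle at the moment the step runs is an assumption about how the hold is measured.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StagedFinding:
    finding_id: str
    cycles_staged: int = 0


def advance_cycle(
    staged: list[StagedFinding], hold_cycles: int = 2
) -> tuple[list[StagedFinding], list[str]]:
    """Age each staged finding by one scan cycle and promote the ripe ones."""
    still_staged: list[StagedFinding] = []
    promoted: list[str] = []
    for finding in staged:
        aged = replace(finding, cycles_staged=finding.cycles_staged + 1)
        if aged.cycles_staged >= hold_cycles:
            promoted.append(aged.finding_id)  # moves to the regular triage flow
        else:
            still_staged.append(aged)
    return still_staged, promoted
```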

5. The lesson for any scanner architecture

If your scanner has a manual bucket, the bucket is your coverage gap. Treat the size of the bucket as a metric and publish it. Treat the rate at which the bucket shrinks as a roadmap. Treat any new test that lands in the bucket as a regression: the question is not "can a human run this manually?" but "what stops us from running this automatically on every scan?"
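Treating the bucket as a metric can be as simple as diffing its membership between catalogue revisions. The sketch below assumes the bucket is available as a set of test identifiers; bucket_report and its keys are hypothetical names.

```python
def bucket_report(previous: set[str], current: set[str]) -> dict:
    """Compare the manual bucket across two catalogue revisions."""
    return {
        "bucket_size": len(current),
        # Tests that newly landed in the bucket: treat each as a regression.
        "regressions": sorted(current - previous),
        # Tests that left the bucket since the last revision.
        "automated_since_last": sorted(previous - current),
    }


def gate(previous: set[str], current: set[str]) -> bool:
    """Return True only if no new test entered the bucket."""
    return not bucket_report(previous, current)["regressions"]
```

A check like gate could run in CI on every catalogue change, so a new manual-only test has to be justified at review time rather than discovered later.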

The answer is almost always one of the four classes above. The fix is almost always engineering rather than process. The work is not glamorous; it is auth-flow plumbing, queue-architecture work, AI-rubric tuning, and customer communication. The outcome is a scanner whose dashboard reflects the surface as it actually is, not as the bucket-keeper wanted it to be.

Bad actors do not respect the bucket. They run every test, on every surface, every time. The only scanner architecture that competes is one that does the same.

Verifiable security.