Detection Windows: How to Measure (and Close) the Hours Your Defenders Are Asleep

1. Coverage % is a snapshot. Coverage WHEN is the truth.

Every vendor pitch deck has the same chart now: a heatmap of MITRE ATT&CK techniques, green for "covered," yellow for "partial," red for "not covered," and a percentage at the bottom. 67%. 71%. 84%. The number is offered as if it answers the question "are we safe?"

It doesn't. It answers a different question: "given infinite observation time and a stationary attacker, would my detection stack eventually fire on this technique?" Neither condition holds in production. Classes have cadences. Rules reload. Feeds go stale. Hunts run on schedules. The coverage matrix is a snapshot taken at the moment everything happens to be running -- a moment that may represent 6% of the day. The other 94%, you operate under a partial matrix no one has drawn.

This post is about the time domain: the six detection classes we audit, why a compound blackout (three or more classes simultaneously blind) gives an attacker a quasi-perfect operational window, the sweep-line algorithm we use to compute those windows in O(N log N), and the twelve ENDPOINT-WINDOW-NNN probes we shipped this week. Coverage % is a trailing indicator unless you also publish coverage when.

2. Six detection classes, six sleep schedules

The Detection Window Auditor models defender telemetry as six coarse classes. Each class has its own cadence, its own blackout shape, and its own attacker exposure profile. The taxonomy is deliberately coarse -- collapsing dozens of products into six buckets is what makes timeline arithmetic tractable.

FIM (file integrity monitoring). Wazuh <syscheck>, AIDE, Samhain, OSSEC. Default cadences range from 12 hours (Wazuh out-of-the-box) to 24-48 hours when operators stagger sensitive trees to manage CPU. Realtime/inotify mode is opt-in per directory; most customers leave content roots in scheduled mode. An adversary who has fingerprinted the cadence -- trivial via packet timing on a single rented host whose schedule mirrors the customer's -- writes modifications inside the longest blind window and the next FIM cycle records only the post-modification state as "normal." This is the persistence sweet spot.

Syscall (eBPF / auditd / kAudit / driver-based). Falco, auditd, Tetragon, commercial EDR. Continuous in steady state, but every reload introduces a tear-down/rebuild blackout: 2-5 seconds for a Falco SIGHUP, 50-300 ms for a 200-rule auditd auditctl -R. Boot-up is its own blackout: if falco.service is not ordered Before=ssh.service, an attacker connecting in the boot-up interval gets 5-30 seconds of syscall blindness. Short, sharp, and triggerable -- operators reload daemons on every config-management convergence.

Vuln-feed. NVD-anchored feeds plus vendor-specific channels (Wazuh CTI, RHSA, ALAS, MSU). The detection window for a CVE is bounded below by (NVD-publish to feed-fetch) + (feed-fetch to scanner-run). Wazuh's vuln-detector defaults to a 1-hour fetch interval; many deployments raise this to 24 hours and never look again. Long and asymmetric -- weaponisation of new CVEs is measured in hours, but a stale feed produces a 48-hour-plus invisibility window covering an entire patch-Tuesday.

Cloud. CloudTrail, Azure Activity Logs, GCP Pub/Sub via Wazuh's wodle-aws / wodle-azure-logs / wodle-gcp-pubsub. Total observed lag is CloudTrail S3 delivery (5-minute SLA, 15-minute p99) plus wodle poll interval (10 minutes default) plus decoder pipeline. p99 cloud detection lag is frequently 25-30 minutes. Large enough that "attacker on keyboard" scenarios -- assumed-role abuse, bucket policy escalation, IAM persistence -- complete before the first telemetry record reaches the SIEM.

Hunt. Velociraptor hunts and GRR Foreman dispatches. Pull-driven, schedule-bound. GRR Foreman's client_min_poll_interval defaults to 600 seconds; hunt frequencies of 24 hours are common for forensic baselines. Velociraptor's deepest footgun is the event vs client/server artefact distinction: only event-class artefacts stream telemetry continuously. A deployment with 18% event coverage gates 82% of its detection on someone starting a hunt. The longest sleep schedule in the stack.

IDS. Suricata, Snort, Zeek, network-flow inspection. Continuous in principle, but ruleset reloads, ring-buffer drops under load, and protocol-decoder gaps (QUIC, encrypted SNI, HTTP/3) create class-wide blackouts. F2B's pyinotify backend racing with logrotate is the canonical small-window example: the watch on the rotated logfile takes 0.5-2 seconds to re-establish, and credential-stuffing bursts inside never trigger a ban.

3. The compound-blackout problem

Single-class blackouts are bad. Compound blackouts -- windows where three or more classes are simultaneously blind -- are the failure mode that lets an attacker run the full kill chain end to end without ever crossing a sensor.

Walk through a synthetic 24-hour timeline on a typical Linux endpoint stack: Wazuh manager + agent (FIM at 12-hour cadence on /etc, 24-hour on /usr, vuln-detector at 24-hour fetch, SCA at 12-hour), Falco modern-bpf service, auditd with 180 rules, Velociraptor with 12% event coverage and 24-hour forensic hunts, GRR Foreman at 600 seconds, F2B on the pyinotify backend, Wazuh wodle-aws at 10-minute polls, daily logrotate at 04:00.

The day looks like this. At 03:59:55 Falco receives its weekly SIGHUP from a Helm-driven rule sync; the rule engine is rebuilding for 3.2 seconds. At 04:00:00 logrotate runs against /var/log/auth.log; F2B's pyinotify backend loses its watch on the rotated inode for ~1.4 seconds. At 04:00:00 the FIM scheduled scan for /etc completes and the next one is queued for 16:00, opening a 12-hour FIM blackout on the highest-value tree. At 04:00:01 auditd is reloaded by the nightly Ansible converge -- the 180-rule reload takes 240 ms, during which no syscall is audited. That single second -- 04:00:00 to 04:00:01 -- has Falco syscall blind, F2B IDS blind, and FIM scheduled-cycle blind. Three classes. Severity: high.

It gets worse. The Wazuh vuln-detector's last successful NVD fetch was at 02:14 the previous day, 25 hours and 46 minutes ago -- vuln-class is blind to anything published after that. The wodle-aws last polled CloudTrail at 03:50; the next poll is 04:00, and the attacker's IAM persistence call landed in the bucket at 03:51. CloudTrail's S3 delivery added 11 minutes; the event will not be in Wazuh's pipeline until 04:00 + delivery lag, call it 04:11. From 03:51 to 04:11, cloud-class is blind to that specific call. GRR's last forensic hunt fired at 04:00 yesterday; the next is 04:00 today, after the rotation event. Hunt-class has been blind for 23 hours and 59 minutes when the attacker drops a scheduled task at 03:59:58.

Stack the intervals. From 03:59:55 to 04:00:01 the customer is simultaneously blind in syscall (Falco reload), IDS (F2B inotify race), FIM (cycle boundary), vuln (stale feed), cloud (between polls), and hunt (between forensic hunts). Six of six. For six seconds. An attacker whose tooling is timed to that window -- and timing to that window is trivial because every component above is observable from a single rented neighbour-host -- gets a quasi-perfect operational moment. They can drop a payload, exec it, write a persistence artifact, exfil, and clean up, and the only telemetry that ever fires is the one that was going to fire 23 hours later anyway.

This is not an exotic scenario. The cadence stagger on /etc versus /usr is documented in the Wazuh <syscheck> defaults. The 04:00 logrotate window is a Linux convention. The Foreman 600-second poll is the GRR upstream default. The Falco SIGHUP rebuild duration is in the Falcosecurity issue tracker (#2890) and observed empirically in the ARMO "Bad io_uring" research. The wodle-aws 10-minute interval is the Wazuh-published default. The compound blackout is what you get when a customer accepts every default and never measures the time domain.

What makes it pernicious is that no stock dashboard surfaces the intersection. Wazuh shows the last vuln-detector fetch; Falco's /metrics shows the last reload duration; GRR shows hunt schedules. None show "at 04:00 local, six classes were blind simultaneously for six seconds." Compound blackouts are an emergent property; you only see them if you compute them.

4. How to compute it

The Detection Window Auditor (DWA) engine is a pure in-memory analyser -- no I/O, no implicit datetime.now() calls so the unit tests are deterministic. Callers feed it WindowEvent records describing intervals during which each detection class was active, and it answers two questions: where are the per-class gaps, and where do the gaps overlap.

Per-class blackouts are straightforward. For each class, clamp every interval into the trailing horizon, merge overlapping/adjacent intervals, and walk the merged list emitting any positive gap as a blackout. Each blackout is annotated with the canonical MITRE techniques for that class so the report renders "during this 47-minute FIM blackout T1547.001/T1543.003/T1574.011/T1222 were undetectable" rather than a raw timestamp pair.

Compound blackouts are a textbook sweep-line. Each per-class blackout emits a +1 event at its start and a -1 event at its end; the combined tape is sorted with end-events ordered before start-events at an equal timestamp so back-to-back blackouts merge cleanly. A left-to-right sweep tracks the active blind set; the engine emits a CompoundBlackout whenever the active set transitions across the configurable min_classes threshold, accumulating a witness set of every class that was blind at any point during the window. Severity buckets fall out of the duration on industry-standard alarm boundaries: > 600s is critical, > 60s is high, > 5s is medium, otherwise low.

The whole pipeline is O(N log N) on the total event count and runs in microseconds on a one-day horizon with twenty events per class across six classes. The discipline that matters more than the constant factor is that every horizon edge is supplied by the caller -- the engine never reads the wall clock -- which makes the unit tests boring. That is always the goal.

5. The 12 tests we just shipped

The probes that feed the engine ship in the ENDPOINT-WINDOW-001..012 catalog family. All twelve are config-introspection probes -- we never SIGHUP a customer's daemon, never trigger a logrotate, never run auditctl against a live host. We read declarative state and compute timing properties.

ENDPOINT-WINDOW-001 (falco-reload-race-probe). Reads falco.metrics.reload_duration_seconds from /metrics if exposed and the rule-file inventory from the daemon config; computes the worst-case 2-5 second syscall blackout per SIGHUP. T1562.001.

ENDPOINT-WINDOW-002 (schedule-stagger-cartographer). Highest-leverage probe in the family. Maps every directory under <syscheck> in ossec.conf to its scan interval and computes the longest contiguous 24-hour window in which no directory is scanning. 4-hour-plus gaps are common in customer configs. T1070.004.

ENDPOINT-WINDOW-003 (grr-hunt-cadence-inferer). Sums GRR Foreman client_min_poll_interval plus configured hunt frequencies to bound MTTD for forensic-grade artefacts (registry persistence, scheduled tasks, WMI subscriptions). 23-hour persistence windows are typical.

ENDPOINT-WINDOW-004 (vr-event-vs-hunt-coverage). Computes the ratio of Velociraptor artefacts that run as event (continuous) vs client/server (hunt-driven). Deployments below 20% event coverage of high-value artefacts have a continuous-monitoring blindspot for the other 80%.

ENDPOINT-WINDOW-005 (vuln-feed-freshness-probe). Single Wazuh manager-API call. Per-feed last_update older than 48 hours is a hard FAIL -- 48 hours is an entire patch-Tuesday cycle of invisible CVEs.

ENDPOINT-WINDOW-006 (nvd-to-wazuh-lag-histogram). Samples N recent CVEs and computes p50/p95/p99 lag between NVD publish and Wazuh first-match. p95 > 7 days is chronic; p99 > 30 days means tail-CVE detection trails attacker weaponisation by an order of magnitude.

ENDPOINT-WINDOW-007 (auditd-reload-window-sizer). Reads the rule count from /etc/audit/audit.rules and bounds the reload blackout at ~1-3 ms per rule. 200 rules is a 200-600 ms window; a fork+exec+rm chain fits inside it.

ENDPOINT-WINDOW-008 (cloud-wodle-poll-latency-map). Sums CloudTrail S3 delivery SLA (5-min advertised, 15-min p99) plus wodle-aws <interval> plus decoder pipeline. p99 is routinely 25-30 minutes -- the lower bound on cloud MTTD.

ENDPOINT-WINDOW-009 (cross-cloud-wodle-coverage-gap). Diffs configured wodle regions against canonical AWS/Azure/GCP region lists. Critical severity because the blind state is total per region, not a window.

ENDPOINT-WINDOW-010 (f2b-logrotate-inotify-race). Detects the pyinotify backend + create/missing-postrotate combination that produces the 0.5-2 second F2B IDS blackout at logrotate time.

ENDPOINT-WINDOW-011 (ebpf-probe-boot-race-detector). Single-grep audit on /etc/systemd/system/falco.service[.d/] for Before=ssh.service docker.service containerd.service kubelet.service. Boot-time SSH brute-forcers get a 5-30 second free window when this is missing.

ENDPOINT-WINDOW-012 (wazuh-sca-periodicity-auditor). Combines with -002 to surface the FIM-plus-SCA compound blackout. Configuration-drift attacks (sysctl, sudoers, /etc/security/*) are invisible to both control classes simultaneously when intervals overlap.

All twelve are tier-3 (Enterprise), Purple-team owned, gated on internal-access because they need either the Wazuh manager API or a host-side config snapshot. Each emits a structured evidence record with the measured window duration and the TTP set the blackout exposes, so the report engine can render the timeline directly without bespoke per-test glue.

6. What to do once you can see the gaps

Once the auditor surfaces a compound blackout, the remediation plays are concrete and cheap. None of them require a new tool; all of them require treating cadence as a first-class part of the detection budget.

Stagger schedules onto a finer grid. The single highest-impact change is converting /etc, /usr/bin, /usr/sbin, /root/.ssh, /home/*/.ssh to realtime="yes" in <syscheck> so inotify catches changes within milliseconds. For trees that genuinely need scheduled mode (large content roots), put them on a 1-hour grid with a 4-hour hard cap on the longest contiguous gap. Wazuh SCA on a 12-hour interval; FIM on a 12-hour interval; offset SCA to even hours and FIM to odd hours so the compound blind window collapses.

Increase scrape frequency on stale-feed classes. Drop <update_interval> to one hour for every vuln-detector feed; alert on any feed older than six hours from Wazuh's own monitoring. Drop wodle-aws <interval> to 60 seconds for production accounts; better still, route CloudTrail Lake or EventBridge to Wazuh via a webhook decoder for sub-minute API capture. Drop the GRR Foreman client_min_poll_interval to 60 seconds for high-value hosts; run a "continuous baseline" hunt at 1-hour cadence covering registry persistence, scheduled tasks, WMI subscriptions, autostart locations.

Deploy a continuous-detection fallback for the audited gaps. This is the bigger lever. eBPF (Tetragon, Pixie, custom CO-RE programs) gives you class-wide syscall coverage that does not vanish on SIGHUP; landing it as a sibling to Falco -- not a replacement -- means the syscall class never goes blind during reloads. kAudit alongside auditd handles the userspace reload window. CloudTrail Lake or EventBridge anchored to a webhook decoder handles the cloud poll lag. Velociraptor event-class artefacts (Generic.Detection.Yara.Process, Windows.Events.ProcessCreation, Linux.Events.SSHAuthorizedKeys) push the event-coverage ratio above 60% so hunt cadence stops being the floor on MTTD. The compound blackout collapses from "six classes for six seconds" to "one class for 200 ms" -- a different shape of risk entirely.

Publish the time-domain gap as a number. The compound-blackout figure -- minutes-per-day where min_classes >= 3 are blind -- is the metric that should appear next to your ATT&CK coverage percentage on the SOC dashboard. If you only ever fix what you measure, measure when.

7. Closing

The MITRE coverage matrix is a useful artefact, but it is a steady-state model in a non-steady-state world. Defenders sleep on schedules. Reload windows happen. Feeds get stale. Hunts run when humans push the button. Coverage % is the right answer to the wrong question; coverage when is the question your incident commander actually has when the pager goes off at 04:00 -- the same 04:00 window where logrotate runs and Falco rebuilds and the Foreman polls and the vuln-feed cron lapsed yesterday.

The Detection Window Auditor is our take on making that question first-class: an engine, twelve probes, a compound-blackout metric you can put on a dashboard. If your stack has never measured its own time domain, it has a 04:00 problem and doesn't know it. Now you can find out.


References