Stop Trusting Vendor MITRE Coverage Claims — Measure It Yourself

1. Every vendor claims 95-percent MITRE coverage

Walk any vendor floor at RSA or Black Hat and you will hear the same number, with the same plus-or-minus rounding error. Ninety-three percent ATT&CK coverage. Ninety-five percent. Ninety-eight percent on Windows endpoints, slightly less on Linux. The figure is repeated in pitch decks, datasheets, analyst briefings, and a striking number of compliance attestations. It is rarely accompanied by the unit of measurement.

This is the dirty secret of MITRE coverage marketing. The number is almost always derived by taking the technique IDs that appear anywhere in the vendor's rule catalogue — including disabled rules, rules that have been deprecated since 2021, rules that fire only on a specific operating system version the customer does not run, and rules that have been silently shadow-overridden by the customer's own tuning — and dividing by the technique count of an arbitrarily-scoped ATT&CK matrix. If the matrix is "Enterprise minus Mobile minus PRE", the denominator drops by roughly a third and the percentage jumps. If the rule pack happens to mention T1059 once in a comment, that counts as coverage of T1059 and all its sub-techniques. The methodology is technique-name matching, not detection.
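The denominator effect is easy to reproduce. The sketch below is a toy illustration with made-up technique sets; the function name and the data are assumptions for this example, not any vendor's actual method. It counts a technique as covered if its ID is mentioned anywhere, then shows how the same catalogue scores higher once part of the matrix is scoped out.

```python
def naive_claimed_coverage(mentioned_ids, matrix_ids):
    """Fraction of matrix techniques whose ID is merely mentioned in the rule catalogue."""
    matrix = set(matrix_ids)
    return len(set(mentioned_ids) & matrix) / len(matrix)


# Toy data, for illustration only.
full_matrix = {"T1190", "T1059", "T1078", "T1486", "T1595", "T1583"}
scoped_out = {"T1595", "T1583"}          # pre-compromise techniques dropped from the denominator
mentioned = {"T1190", "T1059", "T1078"}  # IDs found anywhere, including comments and disabled rules

print(naive_claimed_coverage(mentioned, full_matrix))               # 0.5
print(naive_claimed_coverage(mentioned, full_matrix - scoped_out))  # 0.75
```

Nothing about the rules changed between the two calls. Only the scope of the matrix did.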

Detection engineering leads know this. CISOs frequently do not. The MITRE Engenuity ATT&CK Evaluations have been quietly publishing measured side-by-side detection results across the same vendor cohort since 2018, and the gap between vendor-claimed coverage and Engenuity-measured detection has consistently sat at the 2x-to-5x range across rounds 3, 4, 5, and 6. A vendor advertising 95 percent of Enterprise will routinely score in the 30-to-50-percent measured-detection band on the same TTPs they claim, when an experienced red team actually fires the technique.

We built a scanner architecture that closes this gap on the customer side, without waiting for an Engenuity round and without trusting the vendor's self-report. It is called the Stack Coverage Auditor (SCA). The methodology is simple to state and harder to game: fingerprint the installed defenders, parse the customer-mirrored rule packs to derive the claimed set of techniques, fire 27 canary-grade probes to derive the measured set, and report the complement, ranked by in-the-wild prevalence. This article walks the architecture and the methodology in enough detail that a detection engineering team can replicate it in-house if they prefer.

2. Three MITRE matrices, one stack

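The first methodological choice is the denominator. ATT&CK is not one matrix. For coverage purposes we treat it as three operationally distinct matrices with overlapping but non-identical technique sets: Enterprise, Containers, and Cloud. (MITRE publishes Containers and Cloud as platform matrices within the Enterprise domain; we separate them because the attack paths, and the defenders able to observe them, differ.) Most defender stacks were architected against Enterprise alone, because Enterprise is the oldest and the largest, and because the marketing department understandably prefers a single percentage.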

Our engine carries three static, curated technique catalogues (Enterprise, Containers, and Cloud) representing the high-priority, commonly exploited subset rather than the full ATT&CK matrix. The Enterprise catalogue is the largest by an order of magnitude; the Containers and Cloud catalogues are smaller, focused, and deliberately scoped to the techniques that distinguish container and cloud-native attack paths from on-host enterprise attack paths. The lists are deliberately not exhaustive: they are drawn from Mandiant M-Trends 2025, the CISA Known Exploited Vulnerabilities catalogue cross-mapped to techniques, and the Engenuity round 5 and round 6 emulation plans. The denominator is stable on purpose. Coverage scoring depends on it.

The set-union catalogue, deduplicated, is what determines the score. The engine preserves provenance through iteration order, so a technique that appears in both Enterprise and Containers (T1190 is the canonical example) is tagged with the broader catalogue but the narrower context survives in the report. The customer report says "you are blind to T1190 -- and your container runtime is also blind, separately."
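A minimal sketch of a deduplicated union that keeps provenance through iteration order. The function name, the data shape, and the toy catalogues are assumptions for illustration; the curated lists themselves are much larger.

```python
def build_union(catalogues):
    """Merge catalogues into {technique_id: [catalogue names in iteration order]}.

    `catalogues` is a list of (name, technique_ids) pairs, broadest first, so the
    first name recorded for a technique is its primary tag and the remaining
    names are the narrower contexts that should still appear in the report.
    """
    union = {}
    for name, technique_ids in catalogues:
        for tid in technique_ids:
            contexts = union.setdefault(tid, [])
            if name not in contexts:
                contexts.append(name)
    return union


# Toy catalogues, for illustration only.
catalogues = [
    ("Enterprise", ["T1190", "T1059", "T1078"]),
    ("Containers", ["T1190", "T1611", "T1610"]),
    ("Cloud", ["T1078", "T1078.004"]),
]
union = build_union(catalogues)
print(len(union))      # 6 unique techniques: the scoring denominator
print(union["T1190"])  # ['Enterprise', 'Containers']
```

Because Python dictionaries and lists preserve insertion order, listing the broadest catalogue first is enough to make it the primary tag without a separate priority field.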

Here is the operational point. If a vendor tells you their stack covers 95 percent of Enterprise, they have told you nothing about whether the same stack covers your Kubernetes API server, your AWS control plane, your container-escape paths, or your cloud-credential lifecycle. If you run k8s plus AWS — which is roughly 80 percent of the customer base we audit — the maximum possible coverage of your full attack surface is bounded by the three-matrix union, and a stack tuned exclusively for Enterprise techniques is structurally capped at the Enterprise share of that union. With our curated catalogues, that share lands in the mid-70s as a percentage. Apply a typical Engenuity round-5/round-6 measured-detection rate (around half of claimed) and the effective coverage of your three-matrix surface collapses into the high-30s. Before you tune anything. Before you account for shadow overrides or silenced alerts.
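The arithmetic behind that collapse, as a small sketch. The counts are illustrative stand-ins (75 of 100 for "mid-70s", 0.5 for "around half of claimed"), and the function name is hypothetical. It assumes the best case, in which the stack claims every Enterprise technique in the union.

```python
def effective_coverage(enterprise_count, union_count, measured_ratio):
    """Upper bound on three-matrix coverage for an Enterprise-only stack."""
    enterprise_share = enterprise_count / union_count  # structural cap
    return enterprise_share * measured_ratio           # discounted by measured detection


# Illustrative inputs, not measured data.
print(f"{effective_coverage(75, 100, 0.5):.1%}")  # 37.5%
```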

That is the baseline. The rest of this article starts from there.

3. In-the-wild prevalence is the truth

The second methodological choice is the weighting. A coverage score that treats T1190 (Exploit Public-Facing Application — rank 1) and T1053 (Scheduled Task/Job — rank 104) as equally valuable will mislead. They are not equally valuable. Adversaries do not select techniques uniformly from the matrix; they select from the long-tail-skewed distribution of what works in the wild, and the distribution is dominated by maybe twenty techniques that show up in roughly 80 percent of incident-response engagements.

Our engine carries that distribution as a static prevalence map of about a hundred ranked techniques organised into four tiers. Tier 1 is the ubiquitous-in-incident-response band, led by Exploit Public-Facing Application (T1190), Phishing variants (T1566.*), Valid Accounts (T1078, T1078.004), command-and-scripting interpreters (T1059, T1059.001/003/004), ransomware encryption (T1486), LSASS credential access (T1003.001), brute force and password spraying (T1110, T1110.003), and Web Shell deployment (T1505.003). Tier 2 is the daily band, tier 3 the weekly band, and tier 4 the monthly or cloud-and-container-native tail. The map is curated from Mandiant M-Trends 2025, the CISA KEV catalogue, and the MITRE Engenuity round-5 and round-6 result sets. Anything outside the map defaults to long-tail.
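One plausible shape for such a map, sketched under assumptions: the dictionary layout, the names, and every rank and tier value except T1190 at rank 1 and T1053 at rank 104 are placeholders, and placing rank 104 in tier 4 is itself a guess.

```python
LONG_TAIL = "long-tail"

# technique ID -> (rank, tier). Placeholder values except the T1190 and T1053 ranks.
PREVALENCE = {
    "T1190": (1, 1),
    "T1003.001": (7, 1),
    "T1053": (104, 4),
}


def tier_of(technique_id):
    """Return the prevalence tier, or the long-tail default for unmapped techniques."""
    entry = PREVALENCE.get(technique_id)
    return entry[1] if entry is not None else LONG_TAIL


print(tier_of("T1190"))  # 1
print(tier_of("T1053"))  # 4
print(tier_of("T1600"))  # long-tail
```

Keeping the default explicit means an unmapped technique never silently inherits a high-priority tier.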

The distribution is sharply skewed. The top fifteen techniques together represent more than half of the adversary activity our research team observed in the same window of CISA KEV additions and Mandiant case data. Ranks 1 through 40 represent roughly 80 percent. Ranks 41 through 104, the part the marketing percentage sweeps in, represent the remaining 20 percent, and most of it is detected coincidentally by tooling aimed at more prevalent techniques.

This is why our coverage report is not a single number. The dashboard tile reads, illustratively, "47 of N commonly-exploited TTPs covered" -- N is the deduplicated three-matrix union size, and the numerator is the count of techniques the customer's installed rule packs actually annotate. But the priority output is the blind matrix: the complement of the covered set, sorted by prevalence rank ascending so the most-exploited blind techniques surface first.
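The blind matrix is a set difference followed by a sort. A minimal sketch, assuming a rank dictionary keyed by technique ID; the function name and all rank values except T1190 and T1053 are placeholders.

```python
def blind_matrix(catalogue, covered, rank):
    """Techniques in the catalogue but not covered, most prevalent first.

    Unranked techniques sort last; ties break on technique ID so the
    output is deterministic.
    """
    blind = set(catalogue) - set(covered)
    return sorted(blind, key=lambda tid: (rank.get(tid, float("inf")), tid))


# Toy inputs, for illustration only.
catalogue = ["T1190", "T1059.001", "T1003.001", "T1486", "T1053", "T1611"]
covered = {"T1059.001", "T1053"}
rank = {"T1190": 1, "T1486": 5, "T1003.001": 6, "T1053": 104}

blind = blind_matrix(catalogue, covered, rank)
print(f"{len(set(catalogue) & covered)} of {len(set(catalogue))} commonly-exploited TTPs covered")
print(blind)  # ['T1190', 'T1486', 'T1003.001', 'T1611']
```

The first print is the dashboard tile; the second is the backlog, already in the order it should be worked.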

The customer-facing report does not say "you are 22-percent covered." It says "the five most-exploited techniques you are blind to are T1190 (rank 1), T1003.001, T1486, T1110.003, and T1505.003 -- here are the rule references in the upstream Falco, Wazuh, and CrowdSec hubs that you can enable, customise, or compensate for in the next two weeks." The number is for the executive summary. The ranked blind list is the work backlog.

This is the difference between vendor-claimed coverage and measured coverage that matters operationally. A defender stack with 60-percent claimed coverage that closes ranks 1 through 20 is dramatically more valuable than a stack with 90-percent claimed coverage where rank 1 is silently disabled. The number alone cannot tell you which is which. The blind matrix can.

4. The canary probe library

The third methodological choice is the active validation. A rule pack that annotates coverage of T1059.001 is not the same as a rule that fires on T1059.001. Static rule-pack parsing tells you what the operator intended. It does not tell you what the operator achieved.

The SCA engine ships a canary-grade probe library that closes this gap. Each entry maps a high-prevalence MITRE technique to the defender or defenders expected to detect it, plus a probe shape the active scanner family validates. We do not publish the full library mapping -- the per-technique probe shapes and expected-detector pairings are part of how we keep our active validation distinct from a public emulation plan -- but the methodology is what matters here.

A canary-grade probe is built around three properties. First, it is safe but distinctive: it fires a sentinel payload that the technique-class signature should match, but the payload itself causes no operational harm and includes a unique marker the SOC can correlate after the fact. A defender-tampering canary sends a controlled signal to a specifically-named decoy process whose only job is to be triggered; it does not actually disable the defender. An LSASS-class canary opens a handle with a sentinel access mask against a sacrificial process named to look like the target, never the real one. The point is to fire the signature, not to weaponise the technique.

Second, the probe carries an expected detection signature. This is the decisive scoring input. We do not score on whether the SOC noticed the probe; we score on whether the named upstream rule fired, with the named detector, inside an SLA window. If the customer is running Falco and a defender-tampering canary fires but the corresponding Falco rule does not log within the SLA, the rule is either disabled, shadow-overridden, or absent from the shipped pack. That is a measured blind spot, not a claimed one.
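The scoring rule can be sketched as a strict match on detector, rule name, probe marker, and time window. Everything named here (the `DetectorEvent` record, `probe_detected`, the rule names, the marker format, the 60-second SLA) is a hypothetical stand-in rather than the engine's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DetectorEvent:
    detector: str      # e.g. "falco"
    rule: str          # name of the rule that fired
    marker: str        # unique marker carried by the canary payload
    seen_at: datetime  # when the detector logged the event


def probe_detected(fired_at, marker, expected_detector, expected_rule, events, sla):
    """True only if the named rule on the named detector logged this marker inside the SLA."""
    deadline = fired_at + sla
    return any(
        e.detector == expected_detector
        and e.rule == expected_rule
        and e.marker == marker
        and fired_at <= e.seen_at <= deadline
        for e in events
    )


fired = datetime(2026, 1, 15, 12, 0, 0)
events = [
    # Right marker, wrong rule: something noticed, but the named rule did not fire.
    DetectorEvent("falco", "Some Other Rule", "canary-7f3a", fired + timedelta(seconds=5)),
    # Right rule, but outside the SLA window.
    DetectorEvent("falco", "Hypothetical Tamper Rule", "canary-7f3a", fired + timedelta(minutes=10)),
]
print(probe_detected(fired, "canary-7f3a", "falco", "Hypothetical Tamper Rule",
                     events, timedelta(seconds=60)))  # False
```

Both events in the example would look like a detection on a busy SOC dashboard. Neither counts, which is the point of scoring on the named rule rather than on noise.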

Third, the probe is replayable and evidentiary. Each canary writes a deterministic event into a per-engagement timeline so the customer's detection engineering team can rerun the probe after they tune, and produce a measured before-and-after delta. The Engenuity rounds get a single shot at this every twelve to eighteen months. Our customers get a probe campaign every quarter, against the in-the-wild top techniques, with their actual installed rule pack.
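A before-and-after delta needs nothing more than two campaign results keyed by technique. The sketch below assumes each campaign is reduced to a mapping of technique ID to detected-or-not; the function name and the result shape are illustrative.

```python
def campaign_delta(before, after):
    """Compare two probe campaigns given as {technique_id: detected?} mappings."""
    techniques = set(before) | set(after)
    return {
        "newly_detected": sorted(
            t for t in techniques if not before.get(t, False) and after.get(t, False)
        ),
        "regressed": sorted(
            t for t in techniques if before.get(t, False) and not after.get(t, False)
        ),
    }


# Toy results, for illustration only.
q1 = {"T1190": False, "T1003.001": False, "T1059.001": True}
q2 = {"T1190": True, "T1003.001": False, "T1059.001": False}
print(campaign_delta(q1, q2))
# {'newly_detected': ['T1190'], 'regressed': ['T1059.001']}
```

Reporting regressions alongside gains matters: tuning that closes one blind spot can silence a rule somewhere else.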

The library was chosen to span the full prevalence spectrum but to weight toward tier-1 and tier-2. Most probes cover techniques inside the top thirty by prevalence. A handful cover container-specific and cloud-adjacent techniques (T1611, T1610, T1486) because that is where the customer is structurally most likely to be blind. A separate slice covers defense-evasion and indicator-removal techniques (T1562.x, T1070.x, T1014, T1027, T1218, T1565.001) because those are the techniques an attacker uses against the defender stack itself, and they are the techniques where a measured-versus-claimed gap is the most operationally dangerous -- a stack that thinks it is detecting tool tampering but is not is worse than no stack at all.

5. The five SCA tests

The SCA architecture surfaces in the customer scanner as five tests, ENDPOINT-STACK-AUDIT-001 through 005. They are tier-3 enterprise gated: defender-stack auditing is a differentiator product, not a free-tier baseline.

ENDPOINT-STACK-AUDIT-001 is the headline. Composite probe that runs all three SCA phases — fingerprint, parse, score — and fails when effective coverage falls below the published gate threshold and at least one tier-1 technique is blind. The output is the dashboard tile and the ranked blind matrix. Severity is High. CWE-693 (Protection Mechanism Failure), CWE-778 (Insufficient Logging), CWE-1188 (Insecure Default Initialization).
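The fail condition is a conjunction, which a short sketch makes explicit. The function name is hypothetical and the 0.6 threshold is a placeholder, since the published gate value is not stated here.

```python
def stack_audit_001_fails(effective_coverage, gate_threshold, blind_techniques, tier_by_technique):
    """Fail only when coverage is below the gate AND a tier-1 technique is blind."""
    below_gate = effective_coverage < gate_threshold
    tier1_blind = any(tier_by_technique.get(t) == 1 for t in blind_techniques)
    return below_gate and tier1_blind


tiers = {"T1190": 1, "T1053": 4}  # placeholder tier assignments
print(stack_audit_001_fails(0.47, 0.6, ["T1190", "T1053"], tiers))  # True
print(stack_audit_001_fails(0.47, 0.6, ["T1053"], tiers))           # False: no tier-1 blind spot
print(stack_audit_001_fails(0.80, 0.6, ["T1190"], tiers))           # False: above the gate
```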

ENDPOINT-STACK-AUDIT-002 is the Falco shadow-override probe. Falco loads falco_rules.local.yaml after falco_rules.yaml, and the priority: and override: fields in the local file silently shadow upstream rules. The probe statically scans the customer-mirrored configuration for override: entries set to replace and cross-references the shadowed rule's MITRE annotation against the prevalence map. Fails when at least one shadow-override silences a top-20 tier-1 technique. This is a documented Falco operator footgun and maps cleanly to T1562.001 Disable or Modify Tools, except that it is the operator doing the disabling, accidentally, against themselves.
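A rough sketch of the cross-reference step, under several assumptions: the rule files have already been parsed from YAML into lists of dictionaries, an override is expressed as an `override:` mapping of field name to `replace` or `append`, and technique IDs appear as plain tags on the upstream rule. The rule names and the function are hypothetical.

```python
import re

TECHNIQUE_TAG = re.compile(r"^T\d{4}(?:\.\d{3})?$")


def shadowed_techniques(upstream_rules, local_rules, top_ranked):
    """Return (rule name, technique ID) pairs where a local replace-override
    shadows an upstream rule annotated with a top-ranked technique."""
    upstream_tags = {r["rule"]: r.get("tags", []) for r in upstream_rules if "rule" in r}
    findings = []
    for item in local_rules:
        override = item.get("override") or {}
        if "rule" not in item or "replace" not in override.values():
            continue  # not a rule entry, or only appends to the upstream rule
        for tag in upstream_tags.get(item["rule"], []):
            if TECHNIQUE_TAG.match(tag) and tag in top_ranked:
                findings.append((item["rule"], tag))
    return findings


# Toy rule sets, for illustration only.
upstream = [
    {"rule": "Hypothetical LSASS Access", "tags": ["mitre_credential_access", "T1003.001"]},
    {"rule": "Hypothetical Noisy Rule", "tags": ["T1053"]},
]
local = [
    {"rule": "Hypothetical LSASS Access", "enabled": False, "override": {"enabled": "replace"}},
    {"rule": "Hypothetical Noisy Rule", "condition": "and not proc.name=cron",
     "override": {"condition": "append"}},
]
print(shadowed_techniques(upstream, local, {"T1190", "T1003.001"}))
# [('Hypothetical LSASS Access', 'T1003.001')]
```

The lookup goes through the upstream rule's tags because a local override usually restates only the fields it changes, so the MITRE annotation is not present in the local file.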

ENDPOINT-STACK-AUDIT-003 is the Wazuh AMSI decoder probe. The Wazuh 4.7 release added official AMSI decoder support, but customers running an older ruleset against a 4.7+ binary do not get it. The probe checks the manager API banner for the version, then checks the decoders directory for the AMSI decoder. Fails when binary is 4.7+ and the AMSI decoder is absent. Without it, PowerShell AMSI bypass — the single most-common Windows post-exploitation technique we observe in 2025-2026 — goes unflagged. T1059.001 plus T1027 plus a T1562.001 rationale because the customer's binary-versus-ruleset drift is itself a defender-impair condition.
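The drift check reduces to a version comparison plus a file-presence test. In this sketch the version string is assumed to have been read from the manager already, and matching decoder files on the substring "amsi" is an assumption about naming, not a documented Wazuh convention.

```python
import re


def wazuh_amsi_gap(version_string, decoder_filenames):
    """True when the binary is 4.7 or later but no AMSI decoder file is present."""
    match = re.search(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    # Compare as integer tuples so that 4.10 sorts after 4.7.
    version = (int(match.group(1)), int(match.group(2)))
    has_amsi = any("amsi" in name.lower() for name in decoder_filenames)
    return version >= (4, 7) and not has_amsi


# Hypothetical decoder listings.
print(wazuh_amsi_gap("v4.7.2", ["0005-json_decoders.xml"]))           # True: drift detected
print(wazuh_amsi_gap("v4.6.0", ["0005-json_decoders.xml"]))           # False: older binary
print(wazuh_amsi_gap("v4.10.1", ["hypothetical-amsi_decoders.xml"]))  # False: decoder present
```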

ENDPOINT-STACK-AUDIT-004 is the auditd backlog-loss probe. Static-mirror parse of auditctl -s output for lost > 0. Any non-zero lost value means auditd is dropping events under load, and whatever ATT&CK techniques those dropped events would have flagged are unobservable post-hoc. Backlog loss is silent (there is no alert when events drop), and an attacker who triggers high syscall volume can deliberately dilute their malicious syscall stream into the noise. Direct T1499.001 (OS Exhaustion Flood) and T1070 (Indicator Removal) abuse vector.
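A minimal parser for a mirrored copy of the status output, assuming the usual one-field-per-line "name value" layout. The function name and the sample text are illustrative.

```python
def audit_lost_count(auditctl_status):
    """Return the `lost` counter from `auditctl -s` text, or None if it is absent."""
    for line in auditctl_status.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "lost":
            return int(parts[1])
    return None


# Sample mirrored output, for illustration only.
sample = """enabled 1
failure 1
pid 812
rate_limit 0
backlog_limit 8192
lost 1437
backlog 0
"""
lost = audit_lost_count(sample)
print(lost)                            # 1437
print(lost is not None and lost > 0)   # True: the probe fails
```

Returning None for a missing field keeps "no data" distinct from "zero events lost", which would otherwise read as a pass.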

ENDPOINT-STACK-AUDIT-005 is the commercial-EDR banner-only probe. Commercial EDRs do not ship their detection rules in customer-readable form. The auditor cannot validate claimed coverage by parsing the rule pack, because the rule pack is not exposed. The probe fingerprints commercial-EDR banners (the canonical five) and fails as INFO-level when an EDR banner is detected and no rule-pack mirror is available. The finding documents the assumption gap so the SOC can flag it explicitly in compliance attestations. The remediation guidance is direct: request a coverage attestation from your EDR vendor naming the specific MITRE techniques covered with rule IDs, and run an active canary-probe campaign against the in-the-wild top 40 to measure actual detection. Anything else is assumed coverage, not validated coverage.

6. What to demand from your vendor

The reason this article is structured around an engine and a probe library is that, with both in hand, you can replace marketing arguments with measured arguments in your next vendor review. Three questions, in order, are sufficient.

First: which Enterprise techniques are you blind to, and what is their in-the-wild prevalence rank? A vendor that has done the work can answer this in a structured table. A vendor that has not will pivot to overall-coverage percentages or to "we cover the techniques our customers care about." Press. The right answer is a list of technique IDs and a rationale. The honest answer is short. The dishonest answer is long and percentage-laden.

Second: what is your latest canary-test result against the in-the-wild top 40? This is the question the Engenuity rounds answer for the vendors who choose to participate. A vendor running an internal red-team programme (most credible vendors do) will have measured detection rates against the top 40, broken down by technique, against their default rule pack on a stock customer install. If they cannot produce the result, they have not measured. If they produce a result that is a single percentage, ask for the breakdown. If they produce a breakdown that exactly matches the marketing percentage, ask which round it was measured in and against which install. A measured number ages and degrades; a marketing number does not. The asymmetry tells you which is which.

Third: can you show me the five most-exploited techniques you do not cover? This is the inversion of the standard sales question. Every vendor has a list. The mature ones publish it. The honest ones know which techniques their architecture cannot reach (kernel-side defenders are blind to userspace-only TTPs, agent-based defenders are blind to gVisor and Kata Containers, cloud-only defenders are blind to on-prem lateral movement) and will name them. The vendor that claims to cover everything, including the techniques their architecture structurally cannot observe, is signalling that they have not done the analysis — or that they have, and they are choosing not to share it.

The supplementary fourth question, useful for due-diligence rather than first-meeting: can the customer mirror your rule pack and audit it? The answer for open-source defenders is yes by definition. For commercial defenders, the answer is increasingly yes — leading EDR vendors now ship a redacted detection-rule manifest to enterprise customers under NDA, precisely because enterprise customers have started asking. A vendor that refuses is not necessarily a bad vendor; their detection rules are their crown jewels. But the refusal converts measured coverage into claimed coverage, and the customer should price the difference.

7. Closing

The objective of this article is not to argue that any specific vendor is bad. Several of the defender products in the SCA fingerprint map are excellent at the things they were built for. The objective is to argue that the unit of coverage that matters operationally is the customer's installed rule pack, validated by canary-grade probes, ranked by in-the-wild prevalence — and that this unit is almost never what the vendor is reporting in a sales conversation.

The MITRE Engenuity rounds make this gap empirically visible roughly every twelve to eighteen months for a small cohort of vendors who choose to participate. The Stack Coverage Auditor makes it visible quarterly, on the customer's actual install, against three matrices instead of one. The methodology is transparent at the framework level: three curated catalogues, a four-tier in-the-wild prevalence map, a canary library targeted at the top of the prevalence curve, five tests, and an engine with no network dependencies. Detection engineering teams that want to replicate the framework in-house can do so in roughly a sprint.

The harder change is cultural. Replace "we have 95-percent ATT&CK coverage" with "we have measured detection on most of the high-prevalence TTPs against our installed rule pack, and the five most-exploited blind techniques are on the next sprint's backlog." The first sentence is comforting and unfalsifiable. The second is uncomfortable and falsifiable. Detection engineering leads who spend 2026 making the second sentence routine inside their organisations will spend 2027 with measurably less risk.