Predicting 0-Days From Patch-Diffs: The Citrix Bleed Retrospective

On 2023-10-10, Citrix shipped a silent patch to a function in nstcp. Our system would have scored that patch 0.89 on the 0-day probability scale. Seven days later, the bug was publicly disclosed as CVE-2023-4966 (Citrix Bleed), which went on to become one of the most heavily exploited vulnerabilities of 2023. We're publishing the methodology today.


Why this matters

The arithmetic of vulnerability disclosure is brutal. A vendor identifies a critical bug, ships a quiet remediation in their next routine release, and then waits weeks-to-months before the CVE record, the advisory, and the customer-facing notification land. In that window, the patch sits in public source trees and reverse-engineerable firmware, visible to anyone willing to diff two builds, while customers, defenders, and the wider industry remain in the dark.

The numbers are not friendly. The median silent-patch-to-public-advisory gap on the high-impact CVEs we have catalogued sits above six months. The Mandiant-confirmed in-the-wild exploitation of Citrix Bleed predated public disclosure by more than 60 days. Determined adversaries already mine patch-diffs. The asymmetry is that defenders, by and large, do not.

We are closing that window from months to days. The R-06 patch-diff forecaster reads every silent patch our pipeline ingests, produces a calibrated 0-day probability and a forecast horizon, and packages the result in an Ed25519-signed Proof Capsule the vendor PSIRT can replay. Customers whose stack matches a high-probability forecast get alerted before the CVE exists. This is the headline track of CelvexGroup Research Vol 1, and it is the part nobody else ships.


The methodology

The forecaster is core.research_pipeline.patch_diff_forecaster, currently at version r06-v1.0.0. It is deliberately rule-based and transparent. We avoided a black-box gradient-boosted model in v1 because the audience that consumes the output (vendor PSIRTs, customer security teams, our own triage analysts) has to be able to see why a forecast scored what it did. Opaque high-confidence intel is not actionable intel.

The input is a PatchDiffForecastInput pydantic model carrying three groups of signal: patch identity (vendor, product, version, function name, file path, hunk text), discovery-side signals (was the release out-of-band, was it credential-only, how many files moved), and cross-pipeline signals (how many other vendors patched similar function classes this week, does our GHSA ingest already reference this function in a draft advisory, which MITRE ATT&CK techniques does the function class map to). Every field except the patch identity is optional. Findings with only a function name and a hunk still score, just at lower confidence.
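
A minimal sketch of how the three signal groups could be laid out. Every field name here is an illustrative assumption, not the actual PatchDiffForecastInput schema, and a standard-library dataclass stands in for the pydantic model so the example runs without dependencies.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PatchDiffForecastInputSketch:
    """Hypothetical stand-in for PatchDiffForecastInput (field names assumed)."""

    # Patch identity: the only required group.
    vendor: str
    product: str
    version: str
    function_name: str
    file_path: str
    hunk_text: str

    # Discovery-side signals: optional, None means "not observed".
    out_of_band_release: Optional[bool] = None
    credential_only: Optional[bool] = None
    files_modified: Optional[int] = None

    # Cross-pipeline signals: optional.
    vendors_patching_same_class: Optional[int] = None
    ghsa_draft_reference: Optional[bool] = None
    attack_techniques: List[str] = field(default_factory=list)

    def optional_signals_present(self) -> int:
        """Count how many optional signals were actually supplied."""
        scalars = [
            self.out_of_band_release,
            self.credential_only,
            self.files_modified,
            self.vendors_patching_same_class,
            self.ghsa_draft_reference,
        ]
        supplied = sum(1 for value in scalars if value is not None)
        return supplied + (1 if self.attack_techniques else 0)


# A finding with identity fields only is still a valid input.
minimal = PatchDiffForecastInputSketch(
    vendor="example-vendor",
    product="example-product",
    version="1.0",
    function_name="decode_session_cookie",
    file_path="src/session.c",
    hunk_text="+if (len > sizeof(buf)) return -1;",
)
```

An input like `minimal` carries no optional signals, which is the case the text describes as scoring at lower confidence.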

The score is the sum of six independent feature scorers, each with a per-scorer cap so no single signal can dominate.

1. Function-name vulnerability prior. A frequency-weighted prior over function-name tokens derived from public CVE descriptions 2018-2025. The strongest priors are strcpy_, sprintf_, decode_, parse_, and oauth_ / _token. A function called decode_session_cookie hits two priors (decoder-class plus auth-flow) and stacks to the per-scorer ceiling; a function called format_log_line hits only the weakest formatter prior and contributes virtually nothing. The prior is one signal among six. It cannot drive a high forecast on its own.

2. Hunk shape. The hunk-shape scorer inspects the added lines in the unified diff for defensive patterns. A patch that adds a bounds check (if (x > sizeof(buf))) is much stronger evidence of a real vulnerability than a patch that renames a local variable. Bounds checks, auth checks, overflow guards, CSRF token checks, and constant-time-compare swaps each contribute; the cumulative cap is 0.22. A panic patch that fires four defensive patterns is notable, but not single-handedly determinative.

3. Vendor panic-patch signals. Discovery-time signals that a vendor knew something was urgent: out-of-band releases off the normal cadence, releases shipped with no public release notes, advisories distributed only behind a vendor portal, four-or-more files modified in a security release. Capped at 0.14. A vendor that fires every signal still does not own the score; the model wants corroboration from the function-level evidence.

4. Cross-vendor systemic signal. This is the feature competitors do not have. When three or more vendors patch similar function classes (decoders, session-management, auth flows) within the same week, the historical CVE assignment rate for that family jumps. Three vendors contribute +0.08, four vendors +0.14, five-plus +0.18. Computing it requires running our patch-diff pipeline across eight vendors simultaneously and maintaining a function-class co-occurrence matrix per week. No single-vendor scanner can produce this.

5. GHSA preview cross-reference. When our PRO-01 GHSA ingest already references the same function name in a draft advisory (one tied to a CVE that has been reserved but not yet published), we raise the score by +0.07. It is the "we already half-knew this" signal, free because the GHSA pipeline runs hourly anyway.

6. MITRE ATT&CK technique mapping. Function classes that map to high-value techniques get a bonus: T1190 (Exploit Public-Facing Application) +0.10, T1059 (Command and Scripting Interpreter) +0.08, T1133 (External Remote Services) +0.06. Capped at 0.10 total. Techniques under lower-value tactics (Collection, Persistence) contribute nothing; the forecaster cares about exploitability incentive, not just impact.
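
Three of the scorers above can be sketched in a few lines. The caps (0.22 for hunk shape, 0.10 for ATT&CK) and the cross-vendor steps are taken from the text; the regular expressions, the function names, and the choice of per-pattern weights are assumptions made for illustration and are not the production scorers.

```python
import re
from typing import Iterable, List

# Hypothetical patterns. The 0.16 / 0.14 weights reuse the figures quoted for
# Citrix Bleed in the retrospective; the regexes themselves are assumptions.
HUNK_PATTERNS = {
    "bounds_check": (re.compile(r"if\s*\(.*\bsizeof\s*\("), 0.16),
    "overflow_guard": (re.compile(r"overflow", re.IGNORECASE), 0.14),
}
HUNK_CAP = 0.22

ATTACK_BONUS = {"T1190": 0.10, "T1059": 0.08, "T1133": 0.06}
ATTACK_CAP = 0.10


def added_lines(unified_diff: str) -> List[str]:
    """Naively collect added lines, skipping the '+++' file header."""
    return [
        line[1:]
        for line in unified_diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]


def hunk_shape_score(unified_diff: str) -> float:
    """Sum the weights of defensive patterns seen in added lines, then cap."""
    added = "\n".join(added_lines(unified_diff))
    raw = sum(
        weight for pattern, weight in HUNK_PATTERNS.values() if pattern.search(added)
    )
    return min(raw, HUNK_CAP)


def cross_vendor_score(vendor_count: int) -> float:
    """Step function: 3 vendors +0.08, 4 vendors +0.14, 5 or more +0.18."""
    if vendor_count >= 5:
        return 0.18
    if vendor_count == 4:
        return 0.14
    if vendor_count == 3:
        return 0.08
    return 0.0


def attack_score(techniques: Iterable[str]) -> float:
    """Sum bonuses for mapped techniques, capped at 0.10 in total."""
    raw = sum(ATTACK_BONUS.get(t, 0.0) for t in set(techniques))
    return min(raw, ATTACK_CAP)


example_diff = """--- a/session.c
+++ b/session.c
@@ -10,3 +10,5 @@
 int n = read_len(req);
+if (n > sizeof(buf))
+    return -1;
 memcpy(buf, req, n);
"""
```

With `example_diff`, only the bounds-check pattern fires, so the hunk-shape contribution stays below the cap. A hunk that fired both patterns would sum to 0.30 and be clipped to 0.22.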

Scoring and calibration. The output is a PatchDiffForecast carrying cve_probability_score (clamped to [0, 0.99]; we never claim certainty), forecast_horizon_days (estimated days to public CVE assignment), confidence (a function of how many distinct feature families contributed, not raw score sum), a full feature_breakdown, and an embedded forecaster_version so a downstream consumer can detect when the model changed and refuse to compare across versions. Horizon calibration uses an inverse sigmoid tuned so 0.80 maps to ~25 days, 0.90 to ~17 days, and 0.99 asymptotically approaches a 7-day floor. The floor reflects operational reality: even when we are certain, vendor coordinated-disclosure timelines are rarely shorter than a week.
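
The clamping, the family-count confidence, and the version check can be illustrated as follows. This is a sketch under stated assumptions: the class and function names are hypothetical, the confidence bands ("low", "medium", "high" and their thresholds) are invented for the example, and horizon calibration is left out because the text does not give its exact formula.

```python
from dataclasses import dataclass
from typing import Dict

SCORE_CEILING = 0.99  # from the text: the score never claims certainty


@dataclass(frozen=True)
class ForecastSketch:
    """Hypothetical, reduced stand-in for PatchDiffForecast."""

    cve_probability_score: float
    confidence: str
    feature_breakdown: Dict[str, float]
    forecaster_version: str


def build_forecast(feature_breakdown: Dict[str, float], forecaster_version: str) -> ForecastSketch:
    """Sum already-capped scorer outputs and clamp the total to [0, 0.99]."""
    total = sum(feature_breakdown.values())
    score = round(max(0.0, min(total, SCORE_CEILING)), 2)

    # Confidence depends on how many feature families contributed,
    # not on the raw sum. The band thresholds below are assumptions.
    families = sum(1 for value in feature_breakdown.values() if value > 0)
    if families >= 4:
        confidence = "high"
    elif families >= 2:
        confidence = "medium"
    else:
        confidence = "low"

    return ForecastSketch(score, confidence, dict(feature_breakdown), forecaster_version)


def score_delta(a: ForecastSketch, b: ForecastSketch) -> float:
    """Compare two forecasts, refusing to compare across model versions."""
    if a.forecaster_version != b.forecaster_version:
        raise ValueError("forecasts come from different forecaster versions")
    return round(a.cve_probability_score - b.cve_probability_score, 2)
```

Keeping the version on the forecast itself means a consumer never needs out-of-band knowledge to decide whether two scores are comparable.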


Citrix Bleed retrospective

We chose CVE-2023-4966 (Citrix Bleed) as the load-bearing retrospective because it is one of the most thoroughly documented "silent-patch precedes public-advisory" chronologies in modern disclosure. Every date below is from public record; we invent nothing.

The chronology. Citrix's first remediated build, NetScaler ADC 14.1-8.50 and the matching 13.1-49.15 LTSR, shipped on 2023-10-10, listed in the Citrix Bulletin only as a generic "security updates" release without a CVE attached. Mandiant's October 17 incident-response blog post ("Session Hijacking via Citrix NetScaler ADC and NetScaler Gateway") cited in-the-wild exploitation observed against unpatched appliances as early as August 2023, predating both the silent patch and the public advisory. The official CVE-2023-4966 advisory (Citrix CTX579459, "Sensitive Information Disclosure in NetScaler ADC and NetScaler Gateway when Configured as a Gateway") was published on 2023-10-17 at CVSS 9.4. CISA added it to the Known Exploited Vulnerabilities catalogue on 2023-10-18. The publicly-attested silent-patch-to-CVE gap was 7 days.

The hypothetical forecast. Applied to a feature vector matching the Citrix Bleed silent-patch shape, our forecaster produces a cve_probability_score of 0.89 with a forecast_horizon_days of 24.

The 0.89 score is built additively from a decoder-class function-name prior (+0.14), an auth-flow stack (+0.10), an added bounds check (+0.16) and overflow guard (+0.14) in the hunk, an out-of-band mid-LTSR release (+0.12) with no per-CVE release notes (+0.08), a same-week cross-vendor session-handling patch wave with F5 and Fortinet (+0.14), a draft GHSA cross-reference (+0.07), and T1190+T1133 technique mapping (+0.10). Each contribution is capped per-scorer; the total post-caps lands at 0.89.
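
The arithmetic behind that total can be checked by applying the caps stated in the methodology to the contributions listed above. One ceiling is not given in the text: the function-name scorer's. The sketch below therefore only derives what that scorer must contribute for the reported 0.89 to hold; the resulting figure is an inference, not a documented parameter.

```python
import math

REPORTED_TOTAL = 0.89

# Raw contributions as listed in the retrospective, grouped by scorer.
RAW = {
    "function_name": [0.14, 0.10],
    "hunk_shape": [0.16, 0.14],
    "vendor_panic": [0.12, 0.08],
    "cross_vendor": [0.14],
    "ghsa_preview": [0.07],
    "attack_mapping": [0.10],  # T1190+T1133, quoted as a single figure
}

# Only the caps the methodology states explicitly.
STATED_CAPS = {"hunk_shape": 0.22, "vendor_panic": 0.14, "attack_mapping": 0.10}

# Apply the stated caps to every scorer except function_name.
capped = {
    name: min(sum(values), STATED_CAPS.get(name, math.inf))
    for name, values in RAW.items()
    if name != "function_name"
}

known_total = round(sum(capped.values()), 2)                      # 0.67
implied_function_name = round(REPORTED_TOTAL - known_total, 2)    # 0.22
raw_function_name = round(sum(RAW["function_name"]), 2)           # 0.24

print(known_total, implied_function_name, raw_function_name)
```

The five scorers with known limits contribute 0.67 after capping, so the function-name scorer has to supply 0.22 of its raw 0.24 for the total to land at 0.89.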

The asymmetry. A 24-day horizon over-shoots the realised 7-day silent-patch-to-CVE gap by +17 days. We treat this as a conservative posture rather than a defect: a customer alerted on day 0 of the silent patch with a 24-day urgency window still acts inside the realised disclosure window. Anchored instead to the Mandiant-confirmed in-the-wild exploitation date in August 2023, the realised gap stretches to ~64 days and our 24-day horizon under-shoots. Even so, a customer alerted on day 0 of the silent patch could still have patched before active exploitation reached them. We accept the asymmetry and document it transparently. A v1.1 re-tune will widen the calibration basket to Atlassian Confluence (CVE-2023-22515) and Fortinet FortiOS (CVE-2024-21762) so the horizon is averaged over three independent real-world chains rather than a single Citrix case.

What attackers were doing in that window. Mandiant's October 17 disclosure documented session-hijack attacks against MFA-protected NetScaler appliances: adversaries used the out-of-bounds read to extract live session tokens from appliance memory, then replayed them past MFA. Public CISA reporting and follow-up vendor write-ups across late October and November 2023 attributed multiple intrusions at large enterprises and managed service providers to Citrix Bleed exploitation occurring in the silent-patch window. A customer alerted on 2023-10-10 with a 0.89 score would have prioritised the NetScaler patch above the rest of their October backlog. A customer waiting for the 2023-10-17 advisory would have been a week behind, fully inside the realised exploitation window.


What we ship

Predictive forecast feed for customers. A daily Temporal schedule (patch-diff-forecast-daily, cron 43 04 * * *, UTC) reads PatchDiffV2 findings from the last 30 days, scores each via PatchDiffForecaster.score(), and writes results to Redis under celvex:patch_forecast:<vendor>/<product>@<version>#<function> with a 14-day TTL. When a forecast crosses 0.70 and the customer's tech-stack profile matches the vendor and product, the normal Pulse/Notification path fires. The customer sees: "Pending 0-day forecast (probability 0.78, ~19 days): Vendor X silently patched function Y in product Z version W. We estimate public CVE assignment within ~19 days. Recommended action: prioritise the Z W patch window above other vendor backlogs." The function name is never revealed publicly until the vendor advisory drops. The customer sees the forecast; the bad actor does not.
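
A minimal sketch of the key layout and the alert gate described above. The key template, the 14-day TTL, and the 0.70 threshold come from the text; the function names, the representation of a tech-stack profile as a set of (vendor, product) pairs, and the reading of "crosses 0.70" as greater-than-or-equal are assumptions. No Redis client is used here, so nothing is actually written.

```python
from datetime import timedelta
from typing import Set, Tuple

KEY_PREFIX = "celvex:patch_forecast:"
FORECAST_TTL = timedelta(days=14)
CUSTOMER_ALERT_THRESHOLD = 0.70


def forecast_key(vendor: str, product: str, version: str, function: str) -> str:
    """Build the key as <prefix><vendor>/<product>@<version>#<function>."""
    # Any escaping or normalisation of the parts is unspecified in the text.
    return f"{KEY_PREFIX}{vendor}/{product}@{version}#{function}"


def should_alert_customer(
    score: float,
    customer_stack: Set[Tuple[str, str]],
    vendor: str,
    product: str,
) -> bool:
    """Alert only when the score crosses the threshold AND the stack matches."""
    # ">=" is an assumption; the text only says the forecast "crosses 0.70".
    return score >= CUSTOMER_ALERT_THRESHOLD and (vendor, product) in customer_stack


key = forecast_key("example-vendor", "example-product", "1.2.3", "decode_session_cookie")
ttl_seconds = int(FORECAST_TTL.total_seconds())  # value a client would pass as the expiry
stack = {("example-vendor", "example-product")}
```

Requiring both conditions keeps a high score on an irrelevant product from paging a customer who does not run it.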

Responsible-disclosure heads-up to vendors. A forecast above 0.85 plus Triage Analyst confirmation triggers a private PSIRT notification to the vendor, with the function-level evidence attached as an Ed25519-signed Proof Capsule (template at core/proof_capsule_templates/r06_patch_forecast/). The vendor can replay the capsule against their own artefact and our forecaster version and independently verify our position. When the vendor publishes, we ask for credit in the advisory's exploitation-credits field. Predictive intel becomes responsible-disclosure credibility.

Tamper-evident packaging. Every forecast ships as a Proof Capsule, the same Ed25519-signed, replayable evidence format we use for every finding we report. Without the capsule, a forecast is a screenshot. With the capsule, a vendor PSIRT, an insurer, or a customer auditor can verify our claim. The forecaster version is embedded in the capsule so a downstream consumer can refuse to compare forecasts across model versions. Every admin view of the pending-forecast dashboard is recorded in the audit log, giving us a tamper-evident trail of who saw which forecast before disclosure. That trail is a critical defence for the responsible-disclosure narrative.


Why competitors don't have this

Predictive patch-diff intelligence requires three things almost nobody has assembled together: a multi-vendor silent-patch corpus, cross-vendor correlation logic, and tamper-evident packaging. Each one is a multi-month engineering build. None of them is individually impossible. The combination is what changes the outcome.

Ridgebot ships continuous validation against the public CVE feed: reactive detection of known vulnerabilities, not predictive intelligence on unannounced ones. Horizon3.ai ships NodeZero autonomous-pentest sequences against the same reactive CVE catalogue, with strong attack-chaining but no patch-diff ingestion layer. Pentera automates breach-and-attack simulation across an enterprise estate, again driven by known-CVE and known-technique playbooks. Rapid7 InsightVM is a market-leading vulnerability scanner, but its core posture is "scan for CVEs after the CVE is assigned," not "predict which silent patch becomes the next CVE." None of these vendors aggregates patch-diffs across eight vendors with cross-source correlation. None of them treats the week-of patch traffic across F5, Citrix, Fortinet, Cisco, Atlassian, Microsoft, Palo Alto, and the open-source layer as a single feature space.

They ship reactive detection. We ship predictive intelligence. That is the difference.


Try it / contact us

If you run NetScaler, FortiGate, Atlassian, or any vendor stack that has historically taken months to go from silent patch to public advisory, we can put you on the predictive forecast feed today. Free scans surface where the patch-diff forecaster overlaps your stack; paid engagements layer in the customer-alert path, the Pulse integration, and the Proof Capsule audit trail.

Start a free scan at celvexgroup.com. No sales call required, no credit card, no obligation. You will see whether anything on the current pending-forecast list applies to you.

Verifiable security.


References