Why Clinical AI Procurement Scorecards Fail

Two thirds of sepsis cases missed. At the alerting threshold studied, 88% of alerts were false alarms. Alerts on 18% of all hospitalised patients^[1]. That is how the Epic Sepsis Model performed in independent external validation: an AUROC of 0.63 against vendor-reported performance of 0.76–0.83. By then it was already running in hundreds of US hospitals, in the workflows of intensive care units and emergency departments. It had been deployed without a prospective trial.

I spent over a decade in emergency medicine watching tools like this arrive. The pattern repeats. A pilot study with selected sites and selected outcomes. A KPI deck. Twelve months later the alerts are silenced by default and the registrars route around the tool to get on with the shift.

If you are a CMIO, CCIO, CIO, CAIO, head of digital health, clinical governance lead, or procurement leader at a health system in the Gulf or Hong Kong, you are about to procure a generation of clinical AI.

Most vendors selling into this market have not validated their tools in Gulf Arab or East Asian populations.

Most procurement frameworks you have were built for somewhere else.

Most clinical AI procurement processes evaluate a vendor on three things: the strength of their clinical evidence, the ease of EHR integration, and the price. This is what hospital procurement was built to do: buy infusion pumps, buy beds, buy point-of-care analysers. That purchasing model maps poorly onto a tool that learns from a population it has never seen and behaves differently in deployment than in pilot.

The Epic Sepsis Model would have passed a composite-score procurement review. Strong clinical claims, on the vendor’s own data. Native Epic integration. Reasonable cost. The collapse happened in three domains that a composite review averages out: external validation in the actual deployment population, equity of performance across patient subgroups, and the downstream behaviour of clinicians once the false-positive burden became unworkable. A weighted score lets a strong evidence claim cancel out a weak governance posture. That is how composite-led procurement put a tool with an external AUROC of 0.63 into hundreds of hospitals.

I did not leave that as a claim. I ran the Epic Sepsis Model through the full evaluation instrument described below, restricted to evidence that was publicly available in August 2021 — what a buyer could have known at the time. Composite score: 37 out of 100. Recommendation: Do Not Proceed. The safety interlock fired on two domains, and the headline AUROC was responsible for neither. Population Validity scored 20: no published demographic composition of the training cohort, no subgroup analyses, deployment across hundreds of heterogeneous hospitals with no site-level validation requirement. Operational Performance scored 20: an 18% alert rate, a 12% positive predictive value, roughly eight patients evaluated per true positive, and no mechanism for any site to monitor its own performance against a reference standard. Each of those, alone, mandates Do Not Proceed regardless of the composite. These are the domains a weighted average buries.
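The operational figures hang together arithmetically. A minimal sketch, using only the 12% positive predictive value quoted above; the function names are illustrative and are not part of the evaluation instrument:

```python
def false_alarm_fraction(ppv: float) -> float:
    """Share of alerts that do not correspond to a true case."""
    return 1.0 - ppv


def number_needed_to_evaluate(ppv: float) -> float:
    """Alerted patients a clinician must assess to find one true case."""
    if not 0.0 < ppv <= 1.0:
        raise ValueError("ppv must be in (0, 1]")
    return 1.0 / ppv


ppv = 0.12  # positive predictive value from the external validation
print(f"False alarms among alerts: {false_alarm_fraction(ppv):.0%}")
print(f"Patients evaluated per true positive: {number_needed_to_evaluate(ppv):.1f}")
```

A 12% positive predictive value is the same finding as the 88% false-alarm figure, and it implies roughly eight alerted patients assessed for every true case.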

Epic subsequently overhauled the model — a revised version reached customers from late 2021, and by October 2022 Epic was recommending hospitals train it on their own data before clinical use, with the sepsis definition moved to the Sepsis-3 standard^[2]. None of this was publicly knowable at the point of procurement, and the shift to local training itself concedes the point about site-level validity. The evaluation here is a retrospective exercise against the public record as it stood. The question it answers is what was knowable, and catchable, at the moment of procurement. The full scored report — every domain, every figure, and the verbatim evidence appendix — accompanies this article.

Sepsis prediction is the cleanest published example. Radiology triage, deterioration prediction, AI scribes, diagnostic support, and risk stratification tools share the same procurement failure mode.

For health systems in the Gulf and Hong Kong, the gap is wider. Three additional domains carry weight here that NICE and FDA frameworks were never designed to evaluate.

Data sovereignty is often the decisive one. Saudi Arabia’s PDPL, enforced by SDAIA, permits cross-border transfer of personal data only under defined conditions: adequacy assessment of the receiving country, approved safeguards such as standard contractual clauses, and a risk assessment for sensitive data, which expressly includes health data. Qatar’s framework treats healthcare as a sector where the regulator can require effective data residency as a condition of system approval. The point is not that data can never leave the jurisdiction; it is that for clinical AI handling patient data, where the data is processed becomes a gating procurement question the regulator can decide against you, however strong the clinical evidence.

Population validity is a structural question. Most clinical AI is trained on data from American or European tertiary centres. The evidence base for Gulf Arab or East Asian populations remains thin.

In the GCC specifically, Islamic bioethical compatibility is the third gap: whether patient consent, explainability, end-of-life implications, reproductive health use cases, data governance, and human accountability have been reviewed against local religious and ethical expectations. Most vendors selling into the region cannot speak to these when asked.

A framework that scores these dimensions independently and allows a single failed domain to veto the entire procurement is the minimum credible bar.

What follows is the framework behind my Clinical AI Procurement Stress Test, a fixed-scope review of a shortlisted vendor against eight independently scored domains.

The framework started from a single question: what would have stopped Epic Sepsis from being procured? Simply asking for a stronger vendor evidence pack would not have been enough. The missing issue was structural. Procurement needed to require independent external validation, subgroup performance analysis, workflow-burden testing, and post-deployment monitoring — and to treat any one of those as a gating failure.

The framework has eight domains, each scored zero to one hundred against explicit tier criteria. Clinical evidence quality. Population validity in the deployment context. Operational performance under real-world conditions. Workflow integration. Regulatory compliance. Data governance and sovereignty. Implementation maturity, including post-market monitoring of drift and override patterns. And, in GCC contexts, Islamic bioethics compatibility. Each domain carries a weight, and the weighted scores produce a composite. So far this is unremarkable.

Three design choices do the work.

The first is that domains are scored independently before they are weighted. A strong evidence claim cannot cancel out a weak governance posture. Both scores sit in the report side by side. The buyer cannot scroll past one to get to the other. This is what the Epic Sepsis procurement reviews failed to do.

The second is a cross-domain safety interlock. Any single domain scoring below thirty out of a hundred (the Insufficient band) triggers a mandatory Do Not Proceed regardless of composite. A vendor can score 95 on every other domain and still be blocked. Some questions are gating questions. You cannot trade off jurisdictional compliance for clinical evidence: if the vendor’s data-processing arrangement cannot be brought within what the regulator will approve for patient data, no amount of clinical strength buys past it. The interlock is reserved for structural failures of that kind; remediable issues are handled by the red flag layer below.

The third is a layer of automatic red flags below the interlock. Specific findings cap the affected domain at forty regardless of how well it scores elsewhere: patient data leaving the jurisdiction, an unaddressed bias in a demographically relevant subgroup, no published prospective validation. The domain is wounded, not killed. The composite still moves and the recommendation still gets written, but the cap surfaces the issue in a way no weighted average ever will.
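The interlock and the red-flag cap can be sketched in a few lines. This is an illustration under stated assumptions, not the instrument itself: the class and function names and the wording of the returned strings are hypothetical; only the thresholds (30 and 40) and the domain names come from the text.

```python
from dataclasses import dataclass, field
from typing import List

INTERLOCK_THRESHOLD = 30  # below this a domain sits in the Insufficient band
RED_FLAG_CAP = 40         # ceiling for a domain carrying an automatic red flag


@dataclass
class DomainResult:
    name: str
    raw_score: float  # 0-100, scored against the tier criteria
    red_flags: List[str] = field(default_factory=list)

    @property
    def score(self) -> float:
        """Domain score after the red-flag cap, before any weighting."""
        if self.red_flags:
            return min(self.raw_score, RED_FLAG_CAP)
        return self.raw_score


def recommend(domains: List[DomainResult]) -> str:
    """Check every domain on its own; the composite plays no part here."""
    failed = [d.name for d in domains if d.score < INTERLOCK_THRESHOLD]
    if failed:
        return "Do Not Proceed (interlock: " + ", ".join(failed) + ")"
    flagged = [d.name for d in domains if d.red_flags]
    if flagged:
        return "Proceed, red flags on: " + ", ".join(flagged)
    return "Proceed"


OTHER_DOMAINS = [
    "Clinical Evidence Quality", "Operational Performance",
    "Workflow Integration", "Regulatory Compliance",
    "Data Governance and Sovereignty", "Implementation Maturity",
    "Islamic Bioethics Compatibility",
]
strong = [DomainResult(name, 95) for name in OTHER_DOMAINS]

# A vendor at 95 everywhere else is still blocked by one Insufficient domain.
blocked = recommend(strong + [DomainResult("Population Validity", 25)])

# A red flag caps the domain at 40 but does not trip the interlock.
capped = DomainResult("Population Validity", 85,
                      ["unaddressed bias in a relevant subgroup"])
surfaced = recommend(strong + [capped])

print(blocked)
print(surfaced)
```

The design point is that `recommend` never sees a weight. The gating decision is made on the per-domain scores, so no strength elsewhere can offset a failed domain.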

Above all of this sits the jurisdiction overlay. The same eight domains apply in the GCC, Hong Kong, the UK, and the US, but the weights shift and the regulatory questions inside each domain are localised. Regulatory compliance carries more weight in the GCC. The eighth domain is Islamic bioethics compatibility in GCC contexts; the same slot accommodates the locally relevant ethical framework elsewhere, and folds into other domains in secular healthcare systems. The clinical evaluation questions do not change. The regulatory environment does.
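One way to picture the overlay is as a table of weights keyed by jurisdiction, applied to the same eight domain scores. In the sketch below every weight is invented for illustration, apart from the 9% on data governance in the GCC overlay, which is the figure quoted later in the worked example. The generic name for the eighth slot and the example scores are also assumptions.

```python
from typing import Dict

ETHICS = "Ethical Framework Compatibility"  # Islamic bioethics in the GCC

# HYPOTHETICAL weights for illustration only. Each overlay sums to 1.0.
OVERLAYS: Dict[str, Dict[str, float]] = {
    "GCC": {
        "Clinical Evidence Quality": 0.20,
        "Population Validity": 0.15,
        "Operational Performance": 0.20,
        "Workflow Integration": 0.08,
        "Regulatory Compliance": 0.15,  # weighted more heavily in the GCC
        "Data Governance and Sovereignty": 0.09,
        "Implementation Maturity": 0.08,
        ETHICS: 0.05,
    },
    "UK": {
        "Clinical Evidence Quality": 0.25,
        "Population Validity": 0.15,
        "Operational Performance": 0.20,
        "Workflow Integration": 0.10,
        "Regulatory Compliance": 0.10,
        "Data Governance and Sovereignty": 0.10,
        "Implementation Maturity": 0.10,
        ETHICS: 0.00,  # folded into other domains in this overlay
    },
}


def composite(scores: Dict[str, float], jurisdiction: str) -> float:
    """Weighted composite of independently scored domains for one overlay."""
    weights = OVERLAYS[jurisdiction]
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("overlay weights must sum to 1.0")
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the overlay's domains")
    return sum(weights[d] * scores[d] for d in weights)


# Made-up scores for a single hypothetical vendor.
scores = {
    "Clinical Evidence Quality": 80, "Population Validity": 60,
    "Operational Performance": 70, "Workflow Integration": 75,
    "Regulatory Compliance": 50, "Data Governance and Sovereignty": 65,
    "Implementation Maturity": 70, ETHICS: 55,
}
gcc = composite(scores, "GCC")
uk = composite(scores, "UK")
print(f"GCC composite: {gcc:.1f}, UK composite: {uk:.1f}")
```

The same vendor, with the same domain scores, lands on a different composite under each overlay. The weak regulatory score costs more where regulatory compliance is weighted more heavily.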

That’s the architecture. The next question is what it catches that a composite-only review would miss.

What follows is a worked example. The vendor is synthesised, not a real client; the behaviours are composites drawn from real procurement situations.

Consider a sepsis prediction tool being evaluated for deployment at a major GCC tertiary centre. Three-site prospective RCT including one Gulf site, published in NEJM Evidence. CONSORT-AI compliant. Independent DSMB oversight. Primary endpoint 28-day mortality. Sample size 6,200. Subgroup performance shows AUROC parity within 0.04 across Gulf Arab, South Asian, and Western populations, though the Gulf Arab subgroup is the smallest of the three. Real-world NPV of 94%. Sub-300ms latency. 99.7% uptime. Native Oracle Health integration. Sustained 78% clinician adoption at 18 months. MOPH Qatar registration obtained, SFDA application under review, ISO 13485 with AI-specific lifecycle compliance documented. Transparency report in English and Arabic reviewed by the hospital’s bioethics committee. Sharia advisory input documented. Series D, $220 million raised, revenue-positive, 55 hospital clients. Source code escrowed.

The composite score, weighted across the eight domains for the GCC overlay, is 82.2 out of 100. The framework’s score band for that range is Strong. A composite-led procurement review would approve and proceed.

The framework’s recommendation reads:

data sovereignty remains a critical blocker requiring resolution before full deployment.

Here is what is happening underneath the composite.

Clinical Evidence Quality: 90. Population Validity: 85. Operational Performance: 90. Workflow Integration: 85. Regulatory Compliance: 80. Implementation Maturity: 85. Islamic Bioethics Compatibility: 85.

Data Governance and Sovereignty: 40.

The vendor processes all patient data on servers physically located in the United States. The Master Services Agreement grants the vendor rights to use aggregated, de-identified hospital data for product improvement, without an opt-out mechanism, and does not name the hospital as the sole data owner. The vendor acknowledges the compliance issue for Qatar deployment and is working on a regional hosting solution, with no committed timeline.

The red flag mechanism caps the domain at 40. The interlock does not fire. 40 is above the 30 threshold, so the technical recommendation is Proceed. But the recommendation summary, the jurisdiction notes, and the domain-level report all surface the same thing: data sovereignty is the deal-killer that the vendor has not solved, and procurement cannot proceed at this site until they do.

The worked example demonstrates two things.

First, the composite alone misleads. 82.2 is a strong score. A weighted-average procurement process would green-light this vendor on the strength of seven domains scoring 80 or above. The eighth domain, carrying only 9% weight in the composite, will block the deployment on the day of go-live. The framework forces that domain to be read on its own terms.

Second, the surfacing mechanism matters. The red flag did not kill the procurement. It capped the domain, flagged the issue in the recommendation language, and shifted the burden onto the vendor to remediate. A binary approve-or-reject tool would have either let the deal through or killed it entirely. Neither serves the buyer. This is a strong vendor with one resolvable problem; the framework’s job is to surface it as such.
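The first point can be checked with arithmetic on three published figures: the 82.2 composite, the 9% weight, and the capped score of 40. The sketch below assumes the composite is a plain weighted sum, as described earlier; the function name is illustrative.

```python
COMPOSITE = 82.2   # composite for the worked example
DG_WEIGHT = 0.09   # weight of Data Governance and Sovereignty (GCC overlay)
DG_SCORE = 40      # domain score after the red-flag cap


def composite_with(new_dg_score: float) -> float:
    """Composite if only the data-governance score changed."""
    return COMPOSITE + DG_WEIGHT * (new_dg_score - DG_SCORE)


floor = composite_with(0)      # domain scored at zero
ceiling = composite_with(100)  # domain scored perfectly
print(f"Composite range driven by this domain: {floor:.1f} to {ceiling:.1f}")
```

Even a score of zero on data governance would leave the composite at 78.6. The domain that decides whether the tool can go live moves the headline number by nine points at most.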

For the procurement and clinical governance leads reading this: much of what the framework enforces can be applied to your own process. A few practices stand out.

Score domains independently and report them independently. If you must compute a composite, publish the eight scores alongside it and never show one without the others. The composite is for summary; the scores are for decision-making.

Define your own gating questions before the vendor walks in. Decide which dimensions force a Do Not Proceed regardless of how strong the rest looks. For GCC sites, that probably includes data sovereignty, subgroup equity in your patient population, and evidence of prospective validation. For Hong Kong, it includes data residency and integration with your existing CDSS layer. Write the list down. Show it to the vendor. Make them speak to it in writing, on the first call.

Ask for the data processing geography on the first call, not the last. Where the data is physically stored. Where the inference runs. Who has access. What the MSA says about training rights, retention, and aggregated data use. If those questions are punted, that is the answer.

Ask what post-market monitoring the vendor commits to. Model drift over time. Alert fatigue and override rates. Subgroup performance tracked after deployment. A retrain cadence. A retire criterion. If the vendor’s answer ends at handover, the lifecycle of this tool will end there too.
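What such a commitment could look like at site level is sketched below, under stated assumptions. The thresholds, field names, and monthly figures are all made up; a real monitoring plan would agree them with the vendor and the clinical governance lead.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MonthlyAlertStats:
    month: str
    alerts: int          # alerts fired at this site
    overridden: int      # alerts dismissed or silenced by clinicians
    true_positives: int  # alerts confirmed against the site's reference standard

    @property
    def override_rate(self) -> float:
        return self.overridden / self.alerts if self.alerts else 0.0

    @property
    def ppv(self) -> float:
        return self.true_positives / self.alerts if self.alerts else 0.0


def breaches(history: List[MonthlyAlertStats],
             max_override_rate: float = 0.80,  # hypothetical threshold
             min_ppv: float = 0.15             # hypothetical threshold
             ) -> List[Tuple[str, str]]:
    """Return (month, reason) for every month that breaches a threshold."""
    found = []
    for m in history:
        if m.override_rate > max_override_rate:
            found.append((m.month, f"override rate {m.override_rate:.0%}"))
        if m.ppv < min_ppv:
            found.append((m.month, f"PPV {m.ppv:.0%}"))
    return found


# Made-up figures for one site.
history = [
    MonthlyAlertStats("2025-01", alerts=400, overridden=220, true_positives=88),
    MonthlyAlertStats("2025-02", alerts=450, overridden=310, true_positives=81),
    MonthlyAlertStats("2025-03", alerts=520, overridden=442, true_positives=62),
]
for month, reason in breaches(history):
    print(month, reason)
```

The useful part is the contract rather than the code: each threshold becomes a written trigger for review, retraining, or retirement, and the site holds the data needed to check it.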

When a vendor presents subgroup performance, ask whether the subgroups in their study match the subgroups in your patient population. Most clinical AI is trained on American or European data. “Validated across diverse populations” usually means validated across diverse American populations. That is not the same thing.
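That comparison can be made mechanical. A rough sketch, assuming the vendor discloses patient counts per subgroup and the site knows its own population mix; the group labels, counts, shares, and both thresholds are invented for illustration and describe no real study or hospital.

```python
from typing import Dict, List, Tuple


def coverage_gaps(study_counts: Dict[str, int],
                  local_shares: Dict[str, float],
                  min_patients: int = 500,        # hypothetical threshold
                  min_local_share: float = 0.05   # hypothetical threshold
                  ) -> List[Tuple[str, float, int]]:
    """Local groups that matter at this site but are thin in the vendor study.

    Returns (group, local share, patients in study), largest local share first.
    """
    gaps = []
    for group, share in local_shares.items():
        if share < min_local_share:
            continue
        n = study_counts.get(group, 0)
        if n < min_patients:
            gaps.append((group, share, n))
    return sorted(gaps, key=lambda g: g[1], reverse=True)


# Made-up vendor validation cohort and made-up local population mix.
vendor_study = {"Western": 5000, "South Asian": 200}
local_population = {
    "Gulf Arab": 0.45, "South Asian": 0.35,
    "Other Arab": 0.12, "Western": 0.08,
}

gaps = coverage_gaps(vendor_study, local_population)
for group, share, n in gaps:
    print(f"{group}: {share:.0%} of local patients, {n} in vendor study")
```

In this invented case the only well-covered group is the smallest one locally, which is the situation the question to the vendor is designed to expose.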

Demand independent external validation in a setting comparable to yours, not vendor-supplied validation on data the vendor has chosen. Epic reported an AUROC of 0.76–0.83 on its own data. External validation by an independent team found an AUROC of 0.63^[1]. That gap is the difference between a tool that works and a tool that does not.

The Epic Sepsis Model is not an outlier. It is what happens when clinical AI is procured without clinicians in the room and evaluated against frameworks built for a different category of technology. Hundreds of US hospitals deployed it. The independent validation and the reporting that followed are public. The same pattern is now playing out among vendors selling into the Gulf and Hong Kong, with the additional complication that the data, the patients, and the regulatory environment do not match what the vendors validated against.

This framework is one answer. It exists because the questions that determine whether a clinical AI tool works in deployment are not the questions vendor pitch decks answer, and the cost of getting that mismatch wrong — in alert fatigue, in deployed-then-silenced tools, in equity gaps no one was looking for — is borne by clinicians and patients, not vendors.

I’m an emergency medicine physician (MRCEM, UCL) who works on clinical AI evaluation methodology for health systems in the Gulf and Hong Kong, independent of any vendor.