FDA Clears A Slightly Better Sepsis Alert
TREWS beats the competition but sepsis still sneaks through
A new sepsis alert is hitting the market after receiving 510(k) clearance from the U.S. FDA. It’s more accurate than many others in internal testing, and big-time customers are signing up. Could this be “the one”: the algorithm that will save patients’ lives without compromising clinicians’ sanity?
Meet TREWS
Johns Hopkins researchers, led by computer scientists Suchi Saria and Katharine Henry, developed the Targeted Real-Time Early Warning System (TREWS) more than a decade ago. The homegrown algorithm was integrated inside Hopkins’ instance of Epic in an ongoing, pragmatic testing/deployment rollout. Like other modern sepsis flaggers, it lurks in modular code within the EMR, where its machine learning algorithm integrates streams of routinely collected lab data, clinician notes, context of care, medication history, and more, firing when a mysterious threshold of statistical associations within its training data has been exceeded.
TREWS’ Performance On Paper …
In a 2015 paper, they showed TREWS had relatively high discriminative power to help identify or rule out sepsis. Its area under the receiver operating characteristic curve was 0.83, which for a sepsis tool is very good.
For example, when tuned to a specificity of 0.67 (a one-third false positive rate), it correctly identified 85% of patients who went on to develop septic shock, at a median of roughly 28 hours before hypotension developed. In a follow-up paper describing a wider rollout, its reported performance was better overall: a slightly lower sensitivity of 82%, but a much higher specificity (>90%) and AUROC (0.97).
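One way to read the 2015 operating point is through likelihood ratios, which describe how much an alert (or its absence) shifts the odds of septic shock regardless of prevalence. The sketch below is only standard arithmetic on the two numbers quoted above; the likelihood-ratio framing and the variable names are my own, not something taken from the paper.

```python
# Operating point quoted in the post for the 2015 TREWScore paper.
sensitivity = 0.85
specificity = 0.67

# Likelihood ratios: the factor by which a result multiplies the prior odds.
lr_positive = sensitivity / (1 - specificity)   # alert fired
lr_negative = (1 - sensitivity) / specificity   # alert did not fire

print(f"LR+ = {lr_positive:.2f}")
print(f"LR- = {lr_negative:.2f}")
```

At this setting an alert multiplies the odds of septic shock by roughly 2.6, and a silent alert cuts them to about a fifth. That is useful, but far from decisive on its own.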
In Hopkins’ data, TREWS’ performance crushed that of Epic Systems’ sepsis model. In an evaluation by U. of Michigan researchers published in 2021, Epic’s model was only moderately better than chance (AUROC 0.63) and, “despite generating alerts on 18% of all patients, did not detect sepsis in 67% of patients with sepsis,” according to the U of M paper. (Epic argued that in broader third-party use, the AUROC was higher, at ~0.76.)
Does this mean TREWS could solve the problems of alarm fatigue, operational drain, and antibiotic overuse that sepsis alerts create?
Here’s what TREWS’ superior performance looked like from the clinician's perspective at Hopkins, using the waterfall from their Nature Medicine 2022 paper.
… And In the Hospital
During a two-year deployment at five hospitals, the TREWS system screened ~469,000 encounters, and fired ~32,000 times: ~7% of all patient encounters fired alerts.
Among the ~469,000 encounters, ~9,800 were retrospectively adjudicated to have had sepsis (through an automated method described by the study team).
This corresponded to a 2% (adjudicated, inferred) prevalence of sepsis in the overall cohort (n~469K). For the purposes of the study and calculation of test performance characteristics, these 9,800 were considered “positives” (but read the aside below).
TIME OUT!
It’s important to note here that in this cohort, as in every other of this sort, the methodology of case identification of sepsis was circular. Sepsis cases (the 9,800 numerator) were defined inferentially by another algorithm, also using EMR data (diagnosis coding, comorbid conditions, antibiotic receipt, etc.), assigning “sepsis” when its mysterious threshold of association (with what?) was reached.
It’s algorithms all the way down, you see.
Now, back to the post!
TREWS flagged ~8,000 (82%) of the ~9,800 cases that the case-identifying algorithm would later call sepsis: an 82% sensitivity and ~95% specificity.
But even at that impressive performance, the test fired about three times in non-septic patients for every patient correctly flagged as sepsis.
That was because of the low 2% prevalence of sepsis in the cohort (positive predictive value, which depends heavily on disease prevalence, was ~0.25). With an increasing prevalence of sepsis, the alert would fire correctly more often.
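The arithmetic behind those figures can be reproduced from the four rounded counts quoted above. This is a minimal sketch using only the post's approximate numbers, so the outputs are approximate too. The helper `ppv_at` is a hypothetical name, and it assumes sensitivity and specificity stay fixed as prevalence changes, which real deployments do not guarantee.

```python
# Rounded counts quoted in the post for the two-year, five-hospital deployment.
encounters = 469_000      # all screened encounters
alerts = 32_000           # encounters on which TREWS fired
sepsis_cases = 9_800      # encounters adjudicated as sepsis
true_positives = 8_000    # sepsis cases that TREWS flagged

# The remaining cells of the 2x2 table follow by subtraction.
false_positives = alerts - true_positives
false_negatives = sepsis_cases - true_positives
true_negatives = encounters - sepsis_cases - false_positives

prevalence = sepsis_cases / encounters
alert_rate = alerts / encounters
sensitivity = true_positives / sepsis_cases
specificity = true_negatives / (true_negatives + false_positives)
ppv = true_positives / alerts
false_alerts_per_true_alert = false_positives / true_positives


def ppv_at(prev, sens=sensitivity, spec=specificity):
    """Bayes' rule: PPV at a given prevalence, holding sens/spec fixed (assumption)."""
    true_alerts = sens * prev
    false_alerts = (1 - spec) * (1 - prev)
    return true_alerts / (true_alerts + false_alerts)


print(f"prevalence   {prevalence:.1%}")
print(f"alert rate   {alert_rate:.1%}")
print(f"sensitivity  {sensitivity:.1%}")
print(f"specificity  {specificity:.1%}")
print(f"PPV          {ppv:.2f}")
print(f"false alerts per true alert: {false_alerts_per_true_alert:.1f}")

# How the same alert would look in populations with more sepsis.
for prev in (0.02, 0.05, 0.10, 0.20):
    print(f"PPV at {prev:.0%} prevalence: {ppv_at(prev):.2f}")
```

The loop at the end shows why prevalence dominates: the same sensitivity and specificity give a much more trustworthy alert once sepsis is common in the population being screened.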
Validating in Circles
One reason why TREWS performs so well: The same team of computer science and clinical faculty created both TREWS (the algorithm to identify sepsis in real time by EMR surveillance) and the automated algorithm that identified cases as “sepsis,” also based on EMR-extracted data.
TREWS’ machine learning was very likely trained to optimize for the output of “sepsis” as generated by the companion sepsis case-identifying algorithm.
(This is largely unavoidable today in a condition with no gold standard diagnostic method, and all sepsis flaggers are probably designed similarly and subject to similar circularity.)
In other words, TREWS is very good at predicting the outputs of another algorithm it had been trained on (with both designed by the same group).
TREWS is in use by multiple health systems, but has not yet been independently or publicly critiqued in the way that Epic’s model was.
Ready, Set, Monetize
Sometime before 2021, Johns Hopkins and the two lead investigators (Henry and Saria) spun up a for-profit corporation, Bayesian Health, to market AI-based alerting software to health systems.
It emerged from stealth in 2021, reportedly receiving an initial funding round of $15 million from Andreessen Horowitz, and has had multiple funding rounds since then, according to Crunchbase.
Its website lists numerous major health systems as customers (e.g., Cleveland Clinic), some of whose executives’ testimonials report many fewer false sepsis alerts than before using TREWS.
The “TREWS Saves Lives” Claim
One of Bayesian Health’s claims is that TREWS reduces sepsis-associated mortality.
The source for this seems to be a separate Nature Medicine 2022 paper, which found that patients whose caregivers responded to their alerts within three hours had lower mortality than those whose caregivers ignored their alerts for at least three hours.
Obviously, the “ignoring alerts” behavior might be associated with many other potential factors and confounders of a worse outcome, independent of the alert itself.
TREWS might indeed save lives. But Hopkins and the investigators knew how to actually establish that: through a cluster-randomized trial at their (and their collaborators’) many hospitals. That wasn’t done.
In the only large-scale randomized trial of EMR-based warning alerts to date (in Saudi Arabia), the alerts did indeed improve care … for patients without sepsis, likely due to unmeasured operational factors and the Hawthorne effect.
A Small Step Forward
Sepsis is sneaky and often goes unrecognized until it’s too late. Sepsis alerts promise to help with this problem, and deployed at scale might even save thousands of lives annually. This comes at the price of their high false-alarm rates, producing alarm fatigue, clinician frustration, operational drain, and excess antibiotic use.
With a ~20% false negative rate (~80% sensitivity) and 75% false-alarm rate (~0.25 PPV), TREWS does not solve these problems by any stretch, and it has not been evaluated independently and critically in the way that Epic’s model was. It can’t replace clinical judgment. TREWS has not been, and likely never will be, tested in a randomized trial. Its creators and home institution have major (disclosed) financial interests in its success.
Ironically, that last piece (the U.S.’s capitalistic approach to health care) may in the end provide the best evidence we will get for TREWS’ possible advantages over its competition. Investors and health system executives seem to believe in the technology, and the latter are starting to pay for it with their hard-earned revenues. Whoops, I meant with your insurance premiums and Medicare taxes.
“It’s not perfect, but it’s a lot better than Epic” isn’t Bayesian Health’s slogan, but they are welcome to use it (no charge).
References
Adams R, Henry KE, Sridharan A, Soleimani H, Zhan A, Rawat N, Johnson L, Hager DN, Cosgrove SE, Markowski A, Klein EY, Chen ES, Saheed MO, Henley M, Miranda S, Houston K, Linton RC, Ahluwalia AR, Wu AW, Saria S. Prospective, multi-site study of patient outcomes after implementation of the TREWS machine learning-based early warning system for sepsis. Nat Med. 2022 Jul;28(7):1455-1460. doi: 10.1038/s41591-022-01894-0. Epub 2022 Jul 21. PMID: 35864252.
Henry KE, Hager DN, Pronovost PJ, Saria S. A targeted real-time early warning score (TREWScore) for septic shock. Sci Transl Med. 2015 Aug 5;7(299):299ra122. doi: 10.1126/scitranslmed.aab3719. PMID: 26246167.
Wong A, Otles E, Donnelly JP, et al. External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Intern Med. 2021;181(8). doi: 10.1001/jamainternmed.2021.2626.
Henry KE, Adams R, Parent C, Soleimani H, Sridharan A, Johnson L, Hager DN, Cosgrove SE, Markowski A, Klein EY, Chen ES, Saheed MO, Henley M, Miranda S, Houston K, Linton RC 2nd, Ahluwalia AR, Wu AW, Saria S. Factors driving provider adoption of the TREWS machine learning-based early warning system and its effects on sepsis treatment timing. Nat Med. 2022 Jul;28(7):1447-1454. doi: 10.1038/s41591-022-01895-z. Epub 2022 Jul 21. PMID: 35864251.





There are some things that we all need to keep in mind, I think. Sepsis is currently a syndrome, not an entity with specific, unique pathobiology. When the day comes that we diagnose via pathobiology, there will likely not be just "sepsis", but Type A sepsis, Type B sepsis, etc. In the meantime, all sepsis diagnosis is probabilistic, including one's own assessment of the patient. By that I mean that the diagnosis of sepsis gives a probabilistic estimate of the likelihood of dying from infection. The Sepsis 3 criteria, for example, were specifically designed to indicate a high risk or high probability of mortality, without regard to whether the syndrome is treatable. Sepsis 1 and 2, BTW, were aimed at identifying both high risk of mortality (severe sepsis and septic shock) and something more treatable and at high risk of becoming severe sepsis or worse (sepsis). Probabilistic. On the face of it, one might perceive the diagnostic criteria as specific, because they do give specific things to look for, but the heterogeneity of infections and responses to infections dictates that we are lucky to hit the target with our diagnostic arrow and only sometimes hit the bullseye.
There are a couple of things to remember about TREWS and some of the other, similar programs. They are not trying to diagnose sepsis; they are trying to predict it before it is otherwise obvious. That is a whole different ball game. I think one should not conflate predictive algorithms with diagnostic algorithms, as I think you have done here. TREWS is saying "your patient is at high risk of developing the syndrome we know as sepsis", though it will recognize the patient who has already developed sepsis, as well. The diagnostic algorithm you refer to has the benefit of all the information, post hoc. It uses the same criteria to determine if sepsis WAS present during the hospitalization that you might use if you were reviewing the chart on discharge, or perhaps more germanely, that an adjudication panel might use to assess the efficacy of the test. In essence, post hoc determination of whether sepsis was present is substantially a different algorithm, it is somewhat easier to create, and using an algorithm avoids much of the subjectivity that reviewers bring to the task. One has to admit that developing a predictive algorithm that finds patients at risk of developing sepsis identified by the diagnostic algorithm might be easier – but it is principally because the diagnostic algorithm is consistent and not subjective, while adjudicators are not.
It IS fascinating that the prevalence of sepsis in these hospitals is 2%, which is essentially what every epidemiological study one can name also finds. In other words, the post hoc diagnostic algorithm works about as well as people might. You get at this in your write up, but AUC is not the ideal measure of the predictive algorithm when the disease is low prevalence. Raw accuracy is even less informative: it would be about 98% if the algorithm simply said "no one has sepsis, ever". Both the statistical bugaboo and the clinical, real-life one is the precision, or positive predictive value. One cannot help but overcall most cases in this circumstance; many alerts will be false alarms, and that will lead to alarm fatigue.
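The low-prevalence point can be made concrete with the post's rounded counts. This is a minimal sketch, assuming a hypothetical rule that never fires an alert: it looks excellent on raw accuracy while catching no sepsis at all, which is why precision and sensitivity matter more here than a single headline score.

```python
# Rounded counts quoted in the post.
encounters = 469_000
sepsis_cases = 9_800

# Hypothetical "no one has sepsis, ever" rule: it never fires.
true_positives = 0
false_positives = 0
false_negatives = sepsis_cases
true_negatives = encounters - sepsis_cases

accuracy = (true_positives + true_negatives) / encounters
sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)

print(f"accuracy     {accuracy:.1%}")
print(f"sensitivity  {sensitivity:.1%}")
print(f"specificity  {specificity:.1%}")
# Precision (PPV) is undefined here: the rule produces no alerts to be right about.
```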
I view TREWS and other software algorithms of this sort not as diagnostics, but as tools with the power to enrich the pre-test probability for in vitro diagnostics that actually do assess pathobiological aspects of sepsis, i.e. dysregulated host response: tests such as TriVerity, SeptiCyte Rapid, IntelliSep, or MDW, which evaluate mRNA responses or cellular biology. These tests passed muster in EDs or ICUs where the true prevalence of sepsis was also relatively low, i.e. pre-test probability could be considered low. TREWS and other AI-based software algorithms may represent the opportunity to increase the pre-test probability, actually improving the meaning of the results obtained from the in vitro testing.
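The enrichment idea can be sketched with Bayes' rule in odds form. Everything about the downstream test here is an assumption: `IVD_SENS` and `IVD_SPEC` are made-up values for a hypothetical in vitro test, not figures for any named product, and the sketch assumes the test's errors are independent of whether TREWS fired. The two pre-test probabilities are the post's ~2% prevalence and ~0.25 PPV.

```python
def post_test_probability(pretest, sensitivity, specificity, positive=True):
    """Bayes' rule in odds form: post-test odds = pre-test odds x likelihood ratio."""
    if positive:
        likelihood_ratio = sensitivity / (1 - specificity)
    else:
        likelihood_ratio = (1 - sensitivity) / specificity
    pretest_odds = pretest / (1 - pretest)
    post_odds = pretest_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Hypothetical downstream in vitro test (illustrative values only).
IVD_SENS = 0.90
IVD_SPEC = 0.90

scenarios = {
    "unscreened (2% prevalence)": 0.02,
    "after a TREWS alert (PPV 0.25)": 0.25,
}

for label, pretest in scenarios.items():
    if_positive = post_test_probability(pretest, IVD_SENS, IVD_SPEC, positive=True)
    if_negative = post_test_probability(pretest, IVD_SENS, IVD_SPEC, positive=False)
    print(f"{label}: positive test -> {if_positive:.2f}, "
          f"negative test -> {if_negative:.3f}")
```

Under these assumed values, a positive result in an unscreened patient still leaves sepsis unlikely, while the same positive result after an alert makes it the probable diagnosis. That is the sense in which the alert improves the meaning of the test.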
As highly intelligent individuals, we need to stop being simplistic in our diagnostic approach, not just to sepsis, but to most diseases. We need to stop seeking THE test that when “positive” means disease is present and when “negative” means it isn’t. Look at any ROC curve, pick any point on it, and think about what the meaning of that point is. There are no ROC curves that are perfect. Any given point on the curve has a given sensitivity and specificity, and no real-world point gives 1.0 for both. I will toss something out here, though: what if, instead of relying on crude tools, such as our ability to recognize infection and semi-quantitate organ dysfunction, we agreed to say that a result above some level x on a test that actually assesses a biological phenomenon or phenomena we are interested in is what we mean by sepsis? Would that take us a step closer to earlier, more specific treatment that saves lives? In that scenario using a TREWS or similar technique becomes the ideal way to find patients who should be tested. I hope that is the future we are headed towards.
This makes financial sense to the capitalists since in the real world, all those false positives will be coded and billed for as sepsis. “Sepsis ruled in by TREWS” is not much of a leap from the current “ruled in by SIRS” I see every day (prompting Zosyn/vanco for heart failure), except with these models it is much more automatic.