ChatGPT triumphs over academic physicians in diagnostic reasoning challenge
LLMs pass the first round of job interviews
ChatGPT crushed 50 academic attendings and residents in a diagnostic reasoning contest.
At the end of 2023, 26 attendings and 24 residents (mostly in internal medicine at Stanford, Beth Israel/Harvard, and the University of Virginia) read clinical case vignettes and were randomized to answer them either with conventional resources such as decision support tools (e.g., Google search, UpToDate™) or with ChatGPT-4.
The cases were crafted to be complex and difficult. (See an example.) Physicians earned points for accurately providing their top three differential diagnoses, detailed reasoning supporting the final diagnosis they considered most likely, and appropriate testing they would pursue.
Using the large language model did not improve physicians’ diagnostic reasoning over conventional resources. Both attendings and residents scored about 75% of the total points available, with or without AI.
The big finding was in the secondary outcome of ChatGPT’s performance alone: the LLM scored 92% of the available points, 16 percentage points higher than the physicians, whether their human reasoning was supported by the AI or by conventional decision support tools.
The physicians were inexperienced with ChatGPT, and the authors argued that better training or practice with prompting could have improved their performance.
But that’s cold comfort, and it rings hollow.
Unfortunately, the findings provide strong evidence of the superiority of AI over human physician reasoning, even in the complex domain of internal medicine. At least, that is, when the AI is handed all of the relevant information in a single text bolus, isolated from the operational and informational chaos of real-world medical decision-making as it unfolds stepwise through time.
The most crucial information for diagnostic reasoning often resides in patients’ histories and presentations, which are still best obtained by human physicians and non-physician providers. Integrated software/hardware solutions to record, transcribe, and analyze verbal history-taking are already available and being more widely deployed, though.
What inevitably will come next: patients speaking primarily, or only, to an AI bot to provide their entire history, perhaps before seeing a human clinician.
Once accurate histories can be integrated with clinical data from the electronic medical record, it should be expected that AI tools will “practice medicine” at an initial encounter with quality comparable (by whatever metric you choose) to that of human doctors.
After the initial encounter, though, it looks like humans still have a chance to compete.
When presented with cases more reflective of the chaotic reality of healthcare today (information coming from multiple sources at different times, requiring contextual interpretation of borderline abnormal lab values, revision of differential diagnoses according to changes in patients’ clinical trajectory), the LLM Llama 2 (an inferior competitor to ChatGPT) performed poorly.
One of the best insulators that human physicians and their colleagues have against a total takeover of the cognitive aspects of their jobs by AI is the cautious and self-serving behavior of the large electronic medical record vendors. Epic reported making a deal with Microsoft in 2023 to use ChatGPT to auto-reply to patient messages in outpatient physicians’ overflowing in-baskets, but AI features in the inpatient setting are yet to be seen.
Epic has been tight-lipped, but its statements indicate it’s also working hard on its own internal AI models, which sounds like a smart way to avoid being sidelined or made obsolete by handing its valuable anonymized patient data to other tech companies.
The expected inferiority and poor usability of whatever homebrewed features the large vendors come up with should give clinicians temporary protection from major disruption.
If and when legally mandated EHR interoperability, patient data ownership and portability (via APIs) finally begin to take hold, though, patients will be able to feed all their relevant clinical information from the medical record into third-party large language models trained on healthcare contexts. Expect that moments after an H&P is visible in their portal, patients and families will have an AI-crafted differential diagnosis, testing regimen and treatment plan on their smartphones, which they will be eager to helpfully share with you.
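To make that scenario concrete, here is a minimal sketch, assuming a hypothetical FHIR patient-access endpoint and the OpenAI Python client, of how a third-party app could pull a patient’s conditions and labs and ask a general-purpose LLM for a differential. The base URL, patient ID, and model name are placeholders, not any vendor’s actual API, and a real application would need authentication, consent, and de-identification on top of this.

```python
# Sketch of the "portability via APIs" scenario described above.
# The FHIR base URL, patient ID, and model name are hypothetical placeholders.
import requests
from openai import OpenAI

FHIR_BASE = "https://example-ehr.org/fhir"   # hypothetical patient-access endpoint
PATIENT_ID = "12345"                          # placeholder

# Pull the patient's problem list and recent lab observations (standard FHIR searches).
conditions = requests.get(f"{FHIR_BASE}/Condition",
                          params={"patient": PATIENT_ID}).json()
labs = requests.get(f"{FHIR_BASE}/Observation",
                    params={"patient": PATIENT_ID, "category": "laboratory"}).json()

# Ask a general-purpose LLM for a differential based on that record excerpt.
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": ("You are assisting with a differential diagnosis. "
                     "List the top three diagnoses, your reasoning, and suggested testing.")},
        {"role": "user",
         "content": f"Conditions: {conditions}\n\nRecent labs: {labs}"},
    ],
)
print(response.choices[0].message.content)
```

The point is less the specific calls than how little glue is needed once the data can leave the EHR.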
Physicians have long derisively complained about patients consulting “Dr. Google,” who often spouted misinformation and quackery. Dr. ChatGPT might occasionally confabulate and isn’t ready to join humans in real-world medical practice. But once it does, expect an omnipresent, obnoxious colleague who virtually always has the right initial diagnosis near the top of its list.
When is medical education going to shift toward teaching how best to use this tool, versus ignoring it and continuing to mandate “least wrong,” no-internet, multiple-choice tests that ignore critical physician tools like statistics for evidence-based medicine and that have become our sorting hat for fields and careers? Likely never.
Have been using ChatGPT (deidentified) for a bit for this purpose. Last week a guy who looked well nourished came in with a hemoglobin of 3.5 and vague neurological symptoms (numbness/tingling). Fifteen seconds after I put in the intern’s HPI, ChatGPT told me it was most likely pernicious anemia. It was; B12 in the gutter. Sure, heme would have called it; sure, I might have gotten there through UpToDate and the usual reasoning. But it’s a disease I rarely see and can only vaguely remember reading about. It simply gave it to me after I put in some basic story and bloodwork.
But to return to EBM and telling a bad study from a good study: ChatGPT is weaker here, BUT GETTING BETTER. It used to make the classic poor-meta-analysis error of assuming the underlying data were all of similar quality and regurgitating that. It’s doing less of that now.
Have been tracking its responses to several questions, and it doesn’t get lost in the weeds the way we do in journal club (yet).
But pathophysiologic reasoning? My god it’s amazing.
And it will likely be incorporated poorly, or not at all, into medical education for decades.
Like any tool, it only works well when deployed properly, and it works better in the hands of someone with experience.