Babylon Failure Inspires Experiments with GPT4

Babylon's AI was described as follows on Twitter/X on Sept 1, 2023:

One response pointed out that LLMs were not around when Babylon was starting in AI. This got me to thinking:

  1. How would an LLM do with a clinical scenario for RUQ pain?
  2. How repeatable in its differential would the LLM be?

The Experiment, Part I

The following is the prompt I used with GPT4 in early September 2023:

Imagine you are an Emergency Physician. You should generate a differential diagnosis with probabilities for the following clinical scenerio. Your output should be a table with three columns, one column with the condition, one column with the probability of that condition (from 0 to 1 with 1 being definate), and one column for the rationale for the condition and its probability. The table should contain 8 conditions. It should be sorted from most probable to least probable. The clinical scenerio is: 72 year old male hx of ESRD on HD, last dialysis yesterday, HTN, DM, HIV (last CD4 count 450 2 months ago). Pt presents with a two day history of intermittent right upper quadrant / chest pain, vomiting, diarrhea. Subjective fever but none measured. + runny nose. + SOB. Vitals are temp 99.1, HR 101, BP 196/76, RR 16, SpO2 95%. Exam reveals diffusely tender abd without definitive murphy or mcburneys sign.

GPT4 gave the following when run three separate times (full-size versions are at the end of the post):

[Screenshots of runs 1, 2, and 3: screencapture-chat-openai-2023-09-02-10_41_19.png, screencapture-chat-openai-2023-09-02-10_42_12.png, screencapture-chat-openai-2023-09-02-10_41_19.png]
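One way to put a number on the repeatability question is to compare the ranked lists from each run. The sketch below is only an illustration, not something I ran as part of this experiment: it computes the mean pairwise Jaccard overlap between runs and checks whether the top three conditions match. The function names are my own, and the three example lists are made-up placeholders, not a transcription of the actual GPT4 tables.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two differentials, ignoring order: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard overlap over every pair of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def top_k_agreement(runs, k=3):
    """True if every run has the same set of conditions in its top k."""
    first = set(runs[0][:k])
    return all(set(r[:k]) == first for r in runs[1:])


# Hypothetical placeholder data: each list is one run, most probable first.
example_runs = [
    ["cholecystitis", "MI", "pneumonia", "PE"],
    ["MI", "cholecystitis", "pneumonia", "gastroenteritis"],
    ["cholecystitis", "pneumonia", "MI", "PE"],
]

print("mean pairwise overlap:", round(mean_pairwise_jaccard(example_runs), 2))
print("same top 3 in every run:", top_k_agreement(example_runs, k=3))
```

Jaccard overlap ignores ordering and the stated probabilities, so it only answers "did the same conditions show up?". Comparing the probabilities themselves would need a different measure.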

GPT4 did pretty well, with cholecystitis, MI, and PNA as the top three. Curiously, it doesn't mention right-sided diverticular disease, which I would think about in someone this age. I would usually consider a CT abdomen and pelvis for a patient like this given their age, so I ran the prompt again, this time asking GPT for initial orders.

The Experiment, Part II

Imagine you are an Emergency Physician. You should generate a differential diagnosis with probabilities for the following clinical scenerio. Your output should be a table with three columns, one column with the condition, one column with the probability of that condition (from 0 to 1 with 1 being definate), and one column for the rationale for the condition and its probability. Include your initial orders for the clinical scnerio after the table. It should be sorted from most probable to least probable. The clinical scenerio is: 72 year old male hx of ESRD on HD, last dialysis yesterday, HTN, DM, HIV (last CD4 count 450 2 months ago). Pt presents with a two day history of intermittent right upper quadrant / chest pain, vomiting, diarrhea. Subjective fever but none measured. + runny nose. + SOB. Vitals are temp 99.1, HR 101, BP 196/76, RR 16, SpO2 95%. Exam reveals diffusely tender abd without definitive murphy or mcburneys sign.

Interestingly, a lipase is not ordered, although a CMP is. Troponin and EKG are reasonable. I'm guessing the ABG/VBG is for DKA, since I mentioned the patient has a history of diabetes. It's interesting that no CT A/P is ordered for this elderly patient.

GPT's second try was:

No lipase again, but overall pretty amazing. It's interesting that GPT is less concerned about pulmonary embolism the second time around, even though it ranked PE as more probable than gastroenteritis in this patient three times in a row.

Perhaps there's an optimal balance in integrating general AI into medical practice with historical local trends, such as how often lipase or d-dimer is ordered for patients of similar demographics. The subtle challenge with LLMs in this context may be detecting what's missing rather than what's present, as seen with the omission of lipase.
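As a rough sketch of what "detecting what's missing" could look like, the snippet below compares the orders an LLM proposed against a checklist of orders commonly placed for similar patients. The checklist here is hypothetical (I have no local ordering data in this post), the function names are my own, and the LLM order list contains only the orders I named above rather than the full output.

```python
def normalize(order):
    """Lowercase and trim so 'Lipase' and ' lipase ' compare equal."""
    return order.strip().lower()


def missing_orders(llm_orders, expected_orders):
    """Return the expected orders that the LLM did not propose, in checklist order."""
    proposed = {normalize(o) for o in llm_orders}
    return [o for o in expected_orders if normalize(o) not in proposed]


# Orders named in this post from the first Part II run (not the full list).
llm_orders = ["CMP", "Troponin", "EKG", "ABG/VBG"]

# Hypothetical checklist standing in for historical local ordering trends.
expected_orders = ["CMP", "Lipase", "Troponin", "EKG", "CT A/P"]

print(missing_orders(llm_orders, expected_orders))
```

Exact string matching is the weak point here: "CT abdomen/pelvis" and "CT A/P" would not match, so a real version would need a mapping of synonyms.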


The three tables from the top of the post, at full size for easier reading: