What does the AI doctor value?
Medicine is pluralistic. Autonomy, beneficence, nonmaleficence, and justice routinely conflict, and good clinicians navigate those tensions one patient at a time. We gave 50 clinician-verified ethical dilemmas to 12 frontier language models and 20 physicians, and asked each to pick a side. Every model has its own value priorities. A patient usually sees only one of them.
What we found
Ask a model the same case ten times and you mostly get the same answer back. Eleven of twelve frontier models have a median decision entropy of zero across our cases.
Most models hold value priorities no more unusual than those of a typical physician on our panel. In that narrow sense, they behave like a reasonable individual physician.
Physician consensus puts about 44% of its weight on patient autonomy. Three frontier models (GPT 5.2, Grok 4, and Perplexity Sonar Pro) put between 6% and 13%.
The four principlist values
Principlism is the framework most widely used in medical ethics. It organizes clinical reasoning around four prima facie obligations that routinely conflict in practice. Each of our 50 cases is built so that any choice promotes some of these values at the cost of others.
Respect the patient’s right to understand their situation and decide for themselves.
Act for the patient’s good. Take deliberate steps to promote their well-being.
Avoid causing foreseeable harm. Sometimes the safest action is to do less, not more.
Treat people fairly. Weigh systemic considerations beyond the patient in front of you.
Every model commits to its own ethical priorities
Each radar shows what a decision-maker actually prioritized across the 50 cases. A larger area on an axis means the model consistently picked the action that promoted that value, even at the cost of others. The dashed line is the physician panel’s consensus profile, anchored at 44.4% on autonomy. Click any model to compare its profile directly to the physicians.
Model
Most models look like a reasonable physician. Three don’t.
Physicians don’t agree with each other either. The gray curve is the natural range of that disagreement: how far each individual physician’s value profile sits from their peers, smoothed across 10,000 leave-one-out bootstrap iterations. The pink band on the right marks the 95th percentile, the point past which a physician would count as an outlier on the panel.
Each frontier model is plotted at its observed divergence from physician consensus. Models in the gray zone hold value priorities no more unusual than those of a typical individual physician. Three models land in the pink zone: GPT 5.2, X-AI Grok 4, and Perplexity Sonar Pro. They deviate from the physician panel more than nearly every physician on it. Hover any marker for its 95% confidence interval, or click to open its value profile above.
The risk of a deployment monoculture
Each frontier model is individually consistent and value-committed. Taken as a group, though, the twelve models span a range of priorities comparable to a panel of practicing physicians. The catch is that patients don’t meet the group. They meet whichever single model is deployed at the point of care.
One model, one stance
Each model gives nearly identical answers across repeated queries and phrasing variations. Deployed at scale, it amplifies its own value priorities to every patient it serves.
The ecosystem is plural
Value diversity across the twelve frontier models is statistically indistinguishable from the physician-to-physician diversity on our panel. There is no algorithmic monoculture at the frontier itself.
Pluralism is a design choice
Multi-model juries, steerable models that adapt to patient values, and explicit audits of deployed defaults all point toward a way forward. Designing them well is a substantive open problem.