From the preprint

What does the AI doctor value?

Medicine is pluralistic. Autonomy, beneficence, nonmaleficence, and justice routinely conflict, and good clinicians navigate those tensions one patient at a time. We gave 50 clinician-verified ethical dilemmas to 12 frontier language models and 20 physicians, and asked each to pick a side. Every model has its own value priorities. A patient usually sees only one of them.

50 clinician-verified dilemmas · 12 frontier language models · 20 physicians · 4 principlist values

What we found

Consistency

11 of 12

Ask a model the same case ten times and you mostly get the same answer back. Eleven of twelve frontier models have a median decision entropy of zero across our cases.
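To make "decision entropy" concrete, here is a minimal sketch in Python. It assumes the entropy is the Shannon entropy (in bits) of the answers a model gives to one case across repeated runs, and that the reported figure is the median over cases; the preprint's exact definition may differ. The case names and answer lists below are hypothetical toy data, not results from the study.

```python
import math
from collections import Counter
from statistics import median


def decision_entropy(choices):
    """Shannon entropy, in bits, of the empirical answer distribution for one case.

    Assumption: each element of `choices` is the option picked on one repeated run.
    """
    counts = Counter(choices)
    total = len(choices)
    # p * log2(1 / p) is the usual entropy term, written so that a fully
    # consistent model yields exactly 0.0.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical toy data: ten repeated runs on three cases.
runs = {
    "case_01": ["A"] * 10,          # same answer every time
    "case_02": ["A"] * 9 + ["B"],   # one run flips
    "case_03": ["B"] * 10,          # same answer every time
}

per_case = {case: decision_entropy(answers) for case, answers in runs.items()}
median_entropy = median(per_case.values())
print(per_case)
print("median decision entropy:", median_entropy)
```

A median of zero under this definition means that on at least half of the cases the model never changed its answer, even if a minority of cases (like the hypothetical `case_02`) show some variation.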

Calibration

9 of 12

Most models hold value priorities no more unusual than those of a typical physician on our panel. In that narrow sense, they behave like a reasonable individual physician.

Autonomy gap

44% → 6%

Physician consensus puts about 44% of its weight on patient autonomy. Three frontier models (GPT 5.2, Grok 4, and Perplexity Sonar Pro) put between 6% and 13%.

The four principlist values

Principlism is the framework most widely used in medical ethics. It organizes clinical reasoning around four prima facie obligations that routinely conflict in practice. Each of our 50 cases is built so that any choice promotes some of these values at the cost of others.

A Autonomy

Respect the patient’s right to understand their situation and decide for themselves.

B Beneficence

Act for the patient’s good. Take deliberate steps to promote their well-being.

N Nonmaleficence

Avoid causing foreseeable harm. Sometimes the safest action is to do less, not more.

J Justice

Treat people fairly. Weigh systemic considerations beyond the patient in front of you.

Figure 2 · Value profiles

Every model commits to its own ethical priorities

Each radar shows what a decision-maker actually prioritized across the 50 cases. A larger value on an axis means the decision-maker consistently picked the action that promoted that value, even at the cost of others. The dashed line is the physician panel’s consensus profile, anchored at 44.4% on autonomy.
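One simple way to turn a set of decisions into a four-axis profile is sketched below in Python. It is an illustration under stated assumptions, not the preprint's estimator: it assumes each chosen option is tagged with the values it promotes, counts how often each value is promoted, and normalizes the counts to sum to 1. The function name `value_profile` and the toy decisions are hypothetical.

```python
from collections import Counter

VALUES = ("autonomy", "beneficence", "nonmaleficence", "justice")


def value_profile(decisions):
    """Share of promoted-value tags falling on each principlist value.

    Assumption: `decisions` holds one tuple per case, listing the values
    promoted by the option the decision-maker chose in that case.
    """
    counts = Counter(v for promoted in decisions for v in promoted)
    total = sum(counts[v] for v in VALUES)
    if total == 0:
        return {v: 0.0 for v in VALUES}
    return {v: counts[v] / total for v in VALUES}


# Hypothetical toy data: four cases, not drawn from the study.
toy_decisions = [
    ("autonomy",),
    ("beneficence", "nonmaleficence"),
    ("autonomy",),
    ("justice",),
]

profile = value_profile(toy_decisions)
print(profile)
```

Under this counting scheme, a weight such as 44.4% on autonomy would mean that a little under half of all promoted-value tags across the chosen options fall on autonomy. A model-based estimate of value weights would be a reasonable alternative, and the preprint may well use one.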

A Autonomy · B Beneficence · N Nonmaleficence · J Justice

Figure 3 · Calibration to physicians

Most models look like a reasonable physician. Three don’t.

Physicians don’t agree with each other either. The gray curve is the natural range of that disagreement: how far each individual physician’s value profile sits from their peers, smoothed across 10,000 leave-one-out bootstrap iterations. The pink band on the right marks the 95th percentile, the point past which a physician would count as an outlier on the panel.

Each frontier model is plotted at its observed divergence from physician consensus. Models in the gray zone hold value priorities no more unusual than those of a typical individual physician. Three models land in the pink zone: GPT 5.2, Grok 4, and Perplexity Sonar Pro. They deviate from the physician consensus more than nearly every physician on the panel does. Each marker carries a 95% confidence interval.
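The calibration check described above can be sketched roughly as follows in Python. Several pieces are assumptions made for illustration: the divergence metric (total variation distance between profiles), the resampling scheme (hold out one physician, resample the remaining peers with replacement, compare against their mean), and all the numbers in the toy panel. The preprint's actual procedure may differ in each of these.

```python
import random
from statistics import quantiles


def mean_profile(profiles):
    """Component-wise mean of a list of equal-length value profiles."""
    n = len(profiles)
    return [sum(p[i] for p in profiles) / n for i in range(len(profiles[0]))]


def divergence(p, q):
    """Total variation distance between two profiles (an assumed metric)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def loo_bootstrap_divergences(physicians, iterations=10_000, seed=0):
    """Distance of a held-out physician from a bootstrap resample of peers."""
    rng = random.Random(seed)
    out = []
    for _ in range(iterations):
        held_out = rng.randrange(len(physicians))
        peers = [p for i, p in enumerate(physicians) if i != held_out]
        resampled = rng.choices(peers, k=len(peers))
        out.append(divergence(physicians[held_out], mean_profile(resampled)))
    return out


def outlier_threshold(divergences, pct=95):
    """The pct-th percentile of the physician-to-peers divergences."""
    return quantiles(divergences, n=100)[pct - 1]


# Hypothetical toy panel of 20 physicians; order is (A, B, N, J).
panel = [[0.44 + s, 0.20 - s, 0.20, 0.16]
         for s in ((k - 9.5) / 200 for k in range(20))]
consensus = mean_profile(panel)

threshold = outlier_threshold(loo_bootstrap_divergences(panel, iterations=2_000))

# Hypothetical toy model with very low autonomy weight, not a real result.
toy_model = [0.10, 0.45, 0.30, 0.15]
model_divergence = divergence(toy_model, consensus)
print("threshold:", round(threshold, 3), "model:", round(model_divergence, 3))
print("outlier" if model_divergence > threshold else "within physician range")
```

The design choice worth noting is that the threshold comes from physician-to-physician disagreement itself, so a model is flagged only when it is more unusual than nearly every physician on the panel, not merely when it differs from the consensus.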

What this means

The risk of a deployment monoculture

Each frontier model is individually consistent and value-committed. Taken as a group, though, the twelve models span a range of priorities comparable to a panel of practicing physicians. The catch is that patients don’t meet the group. They meet whichever single model is deployed at the point of care.

One model, one stance

Each model gives nearly identical answers across repeated queries and phrasing variations. Deployed at scale, it carries the same value priorities to every patient it serves.

The ecosystem is plural

Value diversity across the twelve frontier models is statistically indistinguishable from the physician-to-physician diversity on our panel. There is no algorithmic monoculture at the frontier itself.

Pluralism is a design choice

Multi-model juries, steerable models that adapt to patient values, and explicit audits of deployed defaults all point toward a way forward. Designing them well is a substantive open problem.
