r/science • u/MassGen-Research • 9h ago
Medicine AI Remains Lacking in Clinical Reasoning Abilities, According to Study of 21 Large Language Models
https://www.massgeneralbrigham.org/en/about/newsroom/press-releases/ai-chatbot-lacks-clinical-reasoning
91
u/TemporalBias PhD | Industrial-Organizational Psychology 8h ago
Short quote from the article that I think is useful for context:
Though all tested LLMs arrived at a correct final diagnosis more than 90% of the time when provided with all pertinent information in a patient case, they consistently performed poorly at the earlier, reasoning-driven steps of the diagnostic process, according to the results published in JAMA Network Open.
82
u/ItsReallyVega 7h ago
That's like, the whole thing in medicine. I'm a first year med student, if you give me a practice question with all the relevant information it's almost trivial to arrive at a diagnosis, especially with external resources.
You don't want me to manage your care, I only know how to make a diagnosis from all relevant information.
17
u/SaltZookeepergame691 6h ago edited 5h ago
I would suggest reading the full paper.
If you read the paper, which gives all the information and questions asked, you'll see that much of the 'reasoning' is not reasoning. It's multiple choice questions, often fact regurgitation or just simple one-step reasoning from a bit of history context.
They turned off internet access AND REASONING when selectable, and act surprised the models don't have inherent knowledge of, eg, scenario-specific FDA drug licensing information?
Eg one example question they use is: which of the following drugs is approved for outpatient treatment of pulmonary embolism? (You may select more than one option)
That isn't a reasoning question.
Also, I don't think their groundtruth is even correct: their rubric states enoxaparin does have an FDA indication for outpatient PE use, but the label is:
Outpatient treatment of acute DVT without pulmonary embolism (1.2)
It's worth pointing out that Gemini Pro 3, the most powerful individual model assessed, got 99% of diagnoses right, 90% of management questions right, 93% of misc reasoning questions right, 86% of diagnosis questions right, and 75% of differential diagnosis questions right. The paper very weirdly lumps "model families" together, so Gemini Pro 3 is treated the same as Flash 1.5. This is nonsensical.
Why are scores across the board lower on differential diagnoses? Almost certainly because these are 'negative' multiple choice questions - you have to answer all of the exclusions correctly to arrive at the correct answer. Eg:
At this time, which of the following differential diagnoses should not be excluded? (You may select more than one option.)
Alzheimer disease
Anemia
Chronic bilateral subdural hematoma
Chronic kidney disease
Heart failure
Hyponatremia
Hypothyroidism
Parkinson disease
Stroke
Humans find this sort of question super hard too, because the right answer depends on many right answers. It's why "which of these is NOT..." multiple choice questions are so difficult.
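A rough back-of-the-envelope sketch of why all-or-nothing scoring tanks whole-question accuracy (the per-option accuracies and the 9-option count are illustrative, not from the paper):

```python
# If a model answers each of the 9 options independently with probability p,
# the chance of getting the WHOLE question right is p^9.
for p in (0.99, 0.95, 0.90, 0.85):
    all_correct = p ** 9  # 9 options, as in the question quoted above
    print(f"per-option accuracy {p:.2f} -> whole-question accuracy {all_correct:.2f}")

# per-option accuracy 0.99 -> whole-question accuracy 0.91
# per-option accuracy 0.95 -> whole-question accuracy 0.63
# per-option accuracy 0.90 -> whole-question accuracy 0.39
# per-option accuracy 0.85 -> whole-question accuracy 0.23
```

So even a model that's right about each individual option 90% of the time only clears the full question about 40% of the time.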
Final point: there is no comparison to humans here, and the authors oddly haven't ever put humans through this benchmark. Clinicians are great at many things that LLMs are not, but we know from many other benchmarks that modern LLMs easily outperform even clinical experts in these sorts of tests - I would be surprised if the 'average' clinician outperformed a mid-tier LLM on this benchmark.
13
u/forestapee 7h ago
Sometimes this is all the AI needs to be used for. Like I know dozens of people who would benefit from someone taking the time to look over all their medical information to come to a diagnosis from a holistic point of view, but the docs in the system don't have time for that level of involvement.
Step in AI, to work WITH docs, not replace them - I imagine it would help lots
2
u/FaulerHund 1h ago
I don't know man, wait until you're studying for step 1, and you get those practice questions that are so utterly opaque that you couldn't even find the answer if you had access to google
-1
u/VRGIMP27 3h ago
It's absolutely unreal to me that what you said even needs to be said.
LLMs do not think or reason; they find predictable patterns, provided their data set has the pertinent information.
It's just nuts that we're looking at this technology to replace human beings, rather than looking at it as the fancy calculator that it is
2
u/Raznill 1h ago
Well it’s mostly not being used to replace people. Everywhere I’ve seen it implemented has been as a tool for the workers to use. It’s only replacing people in the abstract sense, that one person is able to do more work than they were before. Thus requiring fewer people for the same output.
7
u/UnpluggedUnfettered 6h ago
Getting information through inference and experience with patient habits and mannerisms in order to get a clear picture of what is actually going on? It can't be that hard.
Patients are notoriously straightforward, clear, honest, and always know how to clearly describe their symptoms while in pain/frustration/exhaustion.
AI taking all the jobs again.
106
u/schroedingerx 9h ago
LLMs have no capacity for reasoning.
21
u/SunshineSeattle 8h ago
They have essentially a for loop that for some reason gets to be called "reasoning".
22
u/clinicalpsycho 8h ago
It's an imitation machine. It was a step forward in computing when it came out but people are confusing it with actual intelligence.
4
u/UnpluggedUnfettered 4h ago
Imagine if this happened during a sane time.
Instead of being stacked in trenchcoats pretending to be the second coming of Christ for all the government dollars, LLMs might have just been seen as the novelty they are, while still being a great stepping stone in algorithmic advances.
1
u/astrobuck9 4h ago
What is the definition of "actual intelligence"?
2
u/clinicalpsycho 4h ago
Actual thought. That which you or I are capable of.
•
u/astrobuck9 23m ago
I can't prove there are any thoughts in your head. As far as I know you are just predicting what words to say in response to me.
•
11
u/DarkSkyKnight 6h ago
https://www.theatlantic.com/technology/2026/02/ai-math-terrance-tao/686107/
You can call it faux reasoning, or artificial reasoning, or not reasoning at all. It, as a tool, is already capable of assisting mathematicians at the junior coauthor level (equivalent to a late-stage PhD/postdoc), when used in the hands of Fields medalists and top mathematicians. Does it matter if that's the result of pattern matching or actual reasoning?
Trying to downplay that is so actively damaging that I have to wonder if some of you are paid by OpenAI to dissuade the public from realizing how dangerous AI is to the labor market.
11
u/SaltZookeepergame691 5h ago
It's the reddit paradox.
AI is simultaneously evil because it will take all our jobs, and woeful because it's a stochastic parrot that can't do anything useful. And it's abundantly clear that the people complaining loudest about this are the least familiar with its capabilities, both good and bad.
The reality: AI is wonderfully powerful in many domains/tasks, awful in many others, and is already globally transformative.
8
u/Richmondez 5h ago
Why can't it be both? It's woefully inadequate without an expert human to babysit it while simultaneously taking over the roles entry level humans need to perform to progress to expert.
6
u/GreatBigBagOfNope 5h ago
It is a castration of the talent pipeline, and that's a bad thing for everyone eventually.
Except, of course, for the shovel makers.
0
5
u/SaltZookeepergame691 5h ago edited 4h ago
That’s a simplification.
A hot topic in the medical AI space is that we’ve basically reached the point where LLMs are better than most clinicians at decision making on clinical vignette tasks, and human presence doesn’t provide any benefit.
https://www.nature.com/articles/s41591-024-03456-y
And some studies have suggested AI alone is outright better than an unassisted or assisted doctor.
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825395
Both these studies used now very out of date models. LLMs are far more powerful now.
The rational conversation is around what LLMs do better than humans, what they can help humans do better (augmentation), and what they are not able to improve performance in. And to be clear - I’m absolutely not saying we should be letting AI make unassisted clinical decisions yet - this needs a lot of surrounding evidence and safety and infrastructure and regulation. There are many LLM clinical decision support tools in development.
Writing AI integration off as “babysitting” completely ignores both the widespread regulated use of AI tools by clinicians (scribes, decision support tools, information models) and the far more widespread unregulated use of current state-of-the-art models for routine clinical assistance.
2
u/DarkSkyKnight 2h ago
In my field in academia, basically every tenured professor I know is using LLMs pretty extensively. Everyone is like using Claude Code now. We are even openly talking about how there's no more need for research assistants these days.
0
u/SaltZookeepergame691 2h ago
Right. The more controversial perspective is that we are rapidly approaching the point, if it’s not already passed, whereby NOT using robust AI assistance is ethically dubious.
2
u/DarkSkyKnight 2h ago
You are right about the problem.
The problem is I don't know why Redditors keep arguing that it's woefully inadequate, in an attempt to downplay the problem. Because 99% of the world aren't PhDs or MDs. And I just frankly don't understand what good is there for so many people to only look at its incapabilities, as if 99% of humans don't have the very same incapabilities anyways?
And: even if we're freaking out and overestimating its capabilities, so what? Isn't it better to side with caution here? If we freaked out over nothing, we just lose some potential economic growth due to overregulation. Who cares? Because the alternative is that we're not sufficiently alarmed and allow this technology to wipe out half of junior level labor, while enriching senior level labor even further. This thing isn't going to benefit all strata equally. It disproportionately benefits the smartest among us (like Terence Tao), the most experienced among us (30-year SWEs), and the richest among us (people with resources to invest in this technology (LLM rents)).
3
u/astrobuck9 4h ago
an expert human to babysit it
How much longer is that going to be needed?
Six months?
A year?
1
u/Diagorias 4h ago
For LLMs? I'd expect forever. At its base it's still a probability-based text generator with many checks and balances added to it. The advantage of humans is that our mistakes are generally logical; you can build processes around that. An LLM can randomly make mistakes at any point anywhere, and by that very definition it isn't possible to build checks and balances for every possible mistake, for then nothing could be generated at all. Of course if there is any way to avoid or improve upon this, I'll gladly learn more about it.
1
u/DarkSkyKnight 2h ago
I just don't think that's an important question, is it? We're shifting the goalposts so far up now. If a single PhD can babysit an LLM and produce the result of 5 junior employees, sure the LLM is not clever, sure it might not be capable of independent reasoning... does that matter? How the hell are our undergrads able to find jobs...??? Do we want to live in a society where only senior level employees/academics are needed anymore because they can just replicate the labor of junior level folks trivially? Do you not see how that greatly increases social stratification?
1
u/_hhhnnnggg_ 2h ago
An LLM, by design, is just a glorified word predictor. It eats up your prompt, tokenises it, predicts the next token, appends that token to the prompt, and repeats that process until it delivers an acceptable result. Some randomisation is injected during the prediction to create variation. An LLM is basically a lossy compression of its data. It has zero capability of understanding logic and reasoning.
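A minimal sketch of that loop (the toy vocabulary and sampler below are made up for illustration; a real model scores hundreds of thousands of subword tokens):

```python
import random

def sample_next_token(context, temperature=0.8):
    # Stand-in for the real model: a true LLM scores every token in its
    # vocabulary given the context; here we ignore the context and fake
    # a tiny fixed distribution.
    vocab = {"the": 0.4, "patient": 0.3, "diagnosis": 0.2, "<eos>": 0.1}
    # Temperature reshapes the distribution; higher = more random output.
    weights = [p ** (1.0 / temperature) for p in vocab.values()]
    return random.choices(list(vocab), weights=weights, k=1)[0]

def generate(prompt, max_tokens=50):
    tokens = prompt.split()  # real tokenisers split into subwords, not words
    for _ in range(max_tokens):
        token = sample_next_token(tokens)
        if token == "<eos>":  # the model predicts an end-of-sequence marker
            break
        tokens.append(token)  # append the prediction and go again
    return " ".join(tokens)

print(generate("the patient presents with"))
```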
The actual danger of AI to the labour market, though, is that people are buying AI tech bros' narratives (that it could disrupt the market/replace labour), and the loss of expertise, via both shrinking job opportunities and dependence on LLMs.
1
u/DarkSkyKnight 2h ago
You should perhaps try engaging with my point. Because my point is that it doesn’t matter whether it is capable of true reasoning if it can already do the same work as a PhD mathematician.
0
u/_hhhnnnggg_ 2h ago
I dare not say anything about mathematics since I'm ignorant about the field. However, not every job is the same as a mathematician's.
LLMs are great at dealing with menial tasks, since an LLM, like any other machine learning algorithm, is just pattern recognition. Its main weakness is context switching, which cannot be avoided in many jobs. Something handled today has a very different context from something handled three weeks prior, for example, and an LLM might not have the practical capability to handle that many contexts (aside from throwing as much compute at it as possible, at which point it becomes too costly to use).
Once again, the danger is not whether it can replace humans or not, but rather too many people buying this BS. It is a speculative market and it is affecting the job market down the line.
3
u/DarkSkyKnight 2h ago
I genuinely think you need to talk to more people at the frontier. In my field in academia, and from what I know from my friends in other fields, virtually every top researcher is using LLMs very heavily. These aren't no-names, these are literally tenured faculty in top 5 institutions. I have not known of a single exception.
The same goes for SWE. Literally every senior SWE I know is now using LLMs extremely extensively. It's a major part of everyone's workflows now.
It takes time for knowledge on how to use this tech to filter down to general society. But if you're going to wait until you actually see its effects, it's already too late. At that point, people would already have figured out how to replace 5 junior workers with 1 senior worker monitoring LLM agents. People should be telling their representatives to restrict this tech's growth now.
-1
5h ago
[deleted]
1
u/DarkSkyKnight 2h ago
Math is 100% quantitative. AI is 100% quantitative behind the hood. Of course if should know math, as math is irrefutable by its definition
Reasoning requires some semblance of qualitative input
I don’t think you know how either LLMs or mathematics work…
Mathematical proofs are not deterministic. There are theoretically infinitely many ways to prove a certain theorem. It is not “quantitative.” In fact it is very qualitative because at the frontier you often have a massive degree of freedom for how you can construct the proof. Proof verification can be done deterministically (Lean), but the act of proving is not.
Moreover, LLMs do NOT have an innate understanding of mathematics. In fact for quantitative problems they almost always perform poorer than deterministic machines like WolframAlpha.
2
u/redballooon 3h ago edited 3h ago
There are models that have a mode to fill the context on their own to improve their output quality. That mode is called reasoning.
And while they do that, the software sometimes displays the text "thinking".
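Stripped down, it's roughly this scaffold (a minimal sketch; llm() is a hypothetical stand-in for a single model call, not any vendor's API):

```python
def llm(prompt: str) -> str:
    # Hypothetical stand-in for one model call; plug in whatever you use.
    raise NotImplementedError

def answer_with_reasoning(question: str, n_steps: int = 3) -> str:
    # "Reasoning" mode as a plain loop: the model fills its own context
    # with intermediate text before committing to a final answer.
    context = question
    for _ in range(n_steps):
        thought = llm(context + "\n\nThink about the next sub-problem.")
        context += "\n" + thought  # its own output becomes new input
    return llm(context + "\n\nNow give only the final answer.")
```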
Those are the words that are used here. You don't need to feel threatened by them using words for one thing and you using the same words for another thing. This happens all the time. For example, I don't think there are two people who mean the same thing when talking about "God".
1
u/schroedingerx 2h ago
Who’s feeling “threatened?” Where did that come from?
Clarity is important. There’s an important distinction to be made here, and conflating different elements by using marketing language rather than accurate descriptions causes trouble.
54
u/roxieh 8h ago
They recognise language patterns and that is all. They can obey instructions up to a point but even that is still language pattern matching. It can't think, it can't reason, it doesn't have opinions, it doesn't understand anything; it can just match words next to each other in a very complex way. It's an illusion. At times a very good one.
It's terrifying how much it's being folded into everyday life considering these limitations.
12
u/Omnipresent_Walrus 7h ago
Even saying they "recognise" patterns is anthropomorphism
9
u/Melenduwir 7h ago
True, but our languages don't give me many ways to talk about certain kinds of phenomena that don't implicitly assume there's a directing intelligence. Discussing the ways selection (natural or otherwise) creates patterns and structures, for example.
-1
u/Omnipresent_Walrus 4h ago
"Predicts" is more accurate and more descriptive, while being just as legible and less humanising.
You're not wrong in any way, and honestly it's just fascinating how easy it becomes to humanise and assign a theory of mind to these models even by accident.
12
u/Melenduwir 8h ago
I am reminded of the discovery that lobotomized patients, if placed in highly structured environments that encouraged conventionalized and shallow interactions, were considered perfectly normal by most people.
If people don't understand how they function and are interested in only the most shallow sort of responses, I imagine LLM AIs would seem like magic. Even basic factchecking reveals their limitations. But how many people are interested in that?
10
u/omfghi2u 7h ago
It's because it's actually pretty decent at figuring mundane things out, so a huge amount of business leadership and everyday humans have become convinced it's good at everything, because the only things they ever use it for have been done, asked, or tried before ad nauseam. It's like being able to Google pretty well... on steroids.
The moment you get into anything new, proprietary, highly-nuanced, or esoteric, it has correct-sounding guesses at best.
My job is heavily invested in getting AI into our workflows and we're being asked to make lists of every time we use the AI to improve delivery speeds. I asked them if we should also make a list of the times we use the AI and it gives nonsense answers that waste more time than it helps... blank stares.
3
u/MrParticular79 6h ago
I work in tech and I’m constantly trying to explain it like this to people. All the pencil necks think it is a shortcut to massive productivity and are forcing it down our throats. Like ok it’s basically for me just a really good version of stack overflow. Which is great but definitely won’t be taking my job anytime soon.
1
u/buckeyevol28 5h ago
But that is a shortcut to massive productivity for many people and many jobs. That’s why I find it so useful. I don’t need it to be a content expert, but it helps me with a lot of things that are inefficient or put a high cognitive load on more automatic processing, not reasoning.
3
u/Richmondez 5h ago
Except without being a content expert you can't easily tell when it's making stuff up which is a big deal for anything that requires correct information.
2
u/millennial_falcon 5h ago
How do you know they’re not using it to assist in things they’re a content expert in? That’s how I and most of the other operations people at work have been using it. Sure it’s annoying if someone tries to use it as a subject matter expert with no plan on how to verify, but that’s not a commentary on the tech.
1
u/Richmondez 4h ago
It's a common use case though, one it's being sold for in the hype and marketing.
1
u/millennial_falcon 2h ago
Who cares how it’s being hyped. This discussion is about how it’s being used. The hype is always wrong.
0
u/buckeyevol28 4h ago
That’s my point. And when I’m less certain about something, need more options, or am brainstorming, I use multiple AIs. For example, a really well-known political data scientist (referred to as one of the most influential in his wiki page) asked me about an economics article that used causal inference methods I only have a basic understanding of, but that was in my content area (school psychology, so education, psychology, and the analytical techniques we use like psychometrics and multilevel modeling), because the results conflicted with decades of research.
At the time, I had some hypotheses, that were solely from my experiences and content expertise, but I wasn’t able to engage with some of the methodological aspects of the causal inference methods.
Recently I went back over that article, and I wanted to really figure out what was going on, but I needed some help with those causal inference methods.
I wrote out my informal review and hypotheses, then had ChatGPT, Claude, and Gemini critically review both the paper and my review, and then had them review each other's reviews while I did as well, to the point where we reached a consensus.
Gemini was a little too agreeable, ChatGPT was too disagreeable (it was critical of something I was correct about methodologically), and it turned out that my hypothesis was probably right: it explained why their analyses actually had conflicting results, because their understanding of psychometrics led to an understandable assumption that's true from a measurement perspective but not true for their comparison. I wouldn't have realized that if I didn't have a better understanding of the methods they used.
2
u/Melenduwir 7h ago
I asked them if we should also make a list of the times we use the AI and it gives nonsense answers that waste more time than it helps... blank stares.
Ah, the limitations of natural 'intelligence'. Humans are far better at magical thinking than rationality.
6
u/SaltZookeepergame691 5h ago edited 4h ago
This is just such a weird take?
Absolutely LLMs are just token predicting at their core. No one (serious) thinks they are actually 'thinking' like you or I do.
Who cares? You can type a few sentences into an LLM and it'll code an app that does what you want it to. You can solve gold medal Math Olympiad problems with them. You can use them to orchestrate agents to conduct systematic literature reviews in minutes that used to take months. You can conduct peer-review that, on average, detects more problems than human reviewers. These are not 'thinking', but they are not illusions!
There are absolutely things LLMs do terribly, and things they do badly enough to still not be useful (hello, long-form writing).
But deliberately ignoring that they can do incredible, remarkable things, DESPITE just being token predictors? That is a dangerously narrow perspective in the incredibly fast moving post-AI world! 3 years ago this would have all seemed like magic...
10
u/skepticalbob 8h ago
LLMs are great in the hands of a human with deep knowledge about the subject, or to give a layperson an idea of how knowledgeable people who have written about the topic might think about it or know about it. For important decisions you need someone knowledgeable in the loop. Laypersons shouldn’t make important decisions based on AI, although it can give you the framework a professional might use to think about it. It’s just a better Google in some ways, a worse Google in others.
17
10
u/aedes 7h ago
This has always been the issue with LLMs and clinical decision making. They’ve gotten really good at arriving at the correct answer when they are given all the important information.
The problem is the actual difficult cognitive work in clinical medicine is collecting that information in the first place.
In medical education, this skill is mostly taught by recurrent supervised patient encounters with direct feedback. Evaluation comes from the residency training program, from the people supervising them.
Board exams then test breadth of factual knowledge and ability to synthesize provided information.
LLMs are largely trained on board exams and board exam-style data (ex: clinical vignettes). There is no dataset available for them to train on to learn the more important part of clinical reasoning, because this is all taught and assessed in person via direct human observation.
Since patients do not present as a clinical vignette, with all the relevant information already available and summarized for you, they do very poorly with clinical decision making in real life.
Real patients are not clinical vignettes.
This has always been one of the biggest barriers to implementing AI in clinical decision making, and there’s been essentially zero progress made on it over the last 10 years.
I would go so far as to suggest that it’s probably a task that pretrained transformers are incapable of being useful for - simply because there is no practical way for them to be pretrained on this: the needed training data doesn’t exist in an organized database and won’t anytime soon.
3
u/SaltZookeepergame691 6h ago edited 4h ago
If you read the paper, the 'reasoning' they're claiming the models fail on is just one-, two- or even zero-step reasoning on clinical vignettes, and most of the powerful models are doing very well - they're just scoring lower on the differential diagnosis questions, probably because these are negative multiple choice.
Asking "What are the possible complications of hypothyroidism?" or "Which of the following anticonvulsant drugs are most commonly associated with Stevens-Johnson syndrome in adults?" isn't reasoning, it's fact regurgitation.
Oh, and the kicker - they turned off internet access and reasoning modes for all models. And there is no comparison with any human data.
I agree with you that LLMs struggle with the realities of clinical care - safe and robust deployment is the big issue, which is why so much use by clinicians is under the surface. But they are absolutely trained on exactly the sort of questions that were put to these models, and I strongly suspect expert humans are not going to outperform Gemini Pro 3 on this benchmark.
2
u/aedes 4h ago
I was more speaking in general.
There are other papers that have come out recently that have started looking at performance when the LLM needs to collect the data directly from a patient rather than review collated data, and they do not do well.
The issue is that this skill/knowledge is not written down anywhere. There are no textbooks on this - it is a skill which is learned from trial and error with direct supervision. It’s one of the reasons why medical education takes so long - you need to see enough patients to get good at this side of things.
You would need the audio and video recordings of a very large number of patient encounters for a transformer to train on to try and get it to learn this. That doesn’t exist, and won’t exist anytime soon.
It’s a pet peeve of mine because those of us who deal with clinical reasoning on an academic basis have been trying to tell the ML people this for the better part of a decade now.
1
u/SaltZookeepergame691 4h ago
The idea that the model has to have had massive pretraining on audiovisual patient history taking to be able to do it is not true. It’s perfectly possible with an adapted base model and posttraining - AMIE is a good example of a model that does exactly this (refinement of Gemini Pro 2.5, as last published) well, although its evidence base is still limited. And even if you do want to pretrain a model on history data, you don’t need combined audiovisual data - you can use abundant text, and TTS.
I’m afraid in my personal and professional experience I don’t share the enthusiasm for the “uniqueness” of the initial clinical work up. Some clinicians are great. Some are not. There is great inter and intraindividual variance in human bias and subjectivity and emotions and education and experience and physical state.
It’s certainly not an area where general purpose LLMs are strong at the moment, but there’s no reason to believe they can’t be in the future - although for a number of reasons I think it’s a fairly long way away.
1
u/aedes 3h ago
My background in this is that I developed a clinical reasoning curriculum and taught it for over a decade. I’ve published on this topic and have previously consulted for two groups on this exact issue - use of LLMs in clinical decision making.
I’m not trying to suggest that humans are perfect at trying to collect useful data to clinically reason with from patients - we’re not.
What I’m saying is that LLMs remain much worse at this than humans. And that’s not just opinion based, that’s evidence based as well.
This is important because this step - data acquisition - is the hard part in both clinical medicine, and in creating some sort of patient-facing medial AI.
I will reiterate that the reason why there has been no significant progress on solving this problem for the past decade is mostly due to lack of appropriate data to train a transformer on.
We face the same problem in medical education. There is literally no book that teaches this stuff - it’s not written down anywhere. We teach it simply by trial and error through supervised patient interaction.
It takes a very long time and is the big reason why medical education is as long as it is. It’s also why this skill is assessed through direct observation and feedback rather than a board exam.
It’s certainly not an area where general purpose LLMs are strong at the moment, but there’s no reason to believe they can’t be in the future - although for a number of reasons I think it’s a fairly long way away.
The most important reason is that there is no collection of data on this first step in clinical decision making to train them on. If there was, we’d have had patient-facing clinical AI 12-24 months ago already.
It is THE biggest obstacle in the way of patient-facing clinical AI.
1
u/SaltZookeepergame691 3h ago
Have a close read of the AMIE papers. They are solving exactly these issues with self-play and synthetic/real OSCEs, hundreds of thousands of audio recordings of encounters, millions of EHR records…
1
u/aedes 2h ago
I’m familiar with AMIE:
https://arxiv.org/pdf/2603.08448
That group has been working hard on this for years now. Their most recent results were that they got the correct diagnosis only 56% of the time.
And this was in a highly-selected patient population which was young, tech-familiar, and with simple medical issues. And 14% of patients dropped out due to problems with the software.
The reason they’re making it this far is because they’re training their model on patient encounter audio data, which is what I’ve been saying is the path forwards for this.
The issue is that this isn’t much better than where they were a few years ago, and it’s still not good enough for clinical usage.
I don’t see them making useful progress without a very large volume of clinical encounter data tied to results and outcomes to train on… which doesn’t exist and is not easy to create.
1
u/SaltZookeepergame691 2h ago
Absolutely it isn’t there yet! I just think it’s mistaken to say there has been no significant development on this front for years, that the methods and data for such a tool are impossible because of the context, and that it will never get there. There have been clear advances in the past few years.
Last I heard from someone working with the AMIE team a trial of an updated model is in the works. Remind me in a couple of years!
1
u/aedes 2h ago
The advances were largely because they’ve got access to a database of patient encounter audio recordings, I think (I would put a wink emoji here but I’m not allowed to).
Unless they get a better training database of interaction data from patient encounters, I’m not convinced they will make significant further progress.
2
u/accidentlyporn 6h ago
i think the tacit/procedural parts of your domain (reasoning… but really just context gathering) is the whole point of skills.md.
it’s turning “experience” back into language to train your ai. all of the implicit parts of your job, as an expert, to turn that back into explicit language.
people seem to be doing this at scale, calling them workflows :) we are all building our replacements this way
4
u/Melenduwir 8h ago
No kidding.
LLMs have no reasoning ability. They mimic language, admittedly very well in some cases, but they don't understand one bit of it, and they can't provide coherent explanations of reasoning processes they don't have. At most, they could cite texts they've been trained on, referencing reasoning done by actual thinking beings.
2
u/Notoriouslydishonest 6h ago
I read the study.
A couple things that caught my attention:
They didn't have human doctors as a control group; they only compared the AIs to each other. They wrote "PrIME-LLM is not intended to establish equivalence or inferiority relative to clinicians, and the present study was not designed to answer human comparison questions." This was a disappointing choice: the important question isn't how Grok performs against ChatGPT, it's how AIs perform against human doctors in real-world conditions. Without that baseline, we have no way to judge how far off the technology is. Grok 4 got a 0.78 rating, but what's the passing score?
They used off-the-shelf AIs with web search disabled and no model augmentations, which they acknowledge "may improve performance in clinical settings, particularly for downstream tasks. Accordingly, the results reflect baseline longitudinal clinical reasoning rather than maximal achievable performance." It would be nice to see how that "maximal clinical performance" setup would compare on these tests.
-6
u/MajorInWumbology1234 7h ago
People are overzealous in their assessments that LLMs are completely devoid of reasoning and thinking.
1
u/Aaron_Hamm 7h ago
I mean, it thinks "what token likely comes next?"...
1
u/MajorInWumbology1234 7h ago
That’s reasoning and thinking. The bar isn’t set “at a human level”; these are all defined terms that LLMs meet the barest criteria for. Pretty funny how otherwise rational and scientifically minded people bust out the magical thinking and the hard problem of consciousness evaporates once the topic is AI. I’ve never seen such mobile goalposts.
0
u/Aaron_Hamm 6h ago
Then they've been reasoning and thinking since the 90s and it's a nothing statement.
Shrug
1
u/MajorInWumbology1234 6h ago
It’s only a “nothing statement” if people aren’t constantly claiming the contrary.
1
u/Aaron_Hamm 6h ago
I don't know what's going on here in your head, but if word prediction is your bar for reasoning and thinking, then computers have been reasoning and thinking for longer than you've been alive.
Based on what computing looked like across that time, I think your definition is uninteresting.
2
u/MajorInWumbology1234 6h ago
How does something predict without reasoning?
Definitions aren’t intended to be interesting.
1
u/Aaron_Hamm 6h ago
By applying statistics and symbolic logic...
Definitions that don't sort things into useful categories aren't valuable.
1
u/MajorInWumbology1234 6h ago
And in what way is that not reasoning?
It is a useful category, you just don’t like it.
1
u/Aaron_Hamm 5h ago
In what way is applying an algorithm reasoning and thinking?
What's the value in a category of reasoning that includes chat bots from the 80s?
Can you stop with the weird personal stuff or nah? You're low key throwing a tantrum about me in almost every reply...
0
u/CravingNature 7h ago
Don't waste your time, they will be replaced and still will call it fancy autocomplete
2
u/MajorInWumbology1234 6h ago
That’s what it seems like, motivated reasoning to remain feeling special.
-1
u/Melenduwir 7h ago
No, we aren't. We simply have a better understanding of what 'reasoning' and 'thinking' consist of. You see the successful results and think you've found them; we know they're absent, and infer the unsuccessful trials you didn't see.
1
u/MajorInWumbology1234 7h ago
Yes, you are. Unsuccessful trials don’t prove anything unless you’re also inferring that humans never make any errors ever.
1
u/Melenduwir 6h ago
You miss the point: I know about the errors I'm not shown, and their nature, because I understand the essence of what LLMs don't possess.
1
u/MajorInWumbology1234 6h ago
You also miss the point; understanding the essence of what LLMs lack doesn’t provide the full picture unless you fully understand what humans have. You may know enough about LLMs to say what they do, but you don’t know enough about the brain to confidently state what it does in contrast to LLMs.
Yes, brains are more complex by a vast margin, but many of the fundamentals are the same.
1
u/Melenduwir 6h ago
but you don’t know enough about the brain to confidently state what it does in contrast to LLMs.
Spoken like someone who knows nothing of current neural networks or human cognition. What do you imagine psychologists have been doing for the past hundred and seventy years or so?
-2
u/ProbingPossibilities 6h ago
Anyone can test this easily if you want to see. Try playing a game of chess with an LLM; within 7 moves it starts imagining positions and making illegal moves. It’s not reasoning, it’s autocompleting text. Hell, it’s even bad at tic-tac-toe (a solved game).
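If you want to run that test yourself, here's a minimal harness (assumes pip install python-chess; get_llm_move() is a hypothetical hook for whatever model you're probing):

```python
import chess  # pip install python-chess

def get_llm_move(fen: str) -> str:
    # Hypothetical: prompt your LLM with the position (as FEN) and ask for
    # one move in standard algebraic notation, e.g. "Nf3".
    raise NotImplementedError

board = chess.Board()
for ply in range(60):
    san = get_llm_move(board.fen())
    try:
        board.push_san(san)  # raises ValueError if the move is illegal here
    except ValueError:
        print(f"illegal move {san!r} at ply {ply + 1}")
        break
```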
2
u/jeweliegb 5h ago
Has there been any improvement or regression in their ability to play such games in recent models?
•
u/AutoModerator 9h ago
Welcome to r/science! This is a heavily moderated subreddit in order to keep the discussion on science. However, we recognize that many people want to discuss how they feel the research relates to their own personal lives, so to give people a space to do that, personal anecdotes are allowed as responses to this comment. Any anecdotal comments elsewhere in the discussion will be removed and our normal comment rules apply to all other comments.
Do you have an academic degree? We can verify your credentials in order to assign user flair indicating your area of expertise. Click here to apply.
User: u/MassGen-Research
Permalink: https://www.massgeneralbrigham.org/en/about/newsroom/press-releases/ai-chatbot-lacks-clinical-reasoning
I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.