A recent study by researchers at the University of Oxford highlights the limitations of large language models, or LLMs, in providing accurate medical advice. Tested on their own, LLMs identified relevant conditions in 94.9% of test scenarios, but human participants using the same models identified the correct conditions less than 34.5% of the time. The study involved 1,298 participants who interacted with LLMs under controlled conditions, and they performed worse than a control group that diagnosed themselves without assistance. This raises concerns about the effectiveness of LLMs in real-world medical settings and suggests that the benchmarks traditionally used to evaluate these technologies may not reflect how they perform once real users are involved.
Why do we care?
This Oxford study is a flashing warning light for real-world deployment of LLMs, especially in mission-critical industries like healthcare, and it has broad implications for IT service providers enabling AI adoption in any vertical.
The core takeaway is damning: LLMs perform well in benchmark tests—but fail when humans actually use them. That disconnect matters deeply in sectors where the model is not the product—the human+AI system is. If users misinterpret, misuse, or overtrust AI, the system fails, regardless of model accuracy in lab conditions.
Accuracy isn’t utility. This study reinforces that human outcomes—not model performance—must be the primary measure of AI success. For IT service providers, that means shifting the focus from selling model access or AI features to designing complete, human-aware AI systems that deliver verified value in practice.
The emerging differentiation is not who can offer AI tools—it’s who can deploy them responsibly, effectively, and with measurable impact on user performance. Providers who treat LLMs as just another app risk missing the deeper challenge—and opportunity—of orchestrating AI as part of a system that includes people, processes, and oversight.

