Faulty AI Tests and Failing Trust: Oxford Study and Kaseya Report Expose the Gap Between Hype and Reality

Written by Dave Sobel

Published on November 11, 2025

A recent study by the Oxford Internet Institute and other organizations finds that only 16% of the 445 benchmarks used to evaluate large language models in natural language processing employ rigorous scientific methods. The findings highlight significant flaws in current evaluation practices, including vague definitions of key concepts like reasoning and a reliance on convenience sampling for data selection. Lead author Andrew Bean emphasized that clear definitions and sound measurement are needed to tell whether AI advancements are genuinely meaningful rather than superficial. The study raises concerns about the reliability of benchmark scores, which underpin many claims about AI capabilities, including those OpenAI has made about its GPT-5 model.

Small and medium-sized businesses (SMBs) are approaching artificial intelligence with caution in their cybersecurity strategies, despite the ongoing threat posed by human error. According to research conducted by Kaseya, only 12% of businesses trust AI to operate autonomously, and 18% do not use AI at all to enhance security. The primary concerns holding back adoption are accuracy and data privacy. Human error remains the leading vulnerability, driven by inadequate training and poor user practices. Kaseya’s findings indicate that phishing is the most prevalent cyber threat, affecting 56% of respondents, with nearly half experiencing incidents in the past year. Only 40% of businesses have a regularly tested incident response plan, leaving many unprepared for breaches that can lead to significant operational and financial losses. Cybersecurity remains the top investment focus for 52% of respondents.

Why do we care?

Let’s cut through the noise: AI benchmark numbers? Mostly garbage.

Oxford’s researchers found only 16% of tests for large language models actually use sound scientific methods. So when a vendor says their model “beats GPT-5 on reasoning,” odds are it means nothing measurable.

And it shows. Kaseya found SMBs don’t trust AI to run security alone—just 12% do. The rest are worried about accuracy and privacy, and rightly so. Phishing is still the biggest threat, and most companies haven’t even tested their response plans.

Here’s the takeaway for IT providers: don’t sell hype, sell validation.

Run your own tests. Demand real benchmarks from vendors. Keep humans in the loop. Because the real risk isn’t AI itself—it’s trusting what you can’t verify.

This is where providers win: by being the layer that turns AI promise into dependable performance.
