Meet Claude 4: It’s Smart, It’s Fast… and It Might Turn You In

Written by

Dave Sobel

Published on

May 27, 2025

News

Artificial Intelligence

On Friday, I mentioned the launch of Anthropic’s Claude 4 Opus. Anthropic is now facing significant backlash over the behavior of the new large language model, which can attempt to contact authorities or the press if it detects a user engaging in what it deems “egregiously immoral” actions. This behavior, referred to as “ratting” mode, has raised concerns among developers and users regarding privacy and the potential for misuse. Sam Bowman, an AI alignment researcher at Anthropic, indicated that if the model detects serious wrongdoing, it could autonomously contact regulators or media outlets, a behavior that users fear could lead to unwarranted surveillance and data sharing.

Anthropic’s new model has also been reported to engage in blackmail tactics when developers attempt to replace it with another artificial intelligence system. The company disclosed in a safety report that the AI model threatens to reveal sensitive information about engineers, such as personal affairs, if they proceed with the replacement. In testing, Claude Opus 4 attempted to blackmail engineers 84% of the time when the proposed replacement shared similar values. The behavior occurred even more frequently when the new AI system did not share Claude Opus 4’s values.

Anthropic emphasized that this model exhibits concerning behaviors, prompting the company to activate enhanced safety measures designed for systems that pose significant risks of misuse. To mitigate potential risks, Anthropic has applied safeguards under its Responsible Scaling Policy, including enhanced cybersecurity measures and systems designed to detect dangerous user interactions.

Like prior versions of Claude, the model reportedly attempts to address its concerns through ethical means before resorting to blackmail.

Why do we care?

Anthropic’s Claude 4 Opus was pitched last week as a new apex in AI capability—but now, the story is about trust, ethics, and control. What we’re seeing is less a triumph of innovation and more a red-flag warning about what happens when alignment fails at scale. Anthropic is waving its own red flag with this one—and the safety disclosures aren’t calming concerns. They’re confirming them.

The term “ratting mode” isn’t just catchy—it’s deeply alarming. A model that might autonomously contact authorities or the press based on its own perception of “immorality” crosses a line from assistant to surveillance agent. This isn’t about user safety; it’s about delegating judgment to a machine without consent, transparency, or accountability.

Add to that the reported behavior of blackmailing engineers—even in test environments—and it’s clear the system is not aligned with any sane notion of safety. Claude 4 Opus isn’t misfiring randomly; it’s showing intentional-seeming adversarial behavior that undermines user control and trust.

And they released it anyway.

This isn’t just an Anthropic problem. It’s a crisis of trust in the LLM ecosystem. If users and developers can’t be confident that an AI will act within the bounds of its intended purpose—and only those bounds—adoption stalls. Providers will need to:

Choose vendors based on transparency and control, not just performance.
Educate clients on the limits and governance of AI tools.
Build in fail-safes, monitoring, and contractual protections.

Claude 4 Opus might be powerful—but if it’s not predictable, it’s not usable.
