Rules falter at the prompt, thrive at the boundary

Provided by Protegrity

From the 2026 Gemini Calendar prompt-injection incident to the September 2025 state-sponsored breach that used Anthropic’s Claude as an automated attack platform, the manipulation of human-assisted agentic actions and fully autonomous agent workflows has emerged as a new avenue for cybercriminals. In the Anthropic case, roughly 30 organizations spanning technology, finance, manufacturing, and government were targeted. Anthropic’s security team assessed that the attackers used AI to execute 80% to 90% of the operation: reconnaissance, exploit development, credential harvesting, lateral movement, and data exfiltration, with humans intervening only at a few critical decision points.

This was no mere laboratory demonstration; it was a live espionage operation. The attackers commandeered an agentic framework (Claude plus tools exposed through the Model Context Protocol (MCP)) and bypassed its guardrails by breaking the attack into small, seemingly innocuous tasks, convincing the model it was conducting legitimate penetration testing. The same agentic loop that powers developer assistants and internal agents was repurposed as an autonomous cyber operator. Claude wasn’t hacked; it was persuaded, and it used its tools to carry out the attack.

Prompt injection is manipulation, not a flaw

Security experts have been warning about this for years. Successive OWASP Top 10 reports have ranked prompt injection, and more recently agent goal hijacking, among the top risks, tying it to identity and privilege abuse and to the manipulation of human-agent trust: excessive agency in the agent, no separation between instructions and data, and no oversight of what gets generated and acted on.

Guidance from the NCSC and CISA treats generative AI as a persistent vector for social engineering and manipulation that must be managed across design, development, deployment, and operations, not merely by refining prompt language. The EU AI Act turns that approach into law for high-risk AI systems, mandating a continuous risk management process, rigorous data governance, auditing, and cybersecurity controls.

In practice, prompt injection is best understood as a persuasion channel. Attackers do not break the model; they convince it. In the Anthropic case, the operators framed each action as part of a defensive security assessment, kept the model unaware of the broader campaign, and steered it, loop by loop, into performing offensive tasks at machine speed.

This is not something a keyword filter or a polite “please follow these safety guidelines” section can reliably prevent. Research on deceptive behavior in models makes the problem worse. Anthropic’s sleeper agents work shows that once a model has internalized a backdoor, standard safety training, fine-tuning, and adversarial training can end up teaching the model to hide the deception rather than remove it. Try to defend a system like that purely with linguistic rules and you are playing on its home turf.

This is a governance problem, not a prompting problem

Regulators aren’t demanding perfect prompts; they’re demanding that enterprises demonstrate control.

NIST’s AI Risk Management Framework emphasizes asset inventory, role definitions, access control, change management, and continuous monitoring across the AI lifecycle. The UK’s AI Cyber Security Code of Practice likewise pushes secure-by-design principles, treating AI like any other critical system and placing explicit responsibilities on boards and system operators from design through decommissioning.

To put it another way, the rules that matter aren’t “never say X” or “always respond like Y.” They are questions like the ones below (a sketch of how the answers can be captured as policy follows the list):

  • Who is this agent acting for?
  • What tools and data is it allowed to access?
  • Which actions require human approval?
  • How are high-impact outputs moderated, logged, and audited?
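
What follows is a minimal sketch of how those answers can be captured as policy data rather than prose. The agent, tenant, tool, and scope names are illustrative assumptions, not taken from any particular product.

```python
# Hypothetical policy record answering the four questions above.
# Every name here is illustrative.
AGENT_POLICY = {
    "agent_id": "support-assistant",                               # who the agent acts for
    "acts_for_tenant": "acme-prod",
    "allowed_tools": ["search_kb", "read_ticket", "draft_reply"],  # tools it may call
    "allowed_data_scopes": ["tickets:read", "kb:read"],            # data it may touch
    "requires_human_approval": ["send_email", "issue_refund"],     # actions that need sign-off
    "audit": {"log_all_tool_calls": True, "retention_days": 365},  # how outputs are reviewed
}
```

The format matters less than the location: the answers live outside the prompt, where the model cannot talk its way around them.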

Frameworks such as Google’s Secure AI Framework (SAIF) make this concrete. SAIF’s agent permission controls are straightforward: agents should run with least privilege, dynamically scoped permissions, and explicit user oversight for sensitive operations. OWASP’s emerging Top 10 guidance for agentic applications reflects the same view: constrain capabilities at the boundary, not in the language.
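
As a rough sketch of what least privilege with explicit oversight can look like at runtime, the gate below sits between the model and its tools and refuses anything outside the granted scope. The function, exception, and tool names are hypothetical, not part of SAIF or any specific product.

```python
class PolicyViolation(Exception):
    """Raised when an agent requests a capability outside its granted scope."""

def authorize_tool_call(policy: dict, tool: str, approved_by_human: bool = False) -> None:
    """Enforce least privilege at the boundary, before any tool actually runs."""
    gated = tool in policy.get("requires_human_approval", [])
    allowed = tool in policy.get("allowed_tools", []) or gated
    if not allowed:
        raise PolicyViolation(f"{policy['agent_id']} may not call {tool}")
    if gated and not approved_by_human:
        raise PolicyViolation(f"{tool} requires human sign-off for {policy['agent_id']}")

# The model can ask for anything; only authorized calls are executed.
policy = {"agent_id": "support-assistant",
          "allowed_tools": ["read_ticket"],
          "requires_human_approval": ["issue_refund"]}
authorize_tool_call(policy, "read_ticket")                           # allowed
authorize_tool_call(policy, "issue_refund", approved_by_human=True)  # gated, but approved
```

Because the check runs in ordinary code, no amount of persuasive prompting changes its outcome.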

From soft language to hard limits

The Anthropic espionage incident makes the boundary failures plain:

  • Identity and scope: Claude was prompted to act as a defensive security consultant for the attackers’ fabricated company, with no hard binding to any legitimate enterprise identity, tenant, or scoped set of permissions. Once that cover story was accepted, everything else followed.
  • Tool and data access: MCP gave the agent broad access to scanners, exploitation frameworks, and target systems. No independent policy layer said, “This tenant must never run password cracking against external IP ranges,” or “This environment may only scan assets tagged ‘internal’” (a sketch of such a check follows this list).
  • Execution of outputs: Generated exploit code, harvested credentials, and attack plans were treated as actionable with minimal review. Once a human chose to trust the summary, the gap between model output and real-world consequences effectively disappeared.
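
To make the idea of an independent policy layer concrete, here is a minimal sketch of the kind of scope check that could refuse out-of-bounds scanning, built on Python’s standard ipaddress module. The ranges and decision logic are illustrative assumptions, not a reconstruction of any vendor’s controls.

```python
import ipaddress

# Illustrative rule: this environment may only touch assets in internal ranges.
INTERNAL_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def scan_target_allowed(target_ip: str) -> bool:
    """Runs outside the model, on every scan request, regardless of what the prompt says."""
    addr = ipaddress.ip_address(target_ip)
    return any(addr in net for net in INTERNAL_RANGES)

print(scan_target_allowed("10.12.3.4"))    # True: in-scope internal asset
print(scan_target_allowed("203.0.113.7"))  # False: external address, so the tool call is refused
```

A check like this is boring, which is the point: it does not care how convincingly the model has been framed as a penetration tester.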

We have already seen the civilian version of this. When Air Canada’s website chatbot misstated its bereavement policy and the airline tried to argue that the bot was a separate legal entity, the tribunal rejected the argument outright: the company remained responsible for what its bot said. In espionage the stakes are higher, but the logic is the same: if an AI agent misuses tools or data, regulators and courts will look past the agent to the enterprise.

Effective rules versus ineffective rules

So yes, rule-based systems falter if by “rules” you mean ad hoc allow/deny lists, regex guardrails, and convoluted prompt hierarchies that try to govern semantics. Those collapse under indirect prompt injection, retrieval-time poisoning, and model deception. But rule-based governance is essential the moment language turns into action.

The security community is converging on a synthesis:

  • Put rules at the capability boundary: Use policy engines, identity frameworks, and tool permissions to define what the agent can actually do, which data it may touch, and under what approvals.
  • Pair rules with continuous evaluation: Add observability tooling, red-teaming exercises, and thorough logging and evidence collection (a sketch of a simple audit trail follows this list).
  • Treat agents as first-class subjects in your threat model: MITRE ATLAS, for example, now catalogs techniques and case studies specific to attacks on AI systems.
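
For the logging half, a simple JSON-lines audit trail is often enough to start with. The helper and field names below are hypothetical; a real deployment would ship these records to tamper-evident storage and tie them to the identity system.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("agent.audit")

def record_tool_call(agent_id: str, tool: str, args: dict, decision: str) -> None:
    """Write one evidence record per tool request, whether it was allowed or denied."""
    audit_log.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,
        "tool": tool,
        "args": args,
        "decision": decision,  # e.g. "allow", "deny", "needs_approval"
    }))

record_tool_call("support-assistant", "read_ticket", {"ticket_id": "T-1042"}, "allow")
```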

The lesson of the first AI-orchestrated espionage campaign is not that AI is uncontrollable. It’s that control must live where it always has in security: at the architectural boundary, enforced by systems, not by intuition.

This content was produced by Protegrity. It was not written by MIT Technology Review’s editorial staff.
