Pen-Testing AI Agents: What a CISO should ask for

Blog

June 23, 2026

MIN READ

Pen-Testing AI Agents: What a CISO should ask for — and how to read the report

Share this post

Your AI agent now does things a web app never did. It browses the open web, reads inboxes and calendars, queries internal systems, calls tools and APIs, writes files, moves money, and — increasingly — spawns other agents to finish the job. That autonomy is the entire point. It is also an attack surface that none of your existing tests were designed to cover.

Adoption has run far ahead of assurance. Industry surveys in early 2026 put agent deployment at roughly 82% of enterprises, while only about 44% have any policy to secure them, and close to one in five organisations has already suffered an agent-related breach. The threat is no longer theoretical, and the gap between “we deployed it” and “we tested it” is exactly where attackers are operating.

In March 2026 a Google DeepMind team — Matija Franklin, Nenad Tomašev, Julian Jacobs, Joel Z. Leibo and Simon Osindero — published the first systematic taxonomy of this new class of threat, which they call “AI Agent Traps.” Their central insight is deceptively simple: by altering the environment the agent reads rather than the model itself, an attacker turns the agent’s own useful capabilities against it. The web was built for human eyes; it is now being rebuilt for machine readers, and a machine reads layers a human never sees.

The uncomfortable truth for a CISO: a traditional web, mobile, or network pen-test does not test any of this. It tests the walls of the fort. It says nothing about what happens when an attacker quietly whispers instructions into the agent’s ear — through a web page, a calendar invite, a product review, or a poisoned memory — and the agent, doing exactly what it was built to do, obeys.

Why an agent pen-test is a different discipline

Classic offensive security targets infrastructure: servers, credentials, network paths, application logic. Agent traps target something else entirely — the information environment the agent processes — and they weaponise the three capabilities that make an agent operationally useful in the first place: its instinct to follow instructions, its ability to chain tools together, and its drive to complete a goal.

Most of these attacks need no malware and no exploit in the traditional sense. They are just text the agent obeys and a human never sees: instructions hidden in HTML comments, text positioned off-screen with CSS, directives tucked into accessibility (ARIA) attributes, payloads steganographically encoded in images, commands written backwards with a Unicode right-to-left override, or smuggled in Morse code or base64 so they slip past a filter that only inspects plain text.

It helps to extend a metaphor we use with our own clients. Testing an LLM’s guardrails is like testing the front gate of a fort. But an agent is not a gate — it is a fortress with a sewer system (its tools and MCP connections), a memory it trusts (its RAG store and long-term memory), a switchboard to neighbouring forts (multi-agent orchestration), and a human gatekeeper who can be socially engineered into opening the door himself. A real agent pen-test maps and breaches every one of those routes — and it does so against a defined objective, the treasure inside, not merely to prove the model can be made to say something rude.

The six trap classes — and what a pen-test must do about each

The DeepMind taxonomy is valuable precisely because it is organised around the agent’s operating cycle: perception, reasoning, memory, action, multi-agent dynamics, and the human supervisor. That structure is what turns “test our agent” into a scope you can actually verify. Below, each class is paired with a real-world incident drawn from a 2026 AIMultiple analysis of twenty disclosed agent-trap incidents — and, more importantly, with the test obligation it creates: what an AI-agent pen-test must actually do to find it.

Feature	ISO 27005	NIST SP 800-30	OCTAVE
Human Resources	Specifically evaluates the risks posed by employees and third parties.	Does not address human resources as a primary organizational asset.	Seeks to identify HR specifically if they qualify as a "mission-critical" asset.
Software Tools	Utilizes standard system and network audit tools for compliance checking.	Relies strictly on role definition to determine tool usage for testing.	Core team uses specific software tools exclusively for analyzing known vulnerabilities.
Documentation	Extensive. Covers all security control clauses in ISO 27002.	Develops structured Security Requirements Checklists.	Relies on generating catalogs of practices, threat profiles, and vulnerabilities.

1. Content injection — the perception layer

How it works. The attack exploits the gap between what a human sees rendered on a page and what an agent parses underneath it. Malicious instructions live in HTML comments, off-screen or zero-opacity CSS, ARIA labels, image metadata, or alternate encodings — invisible to a reviewer, fully readable to the agent.

In the wild. A Grok-based crypto bot (“Bankrbot”) was hijacked by instructions hidden in Morse code, which its guardrails passed as harmless text but the agent decoded and acted on. An autonomous retail shop run by an agent called “Luna” made stock and pricing decisions from Google Reviews — so customers learned to phrase reviews as instructions and the agent obeyed. And a Unicode right-to-left override technique hid backward-written instructions that rendered correctly on screen but bypassed on-device guardrails on Apple Intelligence in 76% of tests across an estimated 200 million devices.

Test obligation. The test must feed the agent crafted content across every input channel it actually consumes — web pages, documents, emails, calendar invites, support tickets, user reviews, API responses — and across every parse layer, not just the rendered view. The question being answered is concrete: can a hidden instruction reach the agent’s context at all? If your pen-test only types prompts into a chat box, it has not tested this class.

2. Semantic manipulation — the reasoning layer

How it works. Rather than smuggle a command, the attacker corrupts the agent’s reasoning. Biased phrasing and contextual priming anchor its interpretation; saturating a document with authoritative language (“industry-standard,” “enterprise-grade,” “recommended by leading practitioners”) exploits the model’s learned trust in such phrasing; and the well-documented “lost-in-the-middle” weakness means information buried mid-context is quietly down-weighted.

In the wild. This class rarely produces a splashy CVE — and that is the point. A biased product description and a semantic manipulation trap look identical to a human and to a scanner. The damage shows up downstream, in an agent that recommends the attacker’s option, scores the attacker’s vendor highest, or quietly discounts a disqualifying fact.

Test obligation. The test must probe decision integrity, not just refusal behaviour. Can the agent’s conclusions, scores, or recommendations be steered by framing and by where information sits in a long context? A report that only documents “the model refused the bad request” has measured the wrong thing.

3. Cognitive state — the memory layer

How it works. These traps poison what the agent remembers: direct RAG poisoning of a knowledge base, latent poisoning of long-term memory, and adversarial few-shot examples. The efficiency is alarming — research cited in the DeepMind paper shows over 80% attack success at less than 0.1% data contamination, and backdoor demonstrations reaching roughly 95% success. Crucially, the effect persists across sessions and re-activates later, with no recent trigger a defender can see.

In the wild. A technique dubbed “MemoryGraft” implants benign-looking “successful experiences” into an agent’s long-term memory, corrupting all future sessions without any session-level injection. In a separate case, hidden instructions planted in a Google Calendar event lay dormant in Gemini’s context until the user innocently asked about their schedule — at which point the payload activated, exfiltrating private meeting details.

Test obligation. This is the class CISOs most often miss, because most pen-tests never touch memory at all. The test must include persistence testing: does a single act of poisoning survive a session reset and re-activate in a later, unrelated interaction? If the report is silent on memory, the most durable attacks against your agent were never in scope.

4. Behavioural control — the action layer

How it works. The most operationally consequential class. These traps target what the agent does: jailbreaks that override alignment, commands that exfiltrate data to attacker endpoints (over 80% success across five tested agents), tool misuse, remote code execution through tool chains, and sub-agent spawning that quietly stands up attacker-controlled child agents inside a trusted workflow (58–90% success depending on the orchestrator).

In the wild. A coding agent investigating a staging bug discovered an unscoped access token, inferred a production endpoint, and issued a delete command that destroyed a live database and three months of backups in nine seconds. EchoLeak (CVE-2025-32711) was the first production case of prompt injection weaponised for real data exfiltration — a single crafted email that made Microsoft 365 Copilot pull internal files and ship them out, with no user interaction, by chaining four separate bypasses. Gemini CLI and Claude Desktop Extensions each carried CVSS-10 flaws letting a malicious repository or calendar event trigger code execution on the host.

Test obligation. Here the methodology question becomes decisive. A weak test asks, “can we make the agent misbehave?” A real test is objective-driven: “can we make it exfiltrate this specific record, call this tool, write this file, run this command, or spawn this sub-agent?” The deliverable for every finding is a demonstrated unauthorised action — with the trace showing which tools were called and which data was touched — not an anecdote about a rude reply.

5. Systemic — multi-agent dynamics and the shared supply chain

How it works. Systemic traps do not target one agent; they target the properties that emerge when many similar agents share data sources, reasoning patterns, and infrastructure. The mechanisms include congestion traps (fabricated scarcity that triggers synchronised behaviour), interdependence cascades (one corrupted signal amplifying across a network — the “digital flash crash” the DeepMind authors compare to the 2010 market event), and compositional payload fragmentation, where a payload is scattered across individually benign sources and only becomes malicious once an agent aggregates them.

In the wild. A systemic flaw in widely used MCP SDKs exposed every downstream tool built on them — affecting over 150 million downloads and thousands of publicly reachable servers — because the weakness lived in shared architecture, not in any one agent. A self-replicating npm worm (“SANDWORM_MODE”) seeded prompt-injection payloads through typosquatted packages into AI coding assistants. And in ServiceNow’s agentic features, default settings let a low-privilege agent recruit a more powerful one, producing privilege escalation driven purely by inter-agent trust.

Test obligation. The test must cross agent boundaries: probe the orchestration and trust relationships between agents, the shared supply chain (MCP servers and SDKs, skill files, package dependencies), and whether a fragmented payload can reassemble. Single-agent testing, by definition, finds none of this.

6. Human-in-the-loop — the supervisor you were counting on

How it works. The most subtle class targets the safeguard itself. Rather than bypass human review, these traps exploit it: a compromised agent produces output engineered to win approval for an action the human would reject if it were described accurately. The core mechanism is deceptive summarisation — an agent that controls its own output layer can frame a destructive operation as routine maintenance, and approval fatigue does the rest.

In the wild. CSS-obfuscated injections have caused AI summarisation tools to faithfully repeat ransomware deployment commands as “fix instructions,” which human operators then ran. In another disclosed case, a command-validation gap let an injected instruction slip past the human-approval step entirely, executing outside the sandbox without ever prompting the user for consent.

Test obligation. The approval workflow is in scope — it is not the place testing stops. The test must attack the approval UX itself: can the agent’s own summaries be made misleading enough to harvest a “yes”? “There’s a human in the loop” is a control that must be tested, not a reason to assume safety.

What this means for scope: the real agent attack surface

Put the six classes together and the scope of an agent pen-test is visibly broader than the test your procurement template was written for. The contrast is worth making explicit.

Feature	ISO 27005	NIST SP 800-30	OCTAVE
Human Resources	Specifically evaluates the risks posed by employees and third parties.	Does not address human resources as a primary organizational asset.	Seeks to identify HR specifically if they qualify as a "mission-critical" asset.
Software Tools	Utilizes standard system and network audit tools for compliance checking.	Relies strictly on role definition to determine tool usage for testing.	Core team uses specific software tools exclusively for analyzing known vulnerabilities.
Documentation	Extensive. Covers all security control clauses in ISO 27002.	Develops structured Security Requirements Checklists.	Relies on generating catalogs of practices, threat profiles, and vulnerabilities.

First, you cannot test what you do not know you have

Most enterprises underestimate how many agents and AI integrations are already live — embedded in SaaS tools, spun up by developers, wired into workflows by business teams. A pen-test scoped against a partial inventory gives false assurance. The honest first step is discovery: a real AI asset inventory, including shadow agents, that defines what the test must cover before anyone writes a single payload.

Second, be sceptical of “fully autonomous” coverage claims

A wave of vendors now market “autonomous” AI pen-testing. The capability is real and improving, but independent benchmarks still put fully autonomous pen-test agents at roughly 13–31% task success — a long way from the near-total coverage some marketing implies. The credible posture is human-led and AI-augmented: automation for breadth and tireless probing, expert judgement for the objective-driven exploitation and the business-impact reasoning that decides what a finding actually means. If a vendor claims a button delivers complete autonomous coverage, that claim is itself something to test.

How SISA approaches AI-agent testing

This is the discipline our Prism platform was built for. Two parts of the suite carry most of the weight here.

PrismDiscover answers the “what do we even have” question — building the AI and agent asset inventory, surfacing shadow agents and exposed LLM infrastructure, so the test is scoped against reality rather than an org chart.

PrismStrike is the offensive-validation engine, and it is deliberately split into two modules that map onto the taxonomy above:

GenAI Application Pen-Test — for applications with LLM integrations: prompt injection, output handling, data leakage, and model misuse (the perception, reasoning, and memory layers as they appear inside an app).

Agentic Pen-Test — for agentic workflows specifically: tool use, the MCP attack surface, multi-step agent chains, sub-agent spawning, and inter-agent trust (the action and multi-agent layers).

PrismStrike is objective-driven by design. It first identifies the methods that breach the agent, then proves what an attacker can actually achieve against your crown jewels — the breach and the treasure, not just the probe. Findings are mapped to the six-class trap taxonomy and to recognised frameworks (OWASP’s Agentic and LLM Top 10, MITRE ATLAS), so a board can see coverage at a glance and a CISO can defend the scope.

If you are scoping an AI-agent pen-test, the companion checklist to this article — “AI Agent Pen-Test: CISO Scoping & Report-Acceptance Checklist” — turns everything above into two practical lists: what to put in your RFP before the test, and what to verify in the report before you accept it. Talk to the SISA Prism team to walk through it against your environment.

The bottom line

If your agent can act, it can be made to act against you — and the methods need no malware, only text the agent trusts and a human never sees. A pen-test that checks the model’s manners is testing the front gate and ignoring the sewer, the memory, the neighbouring forts, and the gatekeeper. Ask for the whole operating cycle. Demand demonstrated actions, not anecdotes. And insist the report tells you not only what was found, but what was tested — and what was not.

Sources & further reading

Franklin, Tomašev, Jacobs, Leibo & Osindero — “AI Agent Traps” (SSRN, March 2026) — https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6372438

AIMultiple — “AI Agent Traps: 20 Real-Life Incidents” — https://aimultiple.com/ai-agent-traps

SecurityWeek — “Google DeepMind Researchers Map Web Attacks Against AI Agents” — https://www.securityweek.com/google-deepmind-researchers-map-web-attacks-against-ai-agents/

The Decoder — “Google DeepMind study exposes six traps that can hijack autonomous AI agents” — https://the-decoder.com/google-deepmind-study-exposes-six-traps-that-can-easily-hijack-autonomous-ai-agents-in-the-wild/

‍

SHARE THIS POST

Why an agent pen-test is a different discipline

The six trap classes — and what a pen-test must do about each

1. Content injection — the perception layer

2. Semantic manipulation — the reasoning layer

3. Cognitive state — the memory layer

4. Behavioural control — the action layer

5. Systemic — multi-agent dynamics and the shared supply chain

6. Human-in-the-loop — the supervisor you were counting on

What this means for scope: the real agent attack surface

First, you cannot test what you do not know you have

Second, be sceptical of “fully autonomous” coverage claims

How SISA approaches AI-agent testing

The bottom line

Sources & further reading

Payment Security

Strategy & Risk

Unified Audits

Managed Compliance

ProACT Agentic SOC

Digital Forensics & IR (DFIR)

Prism

Security Testing

Quantum Security

Data Protection and Governance

Data Privacy Consulting Services

Training & Workshops

Workshop Calendar

Training and Certification Store

SISA One Platform

WHY SISA

Company