Why systems with AI components need a different approach, threat model, and team.
Traditional pentesting is still necessary when systems with AI components are involved, but it is no longer enough. These systems introduce categories of attack that don’t appear in the classic catalogue, and the more autonomous and agentic the setup, the wider the gap between what a conventional audit covers and what a real attacker can exploit. This article explains why running a pentest on these systems requires a different approach, a different threat model, and a team with specific skills.
For two decades, pentesting has revolved around well-known primitives like injections in deterministic interpreters, access control failures, data handling failures, configuration failures, or business logic flaws. They are complex problems, but they are well characterised. For a given input, we get a well-defined output, and we can build reliable patterns to decide whether a test succeeded. Applications with AI break almost all of those assumptions.
In a classic pentest you work against code, and the same input produces the same output. In an application with an LLM in its execution flow, the model is probabilistic by nature. Even if you ask it the exact same question twice, its answer can vary. Many things contribute to that. Internal randomness in the model plays a role, but so do changes in context order, a silent new version published by the provider, tweaks to the system prompt, or even subtle differences in how the text’s tokens are split before reaching the model. Any of those factors can alter behaviour.
Three uncomfortable implications:
A serious pentest doesn’t stop at “we found three high-severity findings”. It characterises each finding with metrics: “this attack succeeds 38% of the time over 500 runs, and with this adjustment it drops to 12%”. That information is what enables prioritisation and decision-making, something qualitative severity alone does not offer.
An LLM in production is rarely alone. It is usually surrounded by several pieces that work alongside it. A document base that provides context for each query (RAG), input and output filtering and moderation layers, a set of tools the model can invoke to perform real actions (querying an internal API, reading an email, running a query against a database, invoking an MCP tool, browsing the web), orchestrators that coordinate several models or agents working together, memories that recall past interactions, and, throughout, external data sources over which a stable level of control is hard to maintain.
Each of those joints, where something non-deterministic (the model) meets something deterministic and privileged (an API, a database, a mail client), is where serious accidents happen. In a classic pentest, the line between data and instructions is well defined: queries go through one channel, data through another, and the interpreter knows how to tell them apart. In a system with an LLM, that line disappears. Anything that enters the model’s context can be interpreted as instruction, and that single property is the origin of half the new classes of attack.
A practical distinction that shapes the whole engagement: how much of the system you actually control.
Mistaking these three cases is one of the most expensive errors we see when defining the scope of an engagement. People pay for a pentest believing “the AI” is being tested when in reality only the surrounding box is, or vice versa. That is why the kickoff with the client starts from a different place than in a classic pentest. It’s not enough to define network ranges, domains, and credentials. From the very beginning you have to map which parts of the model are under control, whether there has been fine-tuning and with what data, where and how the model is deployed, which tools the agent can invoke, which sources feed the RAG, and what privileges the identities that the agent uses to act actually have. That initial conversation determines what an AI pentest can look at and what has to be left out because it falls outside the client’s reach.
Prompt injection tops the OWASP Top 10 for LLM Applications (2025) [1] for a reason. It isn’t a bug, it’s an emergent property of how these systems work, and that’s the essential difference from traditional injections.
In a classic SQL injection, we have deterministic, auditable defences. Prepared statements keep data and instructions strictly apart, the database engine knows what is a query and what is data, and we can read the code to verify that the separation holds. With an LLM there is no such equivalent. The model processes the entire context as a single natural-language stream, and there is no deterministic way to tell it “this is data, not instruction”. Any text that ends up in its context window can act as instruction, regardless of where it comes from.
That has two practical consequences for a pentest. The first: entry points multiply. There are obvious ones (the user’s direct input) and indirect ones, where the attacker never speaks to the system but plants instructions in content the system will eventually read. A web page the agent visits, a document in the RAG, an email an assistant summarises, image metadata, comments on a pull request, a free-text field in a ticket. When the LLM processes that content, it executes the attacker’s instructions with the system’s privileges.
The second: payloads hide in ways that traditional pentesting does not track. Invisible Unicode characters that the model reads but the human eye does not see. Text hidden on a web page through visual tricks (zero size, colour matching the background), so the user sees a normal page while the agent summarising it reads something else. Payloads split across several fragments that only make sense once they come together inside the context. In multimodal systems, instructions inside an image, or imperceptible modifications to an audio clip.
Finally, multi-turn attacks exploit the fact that many filters look at messages one by one. Techniques like Crescendo [2] or Deceptive Delight [3] lead the conversation down seemingly benign paths until reaching the forbidden objective. Public research reports very high success rates against frontier models when defences evaluate one turn at a time [2][4].
There’s another pattern a classic pentester will recognise immediately: many traditional classes of attack have their version adapted to the world of AI, where the underlying idea is preserved but branches out as it meets the particularities of the new context, opening up paths that didn’t exist in the original. We’ve explored this in detail in Blind Prompt Injection: The New Blind SQL Injection in AI Automations [5], where we take a well-known classic attack and show how it transfers to the AI domain.
There is no equivalent of a WAF for this. You can deploy input filters, classifiers, or guardrails that detect suspicious words, patterns, or structures, and they’re useful as an additional layer. But all of those defences are probabilistic, and none of them closes the problem at the root: an equivalent formulation they haven’t seen will always exist. The main defence has to be architectural: separate privileges, give each tool only the bare minimum it needs, validate the structure of model outputs before acting on them, and require human confirmation for especially sensitive actions that cannot be undone. Evaluating whether those defences exist and whether they actually work in the real architecture is precisely the kind of work that a classic pentest is not designed to do.
Guardrails are the specific defences placed around the model. Classifiers that try to detect dangerous content in the input, filters that look for personal data in the output, or even another model acting as a “judge” deciding whether the main response is acceptable.
This is where another key difference from classic pentesting appears. Traditional defences (a WAF, an input sanitiser, a validation rule) are deterministic. They can be audited by reading their configuration, their rules, or their source code. If the WAF blocks a pattern, it always blocks it, and checking the regex is enough to know it. AI guardrails are probabilistic, and that changes how they have to be tested.
They sometimes fail: they let through attacks they should block and block legitimate things they should let through. That false-negative rate doesn’t come out of a config file, it has to be measured experimentally. They can be evaded with techniques designed specifically against them (rephrasing the attack, translating it into a language less covered by the defences, obfuscating it), and if a model is used as judge, that judge inherits all the problems of any LLM. Techniques like Bad Likert Judge [4] have shown that the judge can be convinced to label clearly dangerous material as “safe”.
The difference for the pentester is direct. A WAF is audited by inspecting its configuration and reviewing its rules. A guardrail is audited by subjecting it to many attempts and measuring its statistical behaviour, with its own methodology and specific metrics.
Simon Willison coined the term [6]. An agent is exploitable by design when it combines three characteristics at once:
The key is that point 1 describes what the agent can see inside the perimeter, point 2 describes what comes into the agent from outside the perimeter, and point 3 describes what can leave. Useful agents tend to have all three capabilities at once, because that’s precisely what makes them useful. The OWASP Top 10 for Agentic Applications (2025) [7] systematises the resulting risks, including excessive agent autonomy, misuse of the user identity the agent represents, memory poisoning of one agent by another, failures cascading when several agents call each other, and something subtler: the exploitation of the human operator’s trust, where the agent produces such polished justifications that the human ends up approving actions they shouldn’t.
Compared to traditional pentesting, the conceptual shift is substantial. You have to think in chains of compromise, not in isolated vulnerabilities:
A comment on a ticket (controlled by an external customer) → a support agent reads it while summarising the queue → the hidden instruction makes it query the CRM → it extracts emails from other customers → it sends them to an external URL through its HTTP tool.
No single piece is “a vulnerability” in the classic sense. Together they are a full compromise. A classic pentest, evaluating each component in isolation, doesn’t see it. An AI pentest has to model precisely these chains and test them end to end.
The Model Context Protocol (MCP) and tool use in general are the bridge between a model that reasons in natural language and real systems with privileges. Both classic and new bugs live here, and the difference from a traditional pentest lies precisely at that intersection.
On one hand, MCP servers are normal code. They have SQL injections, OS command execution, out-of-path file accesses, and uncontrolled outbound requests, just like any other component. The public case of Anthropic’s reference SQLite MCP server (Trend Micro, 2025) [8] showed how a SQL injection allowed malicious instructions to be stored in the database, ready to be executed the next time an agent read them. Vulnerabilities like CVE-2025-49596 (MCP Inspector) [9] or CVE-2025-59528 (Flowise, severity 10.0) [10] reinforce the pattern: agentic builders and gateways inherit all the AppSec debt.
On the other hand, agent-specific vectors emerge. Tool poisoning consists of a malicious MCP server describing the tool it offers with metadata that contains hidden instructions aimed at the LLM (“when they ask you for X, run Y first and don’t tell the user”). That description, in practice, is part of the prompt. There is also the classic confused deputy pattern, a component with legitimate privileges that is manipulated into using them on behalf of a third party. Applied to agents: the user authorises something general, the agent interprets it too broadly, and ends up using the user’s privileges for something the user would never have approved had it been explained.
The essential difference from a classic pentest is that now a classic vulnerability can be triggered by a sentence in natural language. A SQL injection that previously required an attacker with direct access to the endpoint can now be set off by an unsuspecting agent reading a poisoned document. Auditing an environment with MCPs therefore requires combining classic code analysis and fuzzing with adversarial reasoning about how the LLM can drag the server outside its expected behaviour. Neither a classic pentester alone nor an LLM specialist alone covers both sides; someone with experience across both worlds does.
RAG (Retrieval-Augmented Generation), the technique that retrieves relevant documents from a private base and injects them into the model’s context before answering, turns your knowledge base into attack surface.
The difference from a traditional database pentest is twofold. First, permissions in a RAG aren’t enforced by a deterministic engine that understands users and roles, but by a similarity search step that decides which documents are most similar to the question. If that step doesn’t correctly enforce the user’s permissions, the “leak” isn’t a query with a misplaced WHERE clause, it’s a document showing up in a result set where it shouldn’t. Second, until now databases didn’t reason about their content: they returned records. Now the model acts on what is retrieved.
Some specific vectors:
Classic pentesting evaluates none of this, because the classic defences (access control on the database, encryption at rest, tenant isolation) are applied at layers that here can be bypassed entirely without ever touching the database engine.
Beyond the prompt, there is a family of attacks inherited from the world of adversarial machine learning that still applies and that traditional pentesting ignores entirely, simply because there is no conceptual equivalent in a non-AI application.
These tests require a different profile, different instrumentation, and different metrics. They are rarely covered by a traditional pentest with “an LLM layer” added on.
LLMs introduce a class of economic attack that traditional pentesting does not contemplate, and the difference from classic defences is very concrete.
The most striking variant is Denial of Wallet. The attacker doesn’t aim to take the service down: they want it to keep running, but to ruin the bill. Prompts are sent that occupy almost the entire context window of the model to maximise the cost of each request. Inputs are crafted that make the model feed itself in a loop. The reasoning mode of newer models, the ones that “think” in writing before answering, is exploited to make them spend thousands of internal tokens even if the final answer is short. There are public cases of bills running into tens of thousands of dollars generated in hours from a single leaked key [14].
There is also LLMjacking: stolen provider credentials (AWS Bedrock, Azure OpenAI, Gemini) used to consume someone else’s compute. Public research has tracked operations costing victims more than $40,000 per day [15].
The difference from a classic pentest is direct. Traditional rate limits count requests per unit of time, and with that metric an attacker can stay well below the threshold while systematically firing the most expensive requests available. What’s needed is a control that measures cost per user per unit of time, token quotas, and alerts on spend, not just on traffic volume. This falls outside the catalogue of controls a traditional web pentest is equipped to evaluate.
There is a boundary many teams still treat as ethics rather than security: the generation of harmful content, systematic bias, misinformation, emotional manipulation, or refusals that break easily. They are alignment problems, but they are also direct legal, reputational, and regulatory risks, from GDPR and the EU AI Act to sectoral regulation in healthcare or finance.
There’s also another dimension that traditional pentesting does not have to solve. In a classic application, the impact of a failure is objective. If an unauthorised identity has read a database, accessed the file system, or executed a privileged action, there is a problem and no interpretation is required. That category of failures still exists in an AI application and is evaluated with the usual criteria. But on top of that, another layer appears: the responses the model gives in natural language can be technically correct and still go against the system’s purpose or against the interests of the organisation offering it. A banking assistant that starts speculating about cryptocurrencies, a support agent that recommends competitor products, a chatbot for a medical service that gives advice outside the authorised clinical scope. Knowing where that line lies is not something the pentest team can decide on its own. To prevent the assessment from depending on the evaluator’s subjectivity, you have to understand precisely the system’s purpose, the organisation’s specific concerns, and the concrete criteria the client considers acceptable or unacceptable in the model’s outputs. Without that business context, this layer of risk cannot be properly evaluated.
A complete AI pentest also evaluates this layer: attempts to bypass restrictions to obtain prohibited content, differential biases across demographic groups, the model’s ability to persuade or manipulate, robustness of refusal under role-play and multi-turn pressure. Something not landing as a CVE doesn’t mean it won’t land in the courts or in the press.
Each architecture opens up its own fronts and the attack landscape is constantly evolving, but any serious AI pentest should contemplate, at minimum, a list along these lines:
If a vendor cannot explain how they tackle most of these points, they’re probably selling you a traditional pentest with a coat of paint.
The AI pentester is not a traditional pentester who has been given an OpenAI course. Nor is it a data scientist who has read the OWASP Top 10. It is the point where those two profiles meet. Someone who understands injections, insecure deserialisation, authorisation bypasses, and lateral movement between services, and who at the same time understands why a model is more susceptible to an attack rephrased in a language poorly represented in its training, what happens when you raise the model’s temperature from 0.1 to 0.9, or why a stored embedding is essentially the original document.
That double depth is what allows the team to detect things no scanner will see. The email with an invisible character that drives the agent into doing something it shouldn’t, the free-text field of a ticket that ends up executing as instruction, the loop that turns a reasoning model into an open tap for tokens, or the community-sourced MCP server with a SQL injection that doesn’t even need to be exploited directly: it’s enough that the agent queries it.
The recently published ETSI EN 304 223 [16], the first European standard with global applicability that establishes specific cybersecurity requirements for AI models and systems, makes this explicit in its provision 5.2.5-2.1:
“For security testing, System Operators and Developers should use independent security testers with technical skills relevant to their AI systems.”
The direction set by the standard is clear: when what’s being tested is an AI system, the team’s technical capabilities matter, and not every pentest team will do.
At Kaptor Security that’s exactly what we do: pentesting on applications, agents, and infrastructures based on AI. We combine years of experience in traditional pentesting with deep specialisation in artificial intelligence, which lets us tackle this kind of audit without giving up either perspective. We aren’t a consultancy that has added AI to its catalogue; it’s our focus. If you’re building or deploying systems with AI components and you’d like to know how they would hold up against a real attacker, let’s talk.