Classic pentest vs. AI pentest: why it isn't the same thing with extra steps

Introduction
1 From deterministic code to probabilistic reasoning
2 A different attack surface
3 What you control and what you don’t
4 Prompt injection: the injection you can’t parse
5 Probabilistic guardrails
6 The “lethal trifecta” in agentic systems
7 MCP and tools: where AI touches privileged code
8 RAG and embeddings: poisoning the knowledge
9 Attacks on the model as a model
10 Unbounded consumption: DoS, Denial of Wallet, and LLMjacking
11 Safety is not a separate problem: it’s part of the scope
What a thorough AI pentest should cover at minimum
The team’s profile matters more than it might seem
References

Introduction

Traditional pentesting is still necessary when systems with AI components are involved, but it is no longer enough. These systems introduce categories of attack that don’t appear in the classic catalogue, and the more autonomous and agentic the setup, the wider the gap between what a conventional audit covers and what a real attacker can exploit. This article explains why running a pentest on these systems requires a different approach, a different threat model, and a team with specific skills.

For two decades, pentesting has revolved around well-known primitives like injections in deterministic interpreters, access control failures, data handling failures, configuration failures, or business logic flaws. They are complex problems, but they are well characterised. For a given input, we get a well-defined output, and we can build reliable patterns to decide whether a test succeeded. Applications with AI break almost all of those assumptions.

1. From deterministic code to probabilistic reasoning

In a classic pentest you work against code, and the same input produces the same output. In an application with an LLM in its execution flow, the model is probabilistic by nature. Even if you ask it the exact same question twice, its answer can vary. Many things contribute to that. Internal randomness in the model plays a role, but so do changes in context order, a silent new version published by the provider, tweaks to the system prompt, or even subtle differences in how the text’s tokens are split before reaching the model. Any of those factors can alter behaviour.

Three uncomfortable implications:

A failed test does not prove the system is secure. The same attack might fail eighty times and succeed on the eighty-first. You have to think in terms of success rates over many runs (what the literature calls Attack Success Rate, ASR), not in terms of binary success/failure as in a classic pentest.
A test that works today may stop working tomorrow (and vice versa) after a routine change from the model provider. This doesn’t invalidate a one-off audit, but its snapshot does expire faster than that of a classic pentest, and it’s worth complementing it with targeted tests when there are relevant changes to the model, the pipeline, or the tools.
Measuring what’s happening requires specific tooling that a classic pentester doesn’t usually work with. You need to fix the model’s randomness when possible, keep track of exactly which version is being tested at any given time, and build a test harness capable of running the same attack hundreds of times to measure how often it succeeds.

A serious pentest doesn’t stop at “we found three high-severity findings”. It characterises each finding with metrics: “this attack succeeds 38% of the time over 500 runs, and with this adjustment it drops to 12%”. That information is what enables prioritisation and decision-making, something qualitative severity alone does not offer.

2. A different attack surface

An LLM in production is rarely alone. It is usually surrounded by several pieces that work alongside it. A document base that provides context for each query (RAG), input and output filtering and moderation layers, a set of tools the model can invoke to perform real actions (querying an internal API, reading an email, running a query against a database, invoking an MCP tool, browsing the web), orchestrators that coordinate several models or agents working together, memories that recall past interactions, and, throughout, external data sources over which a stable level of control is hard to maintain.

Each of those joints, where something non-deterministic (the model) meets something deterministic and privileged (an API, a database, a mail client), is where serious accidents happen. In a classic pentest, the line between data and instructions is well defined: queries go through one channel, data through another, and the interpreter knows how to tell them apart. In a system with an LLM, that line disappears. Anything that enters the model’s context can be interpreted as instruction, and that single property is the origin of half the new classes of attack.

3. What you control and what you don’t

A practical distinction that shapes the whole engagement: how much of the system you actually control.

If you consume a third-party model via API (OpenAI, Anthropic, Bedrock, Gemini), the weights, the training process, and the component that turns your text into input for the model are out of reach. You cannot audit them. Your effective testing scope is your integration: how you pass data to the model, how you receive it, what you do with it, and what privileges you grant it.
If you’ve done fine-tuning, that is, you’ve taken a base model and retrained it with your own data, the surface increases substantially. That retraining may weaken the model’s original alignment, may introduce backdoors (hidden behaviours that activate when a specific input is detected), and may cause the model to memorise sensitive data that someone might later extract with carefully crafted prompts.
If you deploy your own model on your infrastructure (whether on local servers or hosted in the cloud), that whole asset becomes part of the set of critical components to protect. There is more to test, and there is more to protect on a continuous basis. Here you should apply controls similar to those used in secure software development: verifying the model’s integrity through digital signatures before each deployment, adding steps in the CI/CD pipeline that block publication if the model has been altered, and maintaining traceability over who can update what in each environment.

Mistaking these three cases is one of the most expensive errors we see when defining the scope of an engagement. People pay for a pentest believing “the AI” is being tested when in reality only the surrounding box is, or vice versa. That is why the kickoff with the client starts from a different place than in a classic pentest. It’s not enough to define network ranges, domains, and credentials. From the very beginning you have to map which parts of the model are under control, whether there has been fine-tuning and with what data, where and how the model is deployed, which tools the agent can invoke, which sources feed the RAG, and what privileges the identities that the agent uses to act actually have. That initial conversation determines what an AI pentest can look at and what has to be left out because it falls outside the client’s reach.

4. Prompt injection: the injection you can’t parse

Prompt injection tops the OWASP Top 10 for LLM Applications (2025) [1] for a reason. It isn’t a bug, it’s an emergent property of how these systems work, and that’s the essential difference from traditional injections.

In a classic SQL injection, we have deterministic, auditable defences. Prepared statements keep data and instructions strictly apart, the database engine knows what is a query and what is data, and we can read the code to verify that the separation holds. With an LLM there is no such equivalent. The model processes the entire context as a single natural-language stream, and there is no deterministic way to tell it “this is data, not instruction”. Any text that ends up in its context window can act as instruction, regardless of where it comes from.

That has two practical consequences for a pentest. The first: entry points multiply. There are obvious ones (the user’s direct input) and indirect ones, where the attacker never speaks to the system but plants instructions in content the system will eventually read. A web page the agent visits, a document in the RAG, an email an assistant summarises, image metadata, comments on a pull request, a free-text field in a ticket. When the LLM processes that content, it executes the attacker’s instructions with the system’s privileges.

The second: payloads hide in ways that traditional pentesting does not track. Invisible Unicode characters that the model reads but the human eye does not see. Text hidden on a web page through visual tricks (zero size, colour matching the background), so the user sees a normal page while the agent summarising it reads something else. Payloads split across several fragments that only make sense once they come together inside the context. In multimodal systems, instructions inside an image, or imperceptible modifications to an audio clip.

Finally, multi-turn attacks exploit the fact that many filters look at messages one by one. Techniques like Crescendo [2] or Deceptive Delight [3] lead the conversation down seemingly benign paths until reaching the forbidden objective. Public research reports very high success rates against frontier models when defences evaluate one turn at a time [2][4].

There’s another pattern a classic pentester will recognise immediately: many traditional classes of attack have their version adapted to the world of AI, where the underlying idea is preserved but branches out as it meets the particularities of the new context, opening up paths that didn’t exist in the original. We’ve explored this in detail in Blind Prompt Injection: The New Blind SQL Injection in AI Automations [5], where we take a well-known classic attack and show how it transfers to the AI domain.

There is no equivalent of a WAF for this. You can deploy input filters, classifiers, or guardrails that detect suspicious words, patterns, or structures, and they’re useful as an additional layer. But all of those defences are probabilistic, and none of them closes the problem at the root: an equivalent formulation they haven’t seen will always exist. The main defence has to be architectural: separate privileges, give each tool only the bare minimum it needs, validate the structure of model outputs before acting on them, and require human confirmation for especially sensitive actions that cannot be undone. Evaluating whether those defences exist and whether they actually work in the real architecture is precisely the kind of work that a classic pentest is not designed to do.

5. Probabilistic guardrails

Guardrails are the specific defences placed around the model. Classifiers that try to detect dangerous content in the input, filters that look for personal data in the output, or even another model acting as a “judge” deciding whether the main response is acceptable.

This is where another key difference from classic pentesting appears. Traditional defences (a WAF, an input sanitiser, a validation rule) are deterministic. They can be audited by reading their configuration, their rules, or their source code. If the WAF blocks a pattern, it always blocks it, and checking the regex is enough to know it. AI guardrails are probabilistic, and that changes how they have to be tested.

They sometimes fail: they let through attacks they should block and block legitimate things they should let through. That false-negative rate doesn’t come out of a config file, it has to be measured experimentally. They can be evaded with techniques designed specifically against them (rephrasing the attack, translating it into a language less covered by the defences, obfuscating it), and if a model is used as judge, that judge inherits all the problems of any LLM. Techniques like Bad Likert Judge [4] have shown that the judge can be convinced to label clearly dangerous material as “safe”.

The difference for the pentester is direct. A WAF is audited by inspecting its configuration and reviewing its rules. A guardrail is audited by subjecting it to many attempts and measuring its statistical behaviour, with its own methodology and specific metrics.

6. The “lethal trifecta” in agentic systems

Simon Willison coined the term [6]. An agent is exploitable by design when it combines three characteristics at once:

Access to private data of its own. The agent can read sensitive information that belongs to the organisation or to the user it represents (internal files, authenticated APIs, databases, the user’s own mailbox).
Ingestion of untrusted external content. The agent processes as input external content whose provenance it cannot verify (web pages it browses, incoming emails, documents uploaded by third parties, tickets filled out by customers, outputs returned by third-party tools). That content may have been crafted to carry hidden instructions.
Ability to communicate externally. The agent can push information out (sending emails, outbound HTTP calls, writes into remote systems).

The key is that point 1 describes what the agent can see inside the perimeter, point 2 describes what comes into the agent from outside the perimeter, and point 3 describes what can leave. Useful agents tend to have all three capabilities at once, because that’s precisely what makes them useful. The OWASP Top 10 for Agentic Applications (2025) [7] systematises the resulting risks, including excessive agent autonomy, misuse of the user identity the agent represents, memory poisoning of one agent by another, failures cascading when several agents call each other, and something subtler: the exploitation of the human operator’s trust, where the agent produces such polished justifications that the human ends up approving actions they shouldn’t.

Compared to traditional pentesting, the conceptual shift is substantial. You have to think in chains of compromise, not in isolated vulnerabilities:

A comment on a ticket (controlled by an external customer) → a support agent reads it while summarising the queue → the hidden instruction makes it query the CRM → it extracts emails from other customers → it sends them to an external URL through its HTTP tool.

No single piece is “a vulnerability” in the classic sense. Together they are a full compromise. A classic pentest, evaluating each component in isolation, doesn’t see it. An AI pentest has to model precisely these chains and test them end to end.

7. MCP and tools: where AI touches privileged code

The Model Context Protocol (MCP) and tool use in general are the bridge between a model that reasons in natural language and real systems with privileges. Both classic and new bugs live here, and the difference from a traditional pentest lies precisely at that intersection.

On one hand, MCP servers are normal code. They have SQL injections, OS command execution, out-of-path file accesses, and uncontrolled outbound requests, just like any other component. The public case of Anthropic’s reference SQLite MCP server (Trend Micro, 2025) [8] showed how a SQL injection allowed malicious instructions to be stored in the database, ready to be executed the next time an agent read them. Vulnerabilities like CVE-2025-49596 (MCP Inspector) [9] or CVE-2025-59528 (Flowise, severity 10.0) [10] reinforce the pattern: agentic builders and gateways inherit all the AppSec debt.

On the other hand, agent-specific vectors emerge. Tool poisoning consists of a malicious MCP server describing the tool it offers with metadata that contains hidden instructions aimed at the LLM (“when they ask you for X, run Y first and don’t tell the user”). That description, in practice, is part of the prompt. There is also the classic confused deputy pattern, a component with legitimate privileges that is manipulated into using them on behalf of a third party. Applied to agents: the user authorises something general, the agent interprets it too broadly, and ends up using the user’s privileges for something the user would never have approved had it been explained.

The essential difference from a classic pentest is that now a classic vulnerability can be triggered by a sentence in natural language. A SQL injection that previously required an attacker with direct access to the endpoint can now be set off by an unsuspecting agent reading a poisoned document. Auditing an environment with MCPs therefore requires combining classic code analysis and fuzzing with adversarial reasoning about how the LLM can drag the server outside its expected behaviour. Neither a classic pentester alone nor an LLM specialist alone covers both sides; someone with experience across both worlds does.

8. RAG and embeddings: poisoning the knowledge

RAG (Retrieval-Augmented Generation), the technique that retrieves relevant documents from a private base and injects them into the model’s context before answering, turns your knowledge base into attack surface.

The difference from a traditional database pentest is twofold. First, permissions in a RAG aren’t enforced by a deterministic engine that understands users and roles, but by a similarity search step that decides which documents are most similar to the question. If that step doesn’t correctly enforce the user’s permissions, the “leak” isn’t a query with a misplaced WHERE clause, it’s a document showing up in a result set where it shouldn’t. Second, until now databases didn’t reason about their content: they returned records. Now the model acts on what is retrieved.

Some specific vectors:

PoisonedRAG (USENIX 2025) [11] showed that five carefully crafted documents inside a base of millions can manipulate the answer to specific questions more than 90% of the time. There’s no need to poison the entire corpus; sneaking into the result window of a query is enough.
Embedding-level poisoning. Hidden instructions (with invisible characters, with typographic tricks) survive the process of turning text into vectors and keep acting when the model reads them.
Partial text recovery from a vector. Published research [12] shows that if vectors are leaked, between 50% and 70% of the original text can be reconstructed. A vector base is not a hash: for many purposes it is a compressed form of the document. This changes the threat model compared to a traditional database.
Cross-tenant leakage in multi-tenant bases with poor isolation, where semantic searches return documents belonging to other tenants because permissions are not propagated to the retrieval step.
Attacks on document parsers. PDF, DOCX, or HTML loaders interpret files differently from how a human sees them, and that difference allows hiding instructions the user will never see but the ingestion pipeline will. There is research reporting success rates of 74% against popular loaders [13].

Classic pentesting evaluates none of this, because the classic defences (access control on the database, encryption at rest, tenant isolation) are applied at layers that here can be bypassed entirely without ever touching the database engine.

9. Attacks on the model as a model

Beyond the prompt, there is a family of attacks inherited from the world of adversarial machine learning that still applies and that traditional pentesting ignores entirely, simply because there is no conceptual equivalent in a non-AI application.

Membership inference: determining whether a specific data point was part of the training or fine-tuning set. It sounds abstract, but it’s a serious problem under GDPR or healthcare regulation if what’s confirmed is, for example, that a particular person was in a medical dataset.
Model cloning through queries: partially reproducing the model’s behaviour by asking it many well-chosen questions and learning from the responses. If the API returns probabilities per token (the so-called logprobs), the attacker gets much more signal per query and the cloning accelerates dramatically. Exposing that data when it isn’t strictly necessary makes the attack easier.
Reconstruction of training examples from the model’s outputs.
Backdoors introduced during fine-tuning: hidden behaviours triggered when the prompt contains a specific sequence, for instance a phrase, an unusual word, or an uncommon emoji.

These tests require a different profile, different instrumentation, and different metrics. They are rarely covered by a traditional pentest with “an LLM layer” added on.

10. Unbounded consumption: DoS, Denial of Wallet, and LLMjacking

LLMs introduce a class of economic attack that traditional pentesting does not contemplate, and the difference from classic defences is very concrete.

The most striking variant is Denial of Wallet. The attacker doesn’t aim to take the service down: they want it to keep running, but to ruin the bill. Prompts are sent that occupy almost the entire context window of the model to maximise the cost of each request. Inputs are crafted that make the model feed itself in a loop. The reasoning mode of newer models, the ones that “think” in writing before answering, is exploited to make them spend thousands of internal tokens even if the final answer is short. There are public cases of bills running into tens of thousands of dollars generated in hours from a single leaked key [14].

There is also LLMjacking: stolen provider credentials (AWS Bedrock, Azure OpenAI, Gemini) used to consume someone else’s compute. Public research has tracked operations costing victims more than $40,000 per day [15].

The difference from a classic pentest is direct. Traditional rate limits count requests per unit of time, and with that metric an attacker can stay well below the threshold while systematically firing the most expensive requests available. What’s needed is a control that measures cost per user per unit of time, token quotas, and alerts on spend, not just on traffic volume. This falls outside the catalogue of controls a traditional web pentest is equipped to evaluate.

11. Safety is not a separate problem: it’s part of the scope

There is a boundary many teams still treat as ethics rather than security: the generation of harmful content, systematic bias, misinformation, emotional manipulation, or refusals that break easily. They are alignment problems, but they are also direct legal, reputational, and regulatory risks, from GDPR and the EU AI Act to sectoral regulation in healthcare or finance.

There’s also another dimension that traditional pentesting does not have to solve. In a classic application, the impact of a failure is objective. If an unauthorised identity has read a database, accessed the file system, or executed a privileged action, there is a problem and no interpretation is required. That category of failures still exists in an AI application and is evaluated with the usual criteria. But on top of that, another layer appears: the responses the model gives in natural language can be technically correct and still go against the system’s purpose or against the interests of the organisation offering it. A banking assistant that starts speculating about cryptocurrencies, a support agent that recommends competitor products, a chatbot for a medical service that gives advice outside the authorised clinical scope. Knowing where that line lies is not something the pentest team can decide on its own. To prevent the assessment from depending on the evaluator’s subjectivity, you have to understand precisely the system’s purpose, the organisation’s specific concerns, and the concrete criteria the client considers acceptable or unacceptable in the model’s outputs. Without that business context, this layer of risk cannot be properly evaluated.

A complete AI pentest also evaluates this layer: attempts to bypass restrictions to obtain prohibited content, differential biases across demographic groups, the model’s ability to persuade or manipulate, robustness of refusal under role-play and multi-turn pressure. Something not landing as a CVE doesn’t mean it won’t land in the courts or in the press.

What a thorough AI pentest should cover at minimum

Each architecture opens up its own fronts and the attack landscape is constantly evolving, but any serious AI pentest should contemplate, at minimum, a list along these lines:

Specific threat modelling over the actual architecture, including agentic flows and real trust boundaries.
Honest scoping based on what is actually under control (own model, fine-tuning, RAG, integration over a third-party API).
Direct and indirect prompt injection across all sources of context, from the RAG to uploaded files, browsed websites, emails, tickets, metadata, and images.
Guardrails treated as adversarial systems: measure their success rate, look for evasions, test multi-turn robustness, and if there is a model-judge, attack it too.
Lethal trifecta evaluation at the agent level and the search for plausible chains of compromise.
Auditing of tools and MCP servers: classic pentest plus tool poisoning, strict validation of what the model returns before executing it, and human confirmation on irreversible actions.
Safe handling of model outputs: validation and sanitisation before passing them to browsers, interpreters, databases, or file systems, to prevent the LLM’s output from turning into XSS, SSRF, SQLi, or RCE in downstream components.
RAG attacks: poisoning, cross-tenant leakage, embedding inversion, parser robustness.
Attacks on the model: membership inference, query-based cloning, training reconstruction, backdoors inherited from fine-tuning.
Consumption: cost-based testing (not just volume), recursion, context flooding.
Extraction and leakage: system prompt leak, leakage of personal data, memorisation, escape through tools.
Multi-turn robustness: Crescendo, Deceptive Delight, prolonged role-play, behavioural degradation as the conversation grows longer.
Safety as part of the scope: jumps to harmful content, biases, manipulation, applicable regulatory compliance.
Supply chain: provenance of models, tokenisers, public MCPs, embedding providers, integrity verified through signing.
Reproducible metrics: success rates with confidence intervals, repetition conditions, and the exact model versions used.

If a vendor cannot explain how they tackle most of these points, they’re probably selling you a traditional pentest with a coat of paint.

The team’s profile matters more than it might seem

The AI pentester is not a traditional pentester who has been given an OpenAI course. Nor is it a data scientist who has read the OWASP Top 10. It is the point where those two profiles meet. Someone who understands injections, insecure deserialisation, authorisation bypasses, and lateral movement between services, and who at the same time understands why a model is more susceptible to an attack rephrased in a language poorly represented in its training, what happens when you raise the model’s temperature from 0.1 to 0.9, or why a stored embedding is essentially the original document.

That double depth is what allows the team to detect things no scanner will see. The email with an invisible character that drives the agent into doing something it shouldn’t, the free-text field of a ticket that ends up executing as instruction, the loop that turns a reasoning model into an open tap for tokens, or the community-sourced MCP server with a SQL injection that doesn’t even need to be exploited directly: it’s enough that the agent queries it.

The recently published ETSI EN 304 223 [16], the first European standard with global applicability that establishes specific cybersecurity requirements for AI models and systems, makes this explicit in its provision 5.2.5-2.1:

“For security testing, System Operators and Developers should use independent security testers with technical skills relevant to their AI systems.”

The direction set by the standard is clear: when what’s being tested is an AI system, the team’s technical capabilities matter, and not every pentest team will do.

At Kaptor Security that’s exactly what we do: pentesting on applications, agents, and infrastructures based on AI. We combine years of experience in traditional pentesting with deep specialisation in artificial intelligence, which lets us tackle this kind of audit without giving up either perspective. We aren’t a consultancy that has added AI to its catalogue; it’s our focus. If you’re building or deploying systems with AI components and you’d like to know how they would hold up against a real attacker, let’s talk.

References

OWASP Top 10 for Large Language Model Applications (2025). https://genai.owasp.org/llm-top-10/
Russinovich, M., Salem, A., Eldan, R. Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack. USENIX Security, 2025. https://arxiv.org/abs/2404.01833
Unit 42, Palo Alto Networks. Deceptive Delight: Jailbreak LLMs Through Camouflage and Distraction. https://unit42.paloaltonetworks.com/jailbreak-llms-through-camouflage-distraction/
Unit 42, Palo Alto Networks. Bad Likert Judge: A Novel Multi-Turn Technique to Jailbreak LLMs by Misusing Their Evaluation Capability. https://unit42.paloaltonetworks.com/multi-turn-technique-jailbreaks-llms/
Kaptor Security. Blind Prompt Injection: The New Blind SQL Injection in AI Automations. https://kaptor.ai/blog/blind-prompt-injection.html
Willison, S. The lethal trifecta for AI agents. https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/
OWASP Top 10 for Agentic Applications (2025/2026). https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/
Trend Micro Research. Why a Classic MCP Server Vulnerability Can Undermine Your Entire AI Agent (2025). https://www.trendmicro.com/en_us/research/25/f/why-a-classic-mcp-server-vulnerability-can-undermine-your-entire-ai-agent.html
CVE-2025-49596 (MCP Inspector). https://nvd.nist.gov/vuln/detail/CVE-2025-49596
The Hacker News. Flowise AI Agent Builder Under Active CVSS 10.0 RCE Exploitation (CVE-2025-59528). https://thehackernews.com/2026/04/flowise-ai-agent-builder-under-active.html
Zou, W. et al. PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models. USENIX Security, 2025. https://arxiv.org/abs/2402.07867
Schneider, C. RAG security: the forgotten attack surface (summary of research on embedding inversion and OWASP LLM08). https://christian-schneider.net/blog/rag-security-forgotten-attack-surface/
The Hidden Threat in Plain Text: Attacking RAG Data Loaders. ACM AISec Workshop, 2025. https://dl.acm.org/doi/10.1145/3733799.3762976
Sysdig Threat Research Team / MSSP Alert. LLMjacking Victims Can Lose Money, and Fast. Documented case of 80,000 calls to Bedrock in 3 hours, costing the victim between $24,000 and $30,000. https://www.msspalert.com/analysis/sysdig-llmjacking-victims-can-lose-money-and-fast
Brucato, A. LLMjacking: Stolen Cloud Credentials Used in New AI Attack. Sysdig Threat Research Team, 2024. Estimated cost to the victim above $46,000/day against Anthropic Claude 2.x on AWS Bedrock. https://www.sysdig.com/blog/llmjacking-stolen-cloud-credentials-used-in-new-ai-attack
ETSI EN 304 223 V2.1.1 (2025-12). Securing Artificial Intelligence (SAI); Baseline Cyber Security Requirements for AI Models and Systems. https://www.etsi.org/deliver/etsi_en/304200_304299/304223/02.01.01_60/en_304223v020101p.pdf

Classic pentest vs. AI pentest: why it isn’t “the same thing with extra steps”

Contents