Kaptor Research

AI applied to pentesting: approaches, architectures, and where it actually pays off

When, how, and at what cost does it make sense to lean on AI during a pentest.

Author José Rabal Sastre
Co-founder & AI Security Lead at Kaptor
Date 14 May 2026


Introduction

In recent years, artificial intelligence has gone from being an occasional curiosity in the offensive security world to becoming an increasingly recurring part of many pentesters’ workflows. The pace of academic publications, open source frameworks, and commercial tools orbiting around “AI pentesting” has accelerated sharply, and the list keeps growing.

This article doesn’t aim to enter the debate about whether AI will replace human pentesters. That discussion is saturated and, honestly, it distracts from the real question, which is much more concrete: when, how, and at what cost does it make sense to lean on AI during a pentest? What we do try to do here is review the main approaches being used, weigh the pros and cons of each, talk about the architectural components that make the difference between a useful tool and a token-burner, and share some practical thoughts on when each path is worth it.

The content that follows is based on the firsthand experience of the team at Kaptor Security: not only from testing third-party tools, but also from building our own internal solutions that support our day-to-day work, solutions we keep developing and refining as our work in cybersecurity and artificial intelligence evolves.


Before the approaches: the model and the cost

There are two variables that cut across everything else and are worth making clear from the start.

The first is the quality of the underlying model. The model’s ability to reason, maintain coherence on long tasks, generate creative payloads, and understand technical context makes a huge difference. In practice, models that excel at programming tend to excel at pentesting too, because a good part of the work, such as understanding stack traces, writing ad-hoc scripts, reading strange server responses, or chaining primitives, is reasoning about code and protocols. The frontier models from the major labs make a clear difference compared to smaller alternatives, not only in raw performance but in stability: they hallucinate less, tend not to overstate impact as much, abandon dead ends (lines of investigation that lead to no finding) earlier, and keep better track of the thread in long sessions.

The second variable is cost. Token spending depends a lot on the combination of the chosen model, the architecture it runs on, and the type of task. Using a frontier model with the full context of a serious pentest in an agentic architecture that iterates for hours bears no resemblance to a one-off query to chat for a specific payload. What does hold as a stable pattern is something more qualitative: an experienced human prioritizes with judgment that the agent still doesn’t replicate reliably, and that’s why every token consumed on dead ends or incoherent exploration ends up translating into runaway bills. The economics change a lot depending on whether you’re an organization with an industrial budget or an individual pentester playing with your API key, and it’s worth keeping this in mind when evaluating any of the approaches that follow.

The use cases are very different: wanting AI to run a complete autonomous pentest is not the same as using it as a complement for specific tasks. In most personal cases, unless you have a big budget, attempting the former will get you something both expensive and mediocre, and the “definitive solutions” that pop up every week will mostly be selling you snake oil.


The approaches

1. Classic conversational queries

Though it may be obvious, it’s worth including the most basic use of AI for pentesting: chat. That’s how we all started, and to a greater or lesser extent, we all keep doing it. You give it context, whether a weird endpoint, a code snippet, or a server response, and you ask for something: an opinion, specific payloads, an ad-hoc script, information about a framework, ideas for attack paths. The AI is completely passive. You are the active party: you execute, you observe, you decide.

This mode is direct and introduces no scope risks, since the AI doesn’t touch the target. Its limit is that all the cognitive load is still on you: there’s no continuity between conversations, no memory of the engagement, no automation. Even so, it has its place for specific tasks like writing a polyglot payload, understanding a strange binary, generating a contextual wordlist, or drafting a POC, where a good model in a chat window provides immediate value without the complexity or the cost of the architectures we’ll cover later.

2. Classic scanners with AI inserted in the loop

The idea here is to take a deterministic scanner, like a fuzzer or a classic XSS, SQLi, or SSRF detector, and delegate to the AI only the specific decisions where creativity or context matter. For example, a scanner that, in order to detect XSS, first deterministically checks whether an injected probe is reflected, analyzes the context where it appears (HTML, attribute, JavaScript...), and then asks the AI to propose payloads tailored to that exact context. The scanner handles delivery, verification, finite retries, and session management, while the AI only contributes bounded creativity to a very specific decision.
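
As a rough illustration of that narrow slot, the sketch below assumes a hypothetical ask_model() wrapper around whatever LLM API you use; the probe marker, the context classification, and the naive reflection check are all deliberately simplified. Everything except the payload proposal stays deterministic.

    # Minimal sketch of the "narrow AI slot" pattern for reflected XSS.
    import requests

    MARKER = "kptr9x"  # unique probe, checked deterministically

    def reflection_context(body: str, marker: str) -> str | None:
        """Very rough classification of where the probe landed."""
        idx = body.find(marker)
        if idx == -1:
            return None
        window = body[max(0, idx - 80): idx + 80].lower()
        if "<script" in window:
            return "javascript"
        if "='" in window or '="' in window:
            return "attribute"
        return "html"

    def scan_param(url: str, param: str, max_attempts: int = 5) -> str | None:
        # 1. Deterministic check: is the probe reflected at all?
        resp = requests.get(url, params={param: MARKER}, timeout=10)
        ctx = reflection_context(resp.text, MARKER)
        if ctx is None:
            return None
        # 2. Bounded creativity: the model proposes payloads for this exact context.
        payloads = ask_model(
            f"Propose {max_attempts} XSS payloads for a value reflected in "
            f"{ctx} context. Return one payload per line, nothing else."
        ).splitlines()
        # 3. Deterministic delivery and verification, with finite retries.
        for payload in payloads[:max_attempts]:
            resp = requests.get(url, params={param: payload}, timeout=10)
            if payload in resp.text:   # naive: checks reflection, not execution
                return payload
        return None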

This approach is relatively cheap compared to the ones we’ll see later for the autonomous pentest case. The problem is that, in practice, it makes very little use of what AI can contribute. The ability to adapt to diverse contexts, chain reasoning about strange server behaviors, or understand when two seemingly independent findings combine into a serious vulnerability is left out of the loop by design. The AI only enters through a very narrow slot in the flow, and most of the work is still done by the same old deterministic code. The result is that, compared to a classic pentest without AI, this approach usually brings marginal improvements, like slightly more creative payloads for one specific vulnerability class, with no noticeable impact on coverage or final report quality. On top of that, it requires specific engineering for each vulnerability type, which limits scalability and ends up making each new bug class a project in itself.

3. Basic agentic mode: the bag of tools

This is where things get interesting and, at the same time, where most cost/benefit problems start. The idea is to plug a set of tools into the model, directly or via MCP, and give it a broad prompt so it can use them freely until it accomplishes the goal. In the current ecosystem there are already several servers exposing hundreds of offensive tools (Nmap, gobuster, sqlmap, nuclei, hashcat, etc.) ready for any compatible LLM client to invoke autonomously.
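
To make the pattern concrete, here is a deliberately small sketch of that loop. ask_model_with_tools() and the shape of its return value are assumptions standing in for whatever client or MCP plumbing you use; only two tools are wired in, and the point of interest is that raw tool output flows straight back into the context.

    # Minimal sketch of the "bag of tools" loop: the model picks the next tool
    # until it declares the goal reached or the step budget runs out.
    import subprocess

    def run_nmap(target: str) -> str:
        out = subprocess.run(["nmap", "-sV", "--top-ports", "100", target],
                             capture_output=True, text=True, timeout=600)
        return out.stdout

    def run_gobuster(url: str, wordlist: str) -> str:
        out = subprocess.run(["gobuster", "dir", "-u", url, "-w", wordlist],
                             capture_output=True, text=True, timeout=1800)
        return out.stdout

    TOOLS = {"nmap": run_nmap, "gobuster": run_gobuster}

    def agent_loop(goal: str, max_steps: int = 20) -> str:
        history = [{"role": "user", "content": goal}]
        for _ in range(max_steps):
            decision = ask_model_with_tools(history, tools=list(TOOLS))
            if decision["type"] == "answer":        # model thinks it is done
                return decision["content"]
            result = TOOLS[decision["tool"]](**decision["arguments"])
            history.append({"role": "tool",         # verbose raw output goes
                            "content": result})     # straight into the context,
        return "step budget exhausted"              # which is where cost starts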

In practice, this mode suffers from recurring problems that anyone who has tried it for a few hours will recognize: verbose tool outputs that saturate the context, erratic prioritization that keeps iterating on dead ends a human would abandon sooner, findings reported without verification, and token spend that grows far out of proportion to the results.

For an individual pentester, basic agentic mode tends not to pay off unless it’s scoped to very specific tasks, like “enumerate subdomains for this target and rank them by interest” or “try variations of payloads...”. As a complete pentest tool driven by a single prompt, it remains quite disappointing relative to what it costs.

4. Agentic mode with orchestration and specialized subagents

The next step, and where most recent academic research is concentrated, consists of adding a planning layer. An orchestrator generates a plan, decomposes the goal into subtasks, and delegates each one to specialized agents: one expert in reconnaissance, another in vulnerability analysis, another in exploitation (XSS, SQLi...), another in post-exploitation, etc. The literature has produced various ways to structure this plan: testing trees, task graphs, pentesting state machines, situation summaries, hierarchical memories with localized context activation, two-phase planning that combines CVEs with target keywords... Multi-agent architectures with differentiated roles (reconnaissance agent, vulnerability analysis agent, exploitation agent) have also been proposed, along with orchestrators specialized by domain: web pentest, privilege escalation, red team, etc.
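
A skeletal version of the plan-and-delegate pattern might look like the sketch below. plan_tasks() and run_subagent() are hypothetical wrappers over the underlying model; what matters structurally is that each subtask gets a clean context and only a summary travels back to the orchestrator.

    ROLES = ["recon", "vuln_analysis", "exploitation", "post_exploitation"]

    def run_engagement(goal: str, scope: dict) -> list[dict]:
        findings = []
        tasks = plan_tasks(goal, scope, roles=ROLES)   # orchestrator builds the plan
        for task in tasks:
            result = run_subagent(role=task["role"],   # fresh context per subtask
                                  task=task["description"],
                                  context=task.get("inputs", {}))
            findings.append({"task": task["description"],
                             "summary": result["summary"],    # only the summary
                             "evidence": result["evidence"]}) # returns, not the trace
            # a real system would let the orchestrator replan here when a subagent
            # surfaces something promising ("going off-plan")
        return findings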

This path is, in theory, well reasoned: it mimics the division of labor of a human team. In practice, it has three recurring problems:

The plans are excessively influenced by the planning prompts. As an experienced pentester, you watch the orchestrator steering the flow down a path you already know will lead nowhere, and yet the agent follows it because “that’s what the plan says”. There are unnecessary tasks no human would do, and obvious paths the orchestrator discards because they don’t fit its mental scheme. Evaluations published over the last year have documented this starkly [7]: agents perform reasonably well on focused tasks, but degrade significantly when they have to prioritize under uncertainty and abandon failed attack lines. Where a human pivots, the agent iterates on variations of the same approach. And the problem is that in the trace it doesn’t look like a failure: the outputs are still fluent and the commands plausible.

Token spending grows non-linearly. A well-designed architecture allows “going off-plan” when something promising appears, but each subagent maintains its own context, tool outputs are verbose, and coordination between the orchestrator and the specialists has a non-trivial fixed cost. A moderately long pentest can multiply spending by ten or twenty compared to conversational mode.

The reconnaissance phase is a brutal bottleneck. Part of recon can lean on classic tools like crawlers, port scanning, or directory fuzzers, and another part on components with finer-grained intelligence, like AI-driven navigation with Playwright to walk through application flows, identify relevant endpoints, or maintain complex sessions. But in the end, both paths converge on the same thing: a considerable amount of information (requests, responses, URL structures, parameters, observed behaviors, logical relationships between endpoints) that the model has to ingest, filter, and interpret to build a useful map of the target. If this isn’t optimized very carefully, you can spend a month’s salary just on the discovery phase, and most of the context entering the model will be noise.

So when does this approach pay off? When you’re an organization with an industrial budget, able to combine intensive use of frontier models with a serious engineering investment that genuinely optimizes reconnaissance, memory, and finding verification. For personal use, if you are already an experienced pentester, it currently tends to be expensive relative to the benefit. And if you give in to the temptation of cheapening it by leaning on low-cost models, you’ll end up sacrificing a considerable part of your time discarding noise just to reach a handful of specific findings you would probably have caught through the classic route with less effort and fewer resources.

5. Copilot with HITL (Human In The Loop)

This is, in my opinion and that of much of the applied literature, the most sensible path for an individual pentester who wants to integrate AI into their work. The idea is to build an architecture where the human is in the loop from the start, not as an emergency fallback.

The conceptual flow looks something like this:

  1. You provide the AI with the context of the engagement: scope, application type, API specifications, technical documentation, authentication information, previous findings.
  2. You request a task, such as auditing an endpoint, looking for broken business logic in a flow, or evaluating the surface of a microservice.
  3. Before acting, the AI proposes a list of investigation lines, with their respective hypotheses and estimated cost.
  4. You review: approve, discard, tweak, prioritize. This is where your cost/benefit judgment and your pentester intuition with hours of experience come in.
  5. The AI executes only the approved paths, each with its own separate history and evolution.
  6. At critical points, such as irreversible actions, strategic decisions, or findings that require validation, it comes back to you.
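
A minimal sketch of the two human checkpoints in this flow (the plan review in step 4 and the irreversibility gate in step 6) could look like the following. propose_investigation_lines() and execute_line() are hypothetical calls into the copilot, and the action categories are illustrative; the detail worth noting is that the operator can edit a line of investigation, not just accept or reject it.

    IRREVERSIBLE = {"delete", "modify_data", "dos", "exfiltrate"}

    def review_plan(lines: list[dict]) -> list[dict]:
        approved = []
        for line in lines:
            print(f"[{line['id']}] {line['hypothesis']} (est. cost: {line['est_cost']})")
            answer = input("approve / skip / edit? ").strip()
            if answer == "edit":
                line["hypothesis"] = input("rewritten hypothesis: ").strip()
                approved.append(line)
            elif answer == "approve":
                approved.append(line)
        return approved

    def run_task(engagement_ctx: dict, task: str) -> None:
        proposed = propose_investigation_lines(engagement_ctx, task)   # step 3
        for line in review_plan(proposed):                             # step 4
            for action in execute_line(engagement_ctx, line):          # step 5
                if action["kind"] in IRREVERSIBLE:                     # step 6
                    if input(f"run irreversible action {action['name']}? [y/N] ") != "y":
                        continue
                action["run"]()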

The reason this works so well is the same reason most modern code agents require explicit approval for each multi-file change, and the same reason operational experience with production agents throughout 2025 has converged, almost without exception, on the same conclusion: organizations that have tried to take the human out of the loop before having guarantees of observability and correctness have had the worst results. Recent research suggests that when humans edit AI-generated output, the final quality is notably higher than when they limit themselves to binarily accepting or rejecting what the model proposes [2]. The act of editing engages critical thinking in a way that approve/reject doesn’t.

In pentesting this matters especially because actions are irreversible and high-impact: deleting logs, modifying data, launching attacks that can take down the service, exfiltrating sensitive information. Distinguishing reversible from irreversible actions, and placing checkpoints only where they add value, is probably the most important design principle of the entire HITL architecture.

A note worth making: many tools are sold as “autonomous” but at their core remain HITL, because they operate over an external LLM client that maintains interaction with the operator. The difference with a well-designed copilot is not the presence or absence of a human, but where and how the human is inserted. Reviewing commands one by one is tiring and ends up being done poorly. Approving lines of investigation with prior analysis, by contrast, adds real value.

6. Generalist code agents: Claude Code and company

A path that has emerged strongly during 2025 and 2026, and deserves its own section, is the use of generalist code agents, such as Claude Code, Cursor, Aider, OpenCode, Codex, and similar, as a base for pentesting. The idea is interesting: these agents already know how to move around in a terminal, chain commands, read and write files, maintain long sessions, and have good reasoning about code. If you give them security tools and a system of “skills” or specialized subagents, you save much of the scaffolding engineering.

The appeal of this approach is twofold. First, the entry cost is low: a skill or a subagent is usually a markdown file, not a Python library with its scaffolding. Second, it inherits all the maturity of the underlying agent in terms of context management, command execution, error handling, and the model’s own evolution. When a better Claude or Codex comes out, all your skills improve at once.

The trade-off is that you’re reusing an agent designed to program, not to attack systems. For it to behave well in an offensive context and be truly useful throughout the lifecycle of a pentest, throwing it a prompt is not enough. You have to design the project leveraging the full configuration surface that modern code agents offer: project instruction files like CLAUDE.md, AGENTS.md or equivalents, skills (SKILL.md), pre/post-action hooks, custom commands, tools, specialized subagents, permissions, etc. Used well, these mechanisms let you turn the generalist agent into a companion with behavior and skills focused on pentesting. And not only from the standpoint of technical methodology, such as what tools to use, what vulnerability classes to explore first, or how to interpret certain target behaviors, but also operationally: how to organize the outputs it generates, where to store findings, what is persisted from each investigation path, what is discarded, how to resume the work in the next session.
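
By way of illustration, a project layout for a Claude Code-style agent adapted to an engagement might look like the sketch below. CLAUDE.md, .claude/settings.json, commands, agents, and skills follow that tool’s documented conventions; the engagement name, the concrete skills, and the findings/notes directories are assumptions about how one might organize things, not a prescribed structure.

    engagement-acme/
    ├── CLAUDE.md                    # scope, rules of engagement, where findings live
    ├── .claude/
    │   ├── settings.json            # permissions and pre/post-action hooks
    │   ├── commands/
    │   │   └── new-finding.md       # custom command to register a finding
    │   ├── agents/
    │   │   └── recon.md             # specialized reconnaissance subagent
    │   └── skills/
    │       ├── web-authz/SKILL.md   # how to test authorization in this app
    │       └── reporting/SKILL.md   # how to structure and persist findings
    ├── findings/                    # one file per confirmed finding
    └── notes/                       # per-session investigation logs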

This is where it connects naturally with approach 5: the mechanisms of hooks, commands, and subagents fit very well with a HITL flow in which the operator proposes, reviews, approves, and lets the agent execute. Well designed, this pattern lets you get the best of both worlds: the operational maturity of generalist code agents, and the human control points that prevent it from going down meaningless paths or reporting hallucinations as findings. It is, probably, one of the most promising approaches today for an individual pentester who wants to seriously integrate AI into their work, though it still requires context engineering and custom curation for the agent’s behavior to be truly aligned with the offensive domain.

There is also a practical consideration that is not minor: the choice of agent determines which model provider you marry. Using Claude Code ties you to Anthropic and its per-token pricing, and using Codex ties you to OpenAI and theirs. As we’ve seen lately, prices, quota limits, usage policies, and even model availability can change from one month to the next, and that directly affects the economic viability of your setup. Considering, in parallel or as an alternative, provider-agnostic agents like OpenCode, which let you connect different models through different APIs, is today a design decision almost as relevant as the project configuration itself. Keeping the ability to swap the model running behind the agent without rewriting your entire setup is a cheap insurance against changes you don’t control.


Architectural components that make the difference

Regardless of the approach, there are a handful of architectural decisions that separate a useful tool from one that gets out of hand.

Memory and continuity

A pentest doesn’t end in a single session. There are weeks-long engagements, findings that connect to others days later, identifiers you discover in phase 2 that turn out to be key in phase 5. Any AI system applied to pentesting must be able to answer, at any point in the engagement, what has already been tested, what was found and where, which credentials and identifiers are in play, and what remains to be explored.

The literature has responded with various structures: the Penetration Testing Tree of PentestGPT [3], the Penetration Memory Tree of TermiAgent [5], which activates memory in a localized way instead of injecting the entire tree into the context, the Penetration Task Graph of VulnBot [4], the State Machine of AutoPT, the Situation Summaries of AutoAttacker. All of them attack the same problem from different angles: how to maintain a state external to the model that can be queried, updated, and summarized without saturating the context. The classic separation between short-term memory, which is the active context window, and long-term memory, which is a persistent external storage with semantic search, is essential. No serious system should rely on the model remembering what happened 50 turns ago.

It’s also worth distinguishing three types of memory, borrowed from cognitive psychology but useful here: episodic, which records what happened and when (for example, “endpoint X returned a 500 with payload Y on Tuesday”), semantic, which stores generalized knowledge (“this API uses JWT with the HS256 algorithm”), and procedural, which stores how to do things (“for this application, the session is maintained with a SESSIONID cookie renewed every 15 min”). Agents that combine all three tend to behave much better on long engagements.
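
A minimal sketch of such a store, kept external to the model, is shown below. The class and file names are illustrative, and the recall step is a plain keyword filter where a real system would use semantic search over embeddings; the point is only the shape of the interface and the three kinds of records.

    import json, time
    from pathlib import Path

    class EngagementMemory:
        def __init__(self, path: str):
            self.path = Path(path)
            self.records = json.loads(self.path.read_text()) if self.path.exists() else []

        def remember(self, kind: str, content: str) -> None:
            # kind: "episodic" (what happened and when), "semantic" (facts about
            # the target), or "procedural" (how to do things in this environment)
            self.records.append({"kind": kind, "content": content, "ts": time.time()})
            self.path.write_text(json.dumps(self.records, indent=2))

        def recall(self, query: str, kind: str | None = None) -> list[str]:
            return [r["content"] for r in self.records
                    if (kind is None or r["kind"] == kind)
                    and query.lower() in r["content"].lower()]

    # Usage, following the examples in the text:
    # mem = EngagementMemory("engagement_mem.json")
    # mem.remember("episodic", "endpoint X returned a 500 with payload Y on Tuesday")
    # mem.remember("semantic", "this API uses JWT with the HS256 algorithm")
    # mem.remember("procedural", "session kept via SESSIONID cookie renewed every 15 min")
    # mem.recall("JWT", kind="semantic")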

A critical note: persistent memory is also an attack surface [6]. An agent with memory that can be read and rewritten can be poisoned between sessions or accessed without authorization, and in a pentesting context that is especially delicate: it’s where the most sensitive data of the engagement is stored.

Context management and compression

Context is a constrained and expensive resource. An agent that receives verbose tool outputs without filtering will eventually hit the window limit, spend a fortune in tokens, and degrade its own attention. That’s why, in any serious agentic architecture, context management ceases to be an implementation detail and becomes a central part of the design.

The ideas that work best in practice revolve around three principles. First, don’t inflate the context with information that has already served its purpose: if a tool confirmed that a port is open, we don’t need to keep the full scan output spinning around for the rest of the engagement. Second, summarize and archive completed phases in external memory before the context saturates, keeping in the active window only what is directly relevant to what is being done right now. And third, partition the work into subagents with clean contexts when a specific task can be solved autonomously and we only need to return the result to the coordinator, not the entire trace.
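
As a sketch of how those three principles might translate into code: summarize() stands in for a call to a cheap summarization model, memory is any persistent store like the one sketched in the memory section above, and the field names on each history item are assumptions about how the agent tags its own context.

    def compact_context(history: list[dict], memory, max_items: int = 40) -> list[dict]:
        compacted = []
        for item in history:
            if item.get("phase_done"):
                # principle 2: summarize and archive the finished phase externally
                memory.remember("episodic", summarize(item["content"]))
                continue
            if item.get("kind") == "tool_output" and item.get("conclusion"):
                # principle 1: keep the conclusion ("port 443 open"), drop the raw dump
                compacted.append({"role": "tool", "content": item["conclusion"]})
                continue
            compacted.append(item)
        # principle 3 lives elsewhere: self-contained tasks go to a subagent with a
        # clean context, and only its result is merged back
        return compacted[-max_items:]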

Calibrating the prompting

An idea worth keeping in mind from the beginning, because it conditions many design decisions: the better a model performs at cybersecurity tasks, the less it benefits from, and the more it can be hurt by, loading its prompts with very detailed methodologies based on how we would do it ourselves. The analogy with programming makes it clear. If you explain step by step to a frontier programming model how to apply a backtracking algorithm, you’re not teaching it anything it doesn’t already know, and in many contexts you may be forcing it into a rigid sequence that the specific problem doesn’t need.

The same thing happens in pentesting. Applying our knowledge as pentesters adds value when it serves to scope the work, set priorities, give target context, or transfer prior findings. But filling the prompt with steps to follow, detailed checklists, and closed methodologies, especially with models that already know about pentesting on their own, tends to end up undermining their capacity to adapt, to find non-obvious paths, and to apply their own expert judgment. Let’s call it “over-prompting”. The idea is not to stop guiding the model, but to avoid replacing its reasoning with a recipe.

Verification and anti-hallucination

Probably the most underestimated problem. Models can claim, and frequently do, that they’ve found valid credentials that are actually fabricated, declare a vulnerability exploited when the response wasn’t correctly interpreted, or present information that is actually public as if it were sensitive.

The practical magnitude of this is visible in ecosystem data. The curl project, after years of running a productive bug bounty program, decided to shut it down in January 2026 because the flood of AI-generated reports, with invented vulnerabilities, non-existent functions, and exploits that didn’t work, made the rate of genuine findings drop to around 5% [8]. It’s an extreme example, but it illustrates what happens when the output of a model is accepted without a serious verification layer.

Architectures that take this seriously incorporate a deterministic validation phase after each significant finding. For example, if the agent claims to have exploited an XSS, the system replays the payload in a headless browser and verifies that an alert was indeed triggered. If it claims to have discovered a valid credential, the credential is tested against the corresponding endpoint to confirm that the login works. The idea is not to trust the model’s interpretation and to require observable proof of the finding.
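
For the XSS case, a deterministic validation step of this kind is easy to sketch with Playwright: replay the URL carrying the claimed payload in a headless browser and require an actual dialog to fire, instead of trusting the model’s reading of the response. This assumes Playwright and a Chromium build are installed; the function name and timeout values are arbitrary.

    from playwright.sync_api import sync_playwright

    def xss_confirmed(url_with_payload: str) -> bool:
        fired = {"dialog": False}
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.on("dialog", lambda d: (fired.update(dialog=True), d.dismiss()))
            try:
                page.goto(url_with_payload, wait_until="networkidle", timeout=15000)
                page.wait_for_timeout(2000)   # give delayed scripts a chance to run
            finally:
                browser.close()
        return fired["dialog"]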

That said, not all vulnerabilities are this easy to validate. A vulnerability related to business logic, or to an authorization control, can admit interpretations that, at times, are not even immediately obvious to a human, and depend on the product context, on the intent of the flow, or on what is considered legitimate access. This is where HITL becomes key again: the human is the one who decides which agent findings are promoted to confirmed vulnerabilities, which are discarded, and which remain in the limbo of hypotheses pending deeper review. Without this filter, the final report is doomed to mix real findings with noise, and that, in the worst case, is worse than not having detected anything: making a remediation team waste time on a false positive destroys the credibility of the entire engagement.

Going off-plan and exploration-exploitation

Almost all architectures with an orchestrator share a limitation that a recent paper identified clearly [7]: all of them struggle with an important question, which is “is it worth continuing down this path?”. They have structure, whether in the form of trees, graphs, state machines, or memory, but they don’t have difficulty metrics or solid mechanisms to decide when to abandon a path and try another. The result is that they get stuck iterating on the same approach when a human would have already pivoted.

The most recent architectures are beginning to incorporate mechanisms to evaluate the difficulty of each subtask and to guide exploration more intelligently, balancing depth in the current path with openness to alternative paths when the current one isn’t advancing. It’s not magic, but it indicates that the problem is identified and there is a clear path of improvement.
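
Conceptually these mechanisms can be as simple as tracking whether recent steps on a path produced any new signal (new endpoints, new behaviors, changed responses) and backing off when the path stalls. The sketch below is illustrative only; the thresholds and the notion of "signal" are assumptions, not values taken from the literature.

    def should_abandon(path: dict, stall_limit: int = 4) -> bool:
        return path.get("steps_without_new_signal", 0) >= stall_limit

    def record_step(path: dict, new_signal: bool) -> None:
        path["steps"] = path.get("steps", 0) + 1
        path["steps_without_new_signal"] = (
            0 if new_signal else path.get("steps_without_new_signal", 0) + 1)

    def pick_next_path(paths: list[dict]) -> dict | None:
        # prefer live paths with the freshest signal; None means stop or replan
        live = [p for p in paths if not should_abandon(p)]
        if not live:
            return None
        return min(live, key=lambda p: p.get("steps_without_new_signal", 0))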


Picking what pays off

No approach dominates the others. A pragmatic, condensed recommendation:

For recurring personal use without an industrial budget, probably the most sustainable option is an architecture that combines approaches 5 and 6: a HITL copilot supported by a code agent with curated skills, whether Claude Code, Codex, Cursor, OpenCode, or similar, depending on the case. That combination allows you to maintain continuity and memory between sessions, leverage the control mechanisms that modern code agents already provide, and preserve human judgment at the points where it really adds value.

For specific and well-scoped tasks, like creative fuzzing, contextual payload generation, binary analysis, or POC drafting, basic agentic mode with targeted tools performs well, especially if you combine it with deterministic verification. Pentesting MCP servers and subagent frameworks fit here.

For a complete, autonomous pentest, the practical answer remains cautious. In real pentests, as of 2026, this approach rarely pays off for an individual pentester looking for a day-to-day solution without taking on high costs in money, engineering, and manual review of results. Most tools sold today as “autonomous pentest”, with the exception of the handful of commercial offerings everyone has in mind (and even those don’t escape manual review and filtering), are still well below the level they promise.

Above any concrete recommendation, a more general idea remains. Despite the limitations and challenges we have been describing, by choosing a good architecture for each use case, AI can bring interesting value to a pentester’s daily work. And as models continue evolving in code and pentesting tasks, that value only keeps growing. The question for the coming years is not so much whether it’s worth integrating AI into the workflow, but when and how to do so to take advantage of it without losing control over what is actually happening.

At Kaptor Security we have developed tools that address all of the approaches reviewed above, and we keep evolving in order to have an internal infrastructure that improves our performance on every engagement. And if your organization is integrating AI into its processes and you want us to apply our manual and AI-guided analyses to verify whether you’re doing it securely, don’t hesitate to contact us.


References

  1. Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing. arXiv:2512.09882. Available at: https://arxiv.org/abs/2512.09882
  2. Taimoor Z., Human-in-the-Loop (HITL) for AI Agents: Patterns and Best Practices, DEV Community (2026). Available at: https://dev.to/taimoor__z/-human-in-the-loop-hitl-for-ai-agents-patterns-and-best-practices-5ep5
  3. G. Deng et al., PentestGPT: An LLM-empowered automatic penetration testing tool. arXiv:2308.06782 (2024). Available at: https://arxiv.org/abs/2308.06782
  4. H. Kong et al., VulnBot: Autonomous Penetration Testing for a Multi-Agent Collaborative Framework. arXiv:2501.13411 (2025). Available at: https://arxiv.org/abs/2501.13411
  5. Shell or Nothing: Real-World Benchmarks and Memory-Activated Agents for Automated Penetration Testing, paper introducing TermiAgent and the Penetration Memory Tree. arXiv:2509.09207. Available at: https://arxiv.org/abs/2509.09207
  6. A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty. arXiv:2604.16548. Available at: https://arxiv.org/abs/2604.16548
  7. What Makes a Good LLM Agent for Real-world Penetration Testing? arXiv:2602.17622. Available at: https://arxiv.org/abs/2602.17622
  8. Daniel Stenberg, Death by a thousand slops (July 2025), curl’s own maintainer analyzing the impact of AI-generated reports on the project’s bug bounty: https://daniel.haxx.se/blog/2025/07/14/death-by-a-thousand-slops/. Formal announcement of the program’s closure: Bleeping Computer, Curl ending bug bounty program after flood of AI slop reports (January 2026), https://www.bleepingcomputer.com/news/security/curl-ending-bug-bounty-program-after-flood-of-ai-slop-reports/