One question is starting to come up more frequently in conversations about penetration testing, and it is an entirely reasonable one.
People are asking why, if Claude, Codex Security, or any number of other AI tools can autonomously scan a codebase, identify vulnerabilities, and generate reports, they still need to engage a pen tester. The answer is nuanced and deserves a proper discussion (and is certainly more nuanced than either the vendors selling AI security tooling or the sceptics dismissing it would have you believe).
There is no denying that AI security tools are genuinely useful, and their capability is advancing fast. However, there is a significant gap between what is being marketed and what is actually happening in practice, and that gap has real consequences for how you think about your security posture.
What AI Tooling Actually Does Well
AI-assisted security tooling has made meaningful improvements in specific areas. Automated scanning of codebases at scale, identification of known vulnerability patterns across large dependency trees, and the speed at which findings can be surfaced have all improved considerably.
Claude Code Security and Codex Security, both launched earlier this year, are capable tools and worth investing in as part of a broader security programme. Claude Opus 4.8 has recently launched, and Claude Mythos, OpenAI’s Daybreak, and perhaps even Microsoft’s Codename MDASH (currently in “expanded preview”, as announced last week) promise further step-change improvements on those.
The positives are tangible: a tool that can scan your entire codebase for instances of a known vulnerability class in minutes, rather than the hours a human tester would require, is genuinely adding value. The same applies to dependency analysis: artificial intelligence can cross-reference your third-party plugins and dependencies against known CVEs at a speed and scale that no manual process can match.
But where the marketing tends to overreach is in presenting these capabilities as a substitute for structured, standards-based penetration testing rather than as a complement to it.
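To make the dependency cross-referencing described above concrete, here is a minimal sketch of the matching step. It is an illustration under stated assumptions, not how any particular tool works: the `Advisory` record, the package names, the advisory identifiers, and the purely numeric version scheme are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    """A known-vulnerability record. All fields are illustrative."""
    advisory_id: str   # hypothetical identifier, not a real CVE
    package: str
    fixed_in: tuple    # first version that contains the fix, e.g. (1, 5, 0)


def parse_version(text: str) -> tuple:
    """Turn '1.4.2' into (1, 4, 2). Assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


def affected(installed: dict, advisories: list) -> list:
    """Return (package, installed_version, advisory_id) for every match."""
    hits = []
    for adv in advisories:
        version = installed.get(adv.package)
        # A dependency is flagged when it is installed and older than the fix.
        if version is not None and parse_version(version) < adv.fixed_in:
            hits.append((adv.package, version, adv.advisory_id))
    return sorted(hits)


# Hypothetical data for illustration only.
INSTALLED = {"examplelib": "1.4.2", "otherlib": "3.0.0"}
ADVISORIES = [
    Advisory("EXAMPLE-0001", "examplelib", (1, 5, 0)),
    Advisory("EXAMPLE-0002", "otherlib", (2, 9, 1)),
]

print(affected(INSTALLED, ADVISORIES))
```

The lookup itself is mechanical, which is why it automates so well. What it does not establish is whether the flagged code path is ever reachable in your deployment.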
The Noise-to-Signal Problem
The first practical limitation is one the industry is already grappling with. As we covered in our piece on the AI Vulnerability Storm, the curl project famously shut down its bug bounty programme for a period because it was drowning in hallucinated, AI-generated vulnerability reports. It has since reversed that decision, and the quality of AI-generated findings is improving, but the underlying problem has not disappeared: more output does not automatically mean more actionable advice.
For SaaS teams in particular, this matters more than it might elsewhere. Security functions at SaaS businesses are often lean relative to the size and complexity of what they are responsible for, and triage has a real cost. An AI-generated report that produces hundreds of potential issues without clear or accurate prioritisation or confidence ratings does not reduce your team’s workload; it redistributes and potentially confuses it. Someone still has to work through what is real, what is exploitable in your specific environment, and what might have to wait.
The signal-to-noise ratio is improving, but it is still a real operational constraint. And as AI tooling becomes more widely deployed across both offensive and defensive security, the volume of findings, real and otherwise, is only going to increase, not just in your code but in all the many dependencies your application rests upon, in some cases from the operating system up.
The Assurance Gap
The second limitation is more fundamental, and it is the one that tends to get lost in conversations about AI security tooling.
An AI scan of your codebase, however thorough, does not produce an Assurance Report that refers to a particular testing methodology. It does not tell you that your application has been tested against a standard such as the OWASP ASVS (Application Security Verification Standard), that qualified and experienced testers with relevant certifications reviewed the findings, or that the testing covered your full attack surface in a systematic and auditable way.
An AI scan cannot produce the kind of documentation that you hand to an auditor, a governance board, or a prospective enterprise customer who wants evidence that your security programme is robust, considered, and meets a recognised standard. And even if (when) it can, it probably isn’t what they are actually after.
For SaaS businesses, this is also a commercial reality. Enterprise procurement processes and compliance frameworks are increasingly asking for that evidence explicitly, and “we ran an AI scan” does not satisfy the question in the way a standards-based pen test report does.
The assurance gap is not just an internal security concern; it shows up in sales cycles, in customer due diligence questionnaires, and in conversations with auditors. Who provides the stamp of approval on an AI-generated security report? At the moment, no one does, and that is not a gap that better AI tooling resolves on its own.
The Defender’s Dilemma
Here is where the picture becomes more urgent for technical leaders thinking about their security posture, because the AI tooling question is not just about your pen testing programme; it is about the environment you are operating in.
The same capabilities that make artificial intelligence useful for defensive scanning are being pointed at your infrastructure by people who do not have your best interests in mind. As the AI Vulnerability Storm briefing made clear, the open source dependencies your SaaS product is built on are already being scanned at scale by AI systems capable of finding vulnerabilities that went undetected by humans and their previous tools for decades. The Linux kernel went from roughly two vulnerability reports per week to ten. A CVSS 9.8 flaw in OpenSSL that had existed since 1998 was discovered by a frontier AI system in 2026.
SaaS products carry a specific exposure profile that makes this particularly consequential. A continuously deployed, API-first application with a multi-tenant architecture presents a broad and constantly shifting attack surface. Your codebase is not a periodic release; it is a live target, updated frequently (through dependency updates or vulnerability fixes, if nothing else), with authentication boundaries, tenant isolation, and data access logic that need to hold under conditions that automated tooling alone is not well placed to fully characterise. A vulnerability that is low severity in isolation can become critical in a multi-tenant context where the blast radius extends across your entire customer base.
This creates a specific problem that AI tooling alone does not solve. You may find yourself receiving, or needing to respond to, an avalanche of vulnerability reports: from AI-assisted researchers, from automated scanning tools, and from dependency advisories triggered by findings in your software supply chain. The volume of legitimate, high-severity findings is increasing faster than most security teams are resourced to absorb.
The real defender’s dilemma is not just that attackers are getting better tools. It is that the signal is getting louder on all sides simultaneously, and prioritisation becomes the critical capability. Which findings represent genuine, exploitable risk to your specific application and environment? Which dependencies are actually reachable by an attacker? What is your real attack surface, as opposed to the surface that any automated tool can see?
Those are questions that require judgment, context, and structured methodology to answer reliably. An AI report can tell you what it found. It cannot tell you what it missed, or why what it found matters in the specific context of how your application is deployed and used.
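As a rough illustration of the prioritisation questions above, the sketch below ranks findings by combining severity with confidence, reachability, and multi-tenant blast radius. Everything in it is an assumption made for the example: the `Finding` fields, the weighting factors, and the sample findings are hypothetical, and the weights in particular are arbitrary rather than drawn from any standard.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    severity: float     # 0.0 to 10.0, for example a CVSS base score
    confidence: float   # 0.0 to 1.0, how likely the finding is real
    reachable: bool     # can an attacker actually reach the affected code?
    cross_tenant: bool  # could exploitation affect other tenants?


def priority(f: Finding) -> float:
    """Score a finding. The 0.2 and 1.5 weights are illustrative assumptions."""
    score = f.severity * f.confidence
    if not f.reachable:
        score *= 0.2   # unreachable code is far less urgent
    if f.cross_tenant:
        score *= 1.5   # blast radius extends across customers
    return score


def triage(findings: list) -> list:
    """Return findings ordered from most to least urgent."""
    return sorted(findings, key=priority, reverse=True)


# Hypothetical findings for illustration only.
FINDINGS = [
    Finding("Outdated parser in unused build script", 9.8, 0.9, False, False),
    Finding("Missing tenant check on export endpoint", 5.3, 0.8, True, True),
    Finding("Possible SQL injection (unverified)", 8.0, 0.3, True, False),
]

for f in triage(FINDINGS):
    print(f"{priority(f):5.2f}  {f.title}")
```

With these sample values, the medium-severity tenant isolation issue ranks above the unreachable 9.8, which is the point made earlier about low-severity findings becoming critical in a multi-tenant context. The arithmetic is trivial; deciding whether `reachable` and `cross_tenant` are true for your application is the part that still depends on human judgment and a structured methodology.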
What This Means for Your Testing Programme
None of this means that AI tooling should be ignored or deprioritised. Running AI-assisted scanning against your own codebase before an adversary does is a reasonable and increasingly accessible capability, and we encourage this as part of a layered approach. But it sits alongside, and in between rounds of, regular structured penetration testing, not in place of it.
The value of a rigorous, standards-based pen test in this environment is not diminished by the availability of AI scanning tools. If anything, the increasing volume and sophistication of automated findings makes the human judgment layer even more valuable. Someone needs to contextualise, prioritise, and provide a defensible account of what your actual security posture is, in terms that matter to your business and your stakeholders.
The organisations that will navigate the current period most effectively are those that treat AI tooling as an input into their security programme, not the programme itself, and that maintain the kind of continuous, structured assurance that lets them answer the harder questions. It’s not just a question of “what did the scan find?” but “do we have confidence in the full picture of our attack surface?”
That question is still one for humans to answer, and the bar for answering it credibly is rising, not falling.
If you want to explore this topic in more depth, including the chance to put your questions directly to our director, we are running a webinar next month on The Role of AI in Modern Pen Testing. Register here.
