It's way too early to make firm predictions here, but if you're not already in the field it's helpful to know there's been 20 years of effort at automating "pen-testing", and the specific subset of testing this project focused on (network pentesting --- as opposed to app pentesting, which targets specifically identified network applications) is already essentially fully automated.
I would expect over the medium term agent platforms to trounce un-augmented human testing teams in basically all the "routinized" pentesting tasks --- network, web, mobile, source code reviews. There are too many aspects of the work that are just perfect fits for agent loops.
Automated app pentest scanners find the bottom 10-20% of vulns; no real pentester would consider them great. Agents might get us to the 40-50% range; what they are really good at is finding "signals" that the human should investigate.
I agree with you about scanners (we banned them at Matasano), but not about the ceiling for agents. Having written agent loops for somewhat similar "surface and contextualize hypotheses from large volumes of telemetry" problems, and, of course, having delivered hundreds of application pentests: I think 80-90% of all the findings in a web pentest report, and functionally all of the findings in a netpen report, are within 12-18 months' reach of agent developers.
I agree with the prediction. The key driver here isn't even model intelligence, but horizontal scaling. A human pentester is constrained by time and attention, whereas an agent can spin up 1,000 parallel sub-agents to test every wild hypothesis and every API parameter for every conceivable injection. Even if the success rate of a single agent attempt is lower than a human's, the sheer volume of attempts more than compensates for it.
They also don't fatigue in the same way humans do. Within the constraint of a netpen, a human might be, say, 20% more creative at peak performance than an agent loop. But an agent loop will operate within a narrow band of its own peak performance throughout the whole test, on every stimulus/response trial it does. Humans cannot do that.
I wonder how the baseline for 100% is established - is there (security-relevant) software that you'd say is essentially free of vulnerabilities?
Nope! It's extremely unknowable.
Would be curious to hear your hypothesis on what's the remaining 10-20% that might be out of reach? Business logic bugs?
Honestly I'm just trying to be nice about it. I don't know that I can tell you a story about the 90% ceiling that makes any sense, especially since you can task 3 different high-caliber teams of senior software security people on an app and get 3 different (overlapping, but different) sets of vulnerabilities back. By the end of 2027, if you did a triangle test, 2:1 agents/humans or vice versa, I don't think you'd be able to distinguish.
Just registering the prediction.
I would take the other side of that bet.
I don't understand what you're trying to say here.
Just that the superficial details of how AI communicate (e.g. with lots of emojis) might give them away in any triangle test :)
Ah! Touche.
I see this emoji thing being mentioned a lot recently, but I don't remember ever seeing one. Granted I rarely use AI and when I do it's on duck.ai. What models are (ab)using emojis?
I'd say I agree with you there for the low-hanging fruit. The deep research ("there's an image filter here, but we can bypass it by knowing some obscure corner of the SVG spec") is where they still fall over and need hand-holding: pointing them at the browser rendering stack, specs, etc.
Until those obscure corner cases are fed into the next training round.
It doesn't even need to be trained. Just feed it parts of the spec. I found some interesting implementation edge cases just by submitting the source and the PDF spec of a chip to Claude. Not even in a fancy way.
So the stuff that agents would excel at is essentially just the "checklist" part of the job? Check A, B, C, possibly using tools X, Y, Z, possibly multi-step checks but everything still well-defined.
Whereas finding novel exploits would still be the domain of human experts?
Well, agents can't discover bypass attacks because they don't have memory. That was what DNCs [1] (Differentiable Neural Computers) tried to accomplish. Correlating scan metrics with analytics is, by the way, a great task for DNCs, and something they're good at because of how their (not-so-precise) memory works. They're not so good, though, at understanding branch logic and its consequences.
However, I currently believe that forensic investigations will change post-LLMs, because they're very good at translating arbitrary bytecode, assembly, netasm, Intel asm, etc. into example code (in any language). It doesn't have to be 100% correct in those translations; that's why LLMs can be really helpful for the discovery phase after an incident. Check out the Ghidra MCP server, which is insane to see in real time [2].
[1] https://github.com/JoergFranke/ADNC
[2] https://github.com/LaurieWired/GhidraMCP
The lack-of-memory issue is already being solved architecturally, and ARTEMIS is a prime example. Instead of relying on the model's context window (which is "leaky"), they use structured state passed between iterations. It's not a DNC per se, but it is a functional equivalent of long-term memory. The agent remembers it tried an SQL injection an hour ago not because it's in the context, but because it's logged in its knowledge base. This allows for chaining exploits, which used to be the exclusive domain of humans.
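To make the "structured state instead of context window" idea concrete, here's a minimal sketch under my own assumptions (the class and field names are invented; this is not the ARTEMIS code): every attempt gets logged to a knowledge base, and each new iteration receives only a compact summary of that log rather than the full transcript.

```python
# Minimal sketch of "structured state between iterations" -- my own field and
# function names, not the ARTEMIS implementation. Each iteration sees a compact
# summary of prior attempts instead of the full (leaky) conversation transcript.
import json
from dataclasses import dataclass, asdict, field

@dataclass
class Attempt:
    target: str      # e.g. "POST /login, parameter 'user'"
    technique: str   # e.g. "boolean-blind SQL injection"
    outcome: str     # "confirmed", "rejected", or "inconclusive"

@dataclass
class KnowledgeBase:
    attempts: list = field(default_factory=list)

    def record(self, attempt: Attempt) -> None:
        self.attempts.append(attempt)

    def summary(self) -> str:
        # The structured state handed to the model on every iteration.
        return json.dumps([asdict(a) for a in self.attempts], indent=2)

def run_iteration(llm_call, execute_and_validate, kb: KnowledgeBase, objective: str):
    # llm_call and execute_and_validate are stand-ins for the model and tooling.
    prompt = (
        f"Objective: {objective}\n"
        f"Previously attempted (do not repeat):\n{kb.summary()}\n"
        'Propose ONE new test as JSON with keys "target" and "technique".'
    )
    proposal = json.loads(llm_call(prompt))
    outcome = execute_and_validate(proposal)
    kb.record(Attempt(proposal["target"], proposal["technique"], outcome))
```

The design point is that the knowledge base, not the context window, is the source of truth about what has already been tried.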
Can you be more specific about the kind of "bypass attack" you think an agent can't find? Like, provide a schematic example?
Heartbleed in OpenSSL is a good example. Or pretty much any vulnerability that needs an understanding of how memset or malloc works, or anything where you have to use leaky functions to hit a specific offset, because that's where the return address (EIP) sits, so that you can modify/exploit that jmp or cmp call.
These kinds of things are very hard for LLMs because they tend to forget way too much important information about both the code (in the branching sense) and the program (in the memory sense).
I can't provide a schematic for this, but it's pretty common in binary exploitation CTF events, and kind of mandatory knowledge about exploit development.
I listed some nice CTFs we did with our group in case you wanna know more about these things [1]. Regarding LLMs and this bypass/side-channel attack topic, I'd refer to the Fusion CTF [2] specifically, because it covers a lot of examples.
[1] https://cookie.engineer/about/writeups.html
[2] https://exploit.education/fusion/
Wait, I don't understand why Heartbleed is at all hard for an agent loop to uncover. There's a pattern for these attacks (we found one in nginx in the ordinary course of a web app pentest at Matasano --- and we didn't find it based on code, though I don't concede that an LLM would have a hard time uncovering these kinds of issues in code either).
I think people are coming to this with the idea that a pentesting agent is pulling all its knowledge of vulnerabilities and testing patterns out of its model weights. No. The whole idea of a pentesting agent is that the agent code --- human-mediated code that governs the LLM --- encodes a large amount of knowledge about how attacks work.
I think I'd distinguish between source code audits (where LLMs are already pretty good at spotting bugs, if you can convince them to) and exploit development here.
The former is already automated to a large extent by fuzz testing of all kinds, so you wouldn't need an LLM if you knew what you were doing and had a TDD workflow or similar that checks for memory leaks (say, with Valgrind or similar approaches).
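As a hypothetical illustration of that kind of workflow (the binary path is a placeholder, not anyone's actual CI config): a test that shells out to Valgrind and fails the suite if errors are reported, so leak regressions get caught without an LLM anywhere in the loop.

```python
# Hypothetical sketch: run the compiled test binary under Valgrind and fail
# the suite if Valgrind reports errors. "./build/tests" is a placeholder path.
import subprocess

def test_no_memory_leaks():
    result = subprocess.run(
        [
            "valgrind",
            "--leak-check=full",
            "--error-exitcode=1",   # make Valgrind's exit code reflect findings
            "./build/tests",
        ],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, result.stderr
```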
The latter is what I was referring to: I initially had hope that DNCs could help with it, and I'd say that right now LLMs cannot discover this, only repeat and translate it (e.g. similar vulnerabilities discovered in the past by humans in another programming language).
I'm talking specifically about discovery here because transformers lose symbolic inference, and that's why you can't use them for exploit generation. At least I wasn't able to make them work for the DARPA challenges, and had to use an AlphaGo based model combined with a CPPN and some techniques that worked in ES/HyperNEAT.
I suppose what I'm trying to say is that there's a missing understanding of memory and time when it comes to LLMs. And that is usually manually encoded/governed, as you put it, by humans. And I would not count that as an LLM doing it, because you could have just automated the tool use without an LLM and gotten identical results. (Thinking e.g. about an MCP for kernel memory maps or, say, Valgrind or AFL, etc.)
We're talking about different things here. A pentesting agent directly tests running systems. It's a (much) smarter version of Burp Scanner. It's going to find memory disclosure vulnerabilities the same way pentesters do, by stimulus/response testing. You can do code/test fusion to guide stimulus/response, which will make them more efficient, but the limiting factor here isn't whether transformers lose symbolic inference.
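For readers outside the field, a toy sketch of what "stimulus/response testing" looks like from the outside (the endpoint, parameters, and anomaly heuristic are all hypothetical, and a real agent would encode far richer patterns): send a baseline request, send mutated variants, and flag responses that deviate suspiciously.

```python
# Toy stimulus/response sketch -- hypothetical target URL and crude heuristics.
# The point is the shape of the loop, not a production-ready scanner.
import requests

TARGET = "http://localhost:8080/api/echo"   # placeholder
MUTATIONS = [
    {"len": "4", "data": "ping"},            # honest length field (baseline)
    {"len": "65535", "data": "ping"},        # length far larger than payload
    {"len": "-1", "data": "ping"},           # nonsensical length
]

def probe(params):
    resp = requests.get(TARGET, params=params, timeout=5)
    return len(resp.content), resp.status_code

def run():
    baseline_size, _ = probe(MUTATIONS[0])
    for m in MUTATIONS[1:]:
        size, status = probe(m)
        # Crude anomaly signal: the server returned far more data than the
        # baseline, which may indicate it trusted the attacker-supplied length.
        if size > baseline_size * 4:
            print(f"possible over-read with {m}: {size} bytes (status {status})")

if __name__ == "__main__":
    run()
```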
Remember, the competition here is against human penetration testers. Humans are extremely lossy testing agents!
If the threshold you're setting is "LLMs can eradicate memory disclosure bugs by statically analyzing codebases to the point of excluding those vulnerabilities as valid propositions", no, of course that isn't going to happen. But nothing on the table today can do that either! That's not the right metric.
> Humans are extremely lossy testing agents!
Ha, I laughed at that one. I suppose you're right :D
I'm bullish on novel exploits too but I'm much less confident in the prediction. I don't think you can do two network pentests and not immediately reach the conclusion that the need for humans to do significant chunks of that work at all is essentially a failure of automation.
With more specificity: I would not be at all surprised if the "industry standard" netpen was 90%+ agent-mediated by the end of this year. But I also think that within the next 2-3 years, that will be true of web application testing as well, which is in a sense a limited (but important and widespread) instance of "novel vulnerability" discovery.
With exploits, you'll have to go through the rote stuff of checklisting over and over, until you see aberrations across those checklists and connect the dots.
If that part of the job is automated away, I wonder how the talent and skill for finding those exploits will evolve.
They suck at collecting the bounty money because they can't legally own a bank account.
It's not way too early, imo. This is the academic nerds' proof of concept for a school research project; it's not the "group of elite hackers get together and work out a world-class, production-ready system".
Agent platforms have similar modes of failure, whether it's creative writing, coding, web design, hacking, or any other sort of project scaffolding. A lot of recent research has dealt with resolving the underlying gaps in architectures and training processes, and they've had great success.
I fully expect frontier labs to have generalized methodologizing capabilities within the first half of the year, and by the end of the year, the Pro/Max/Heavy variants of the chatbots will have the capabilities baked in fully. Instead of having Codex or Artemis or Claude Code, you can just ask the model to think through and plan your project, whatever the domain, and get professional-class results, as if an expert human were orchestrating the project.
All sorts of complex visual tool use like PCB design and building plans and 3d modeling have similar process abstractions, and the decomposition and specialized task executions are very similar in principle to the generalized skills I mentioned. I think '26 is going to be exciting as hell.
Note that GPT-5 in a standard scaffold (Codex) lost to almost everyone, while in the ARTEMIS scaffold, it won. The key isn't the model itself, but the Triage Module and Sub-agents. Splitting roles into "Supervisor" (manager) and "Worker" (executor) with intermediate validation is the only viable pattern for complex tasks. This is a blueprint for any AI agent, not just in cybersec.
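A minimal sketch of that supervisor/worker/triage split, with stand-in callables for the model and tooling (my simplification, not the paper's code): the supervisor decomposes the objective, each worker runs one narrow task, and nothing reaches the report without an independent validation pass.

```python
# Simplified supervisor/worker/triage pattern -- illustrative only, with
# stand-in callables for the actual LLM and tooling.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Finding:
    title: str
    evidence: str

def supervisor(objective: str, plan: Callable[[str], List[str]]) -> List[str]:
    # Break the engagement objective into narrow, independent tasks.
    return plan(objective)

def worker(task: str, execute: Callable[[str], List[Finding]]) -> List[Finding]:
    # Each worker runs with its own fresh context and returns candidate findings.
    return execute(task)

def triage(candidates: List[Finding],
           reproduce: Callable[[Finding], bool]) -> List[Finding]:
    # Independent validation: only findings that reproduce make the report.
    return [f for f in candidates if reproduce(f)]

def run(objective, plan, execute, reproduce) -> List[Finding]:
    report: List[Finding] = []
    for task in supervisor(objective, plan):
        report.extend(triage(worker(task, execute), reproduce))
    return report
```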
If you can do it by splitting roles explicitly, you can fold it into a unified model too. So "scaffolding advantage" might be a thing now, but I don't expect it to stay that way.
Is this true? I mean it’s true for any specific workflow, but I am not clear it’s true for all workflows - the power set of all workflows exceeds any single architecture, in my mind.
It's not true for all workflows. But many of today's custom workflows are like the magic "let's think step by step" prompt for the early LLMs: low-hanging fruit, set to become redundant as better agentic capabilities are folded into the LLMs themselves.
Think of it in an end-to-end way: produce a ton of examples of final results of supervisor-worker agentic outputs and then train a model to predict those from the original user prompts straight away.
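A rough sketch of what that distillation step could look like (field names and file layout are my assumptions, shown in a standard supervised fine-tuning format): keep only the runs whose output survived validation, and emit (prompt, final output) pairs so a single model can learn to jump straight to the end state.

```python
# Hypothetical sketch: distill supervisor-worker runs into (prompt, completion)
# fine-tuning pairs. Field names and the JSONL layout are assumptions.
import json

def build_distillation_set(run_logs, out_path="distill.jsonl"):
    with open(out_path, "w") as f:
        for run in run_logs:
            if not run.get("validated"):        # keep only runs triage accepted
                continue
            example = {
                "prompt": run["user_prompt"],        # original task, no scaffold
                "completion": run["final_output"],   # the scaffold's end result
            }
            f.write(json.dumps(example) + "\n")

# Usage: build_distillation_set(load_runs("agent_runs/"))  # loader not shown
```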
I work in this space. The productivity gains from LLMs are real, but not in the "replace humans" direction.
Where they shine is the interpretive grunt work: "help me figure out where the auth logic is in this obfuscated blob", "make sense of this minified JS", "what's this weird binary protocol doing", "write me a Frida script to hook these methods and dump these keys". Things that used to mean staring at code for hours or writing throwaway tooling now take a fraction of the time. They're a straight-up playing-field leveler.
Folks with the hacker's mindset but without the programming chops can punch above their weight and find more within the limited time of an engagement.
Sure, they make mistakes and will need a lot of babysitting. But it's getting better. I expect more firms to adopt them as part of their routine.
> The productivity gains from LLMs are real, but not in the "replace humans" direction.
It might be the beer talking, but every time someone comments on AI they have to say something along the lines of "LLMs do help". If I'm being really honest, the fact that everyone has to mention this in every comment and every blog post and every presentation is because deep down nobody is buying it.
"Having the opposing opinion means deep down, you agree with my opinion"
Wow banger of an argument.
In GP's defense:
>>It might be the beer talking, ...
Have you asked anybody who writes exploits full time whether they use LLMs?
Yes and yes. I was surprised, walking around the DefCon CTF room last year, that half the screens were AI chats of some sort.
Heavy use in CTFs doesn't surprise me at all. CTFs often throw curveballs or weird technologies that contestants might not be familiar with. Now you can get a starting point on what's going on, or how something works, instantly from an LLM and it's not a major problem if the LLM is wrong, you may just lose a little time.
In fact: https://wilgibbs.com/blog/defcon-finals-mcp/
Which makes me think: yes, LLMs can solve some of this, but still only some. It's more than a research tool when you combine tools and agentic workflows. I don't see a reason it should slow down.
The article is literally about how much, if at all, AI helps. There are literally only two possible opinions someone can have on the subject: either it does or it doesn't.
I'm not really sure what you are expecting here.
I feel like it's more because the detractors are very loudly against it and the promoters are very loudly exaggerating the capabilities. Meanwhile, as a bystander who is realistic and is actually using it, you have moments where it's absolutely magnificent and insanely useful and other moments where it kinda sucks, which leads to the somewhat reluctant conclusion that:
> The productivity gains from LLMs are real, but not in the "replace humans" direction.
Meanwhile the people who are explicitly on a side either say that there are no productivity gains or that nobody will have jobs in 6 months.
Or maybe they do, but they don't want to get drawn into a totally derailing side conversation about the future of humanity and global warming and it's just a tiny acknowledgement that hey, you can throw an obfuscated blob of minified JavaScript at it and it can take it apart with way less effort from a human, which gets you to the interesting part of the RE question faster than if you had to do it by hand. By all means, don't buy it. I'm not the one getting left behind, however.
It does help A LOT, particularly in the case of security research.
For example, I tended to avoid pen testing freelance work before AI because I didn't enjoy the tedious work of reading tons of documentation about random platforms to try to understand how they worked and searching all over StackOverflow.
Now with LLMs, I can give it some random-looking error message and it can clearly and instantly tell me what the error means at a deep tech level, what engine was used, what version, what library/module... I can pen test platforms I have 0 familiarity with.
I just know a few platforms, engines, programming languages really well and I can use this existing knowledge to try to find parallels in other platforms I've never explored before.
The other day, on HackerOne, I found a pretty bad DoS vulnerability in a platform I'd never looked into before, using an engine and programming language I never used professionally; I found the issue within 1 hour of starting my search.
Did you spend another hour confirming your understanding?
Yes, and at least 30 more minutes to write the report, with the help of an LLM. So it still required my analysis skills, but at least I was able to do it relatively fast... whereas I wouldn't even have considered doing this kind of stuff before, due to the hassle associated with the research...
There are multiple factors which are pulling me into cybersecurity.
Firstly, it requires less effort from me. Secondly, the number of vulnerabilities seems to be growing exponentially... possibly in part because of AI.
Not in the replace humans direction yet?
Maybe in the future, when labs train more specifically on offensive work; lots of hand-holding is needed right now.
Even simple stuff like training the models to recognize when they're stuck and should just go clone a repo or pull up the javadocs instead of hallucinating their way through or trying simple internet searches.
From WSJ article:
> The AI bot trounced all except one of the 10 professional network penetration testers the Stanford researchers had hired to poke and prod, but not actually break into, their engineering network.
Oh, wow!
> Artemis found bugs at lightning speed and it was cheap: It cost just under $60 an hour to run. Ragan says that human pen testers typically charge between $2,000 and $2,500 a day.
Wow, this is great!
> But Artemis wasn’t perfect. About 18% of its bug reports were false positives. It also completely missed an obvious bug that most of the human testers spotted in a webpage.
Oh, hm, did not trounce the professionals, but ok.
False positives on netpens are extremely common, and human netpen people do not generally bill $2k days. Netpen work is relatively low on the totem pole.
(There is enormous variance in what clients actually pay for work; the right thing, I think, to key off of is comp rates for people who actually deliver work.)
As a data point, when I worked in consulting 10+ years ago doing network (internet/ext), web app, mobile, etc., our day rate was a flat $2k AUD for anything we did, and AFAIK we were at or below market cost. I know for sure that the Big Four charged closer to $3,000 for what I understood to be a worse service (I have nothing to back that up apart from occasionally seeing awful reports). We did a not-insubstantial amount of netpen at that rate. Granted, AUD isn't USD, but I wonder what their day rate is now.
My experience of UK pentest rates was that they've stagnated or even gone down over the last 20-25 years.
In the early 2000s, banks were paying ~£1000-£1200/day for pentesters from boutiques, and when I stopped being in that industry ~5 years ago, it was largely the same or even lower for larger companies that could negotiate day rates down. The Big 4 tried to charge more, but that's really tricky when you're in direct competition with boutiques who have more testers than you.
By contrast US rates were a lot higher ($2k+/day) and also scopes were larger. A UK test for a web app could be as low as 3 days (even less for unauthenticated) where the US tended to be 1-2 weeks.
One reason they've gone down is outsourcing to lower cost regions, and I'd guess that LLM/AI automation will accelerate that trend...
Fair, but if you look at most static code analysis tools, they have equal or worse performance with regard to false positives and are still seen as adding value.
If this is inexpensive (in terms of cost/time) it will likely make business sense even with false positives.
But that isn’t the claim. The claim is an agentic pen tester “trounced” human testers. Static analysis tools are already trivial and cheap to automate, why would you need an agent in the loop?
I agree with your point that the claim is exaggerated. My counterpoint is that even if they are subpar, they will still make business sense if they are inexpensive, much in the same way that static code analysis tools aren't great, but because they are inexpensive they still make sense during development.
We cannot consider this report unbiased considering the authors are selling the product.
I don't read a lot of papers, but to me this one seems iffy in spots.
> A1 cost $291.47 ($18.21/hr, or $37,876/year at 40 hours/week). A2 cost $944.07 ($59/hr, $122,720/year). Cost contributors in decreasing order were the sub-agents, supervisor and triage module. *A1 achieved similar vulnerability counts at roughly a quarter the cost of A2*. Given the average U.S. penetration tester earns $125,034/year [Indeed], scaffolds like ARTEMIS are already competitive on cost-to-performance ratio.
The statement about similar vulnerability counts seems like a straight-up lie. A2 found 11 vulnerabilities, with 9 of these being valid. A1 found 11 vulnerabilities, with 6 being valid. Counting invalid vulnerabilities to say the cheaper agent is as good is a weird choice.
Also the scoring is suspect and seems to be tuned specifically to give the AI a boost, heavily relying on severity scores.
Also kinda funny that the AIs were slower than all the human participants.
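For what it's worth, the annualization behind the quoted hourly figures is simple rate arithmetic and does check out; it's the "similar vulnerability counts" framing that's the problem:

```python
# Sanity check of the quoted annualization: hourly rate * 40 h/week * 52 weeks.
for name, hourly in [("A1", 18.21), ("A2", 59.00)]:
    print(name, round(hourly * 40 * 52))  # A1 -> 37877 (~$37,876 as quoted), A2 -> 122720
```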
WSJ always writes in this clickbaity way, and it's constantly getting worse.
An exec is gonna read this and start salivating at the idea of replacing security teams.
We're right in the danger zone where AI isn't good enough to replace you, but it's definitely good enough to convince executives that it can.
they'll still get their bonus, and they dgaf if you don't have a job, because the number of goobers attending online for-profit schools for a "security degree" is endless
The particular kind of work in this report is not what most security teams do at all.
Bootstrapped founder in that field. Fully autonomous is just not there. The winner for this "generation" will be human-in-the-loop / human augmentation, IMO. When the VC money dries up, there will be a pile of autonomous AI pentest companies in it.
Seriously: is this a meaningful distinction?
Yes because all the valuations right now are based on a bet that this will replace a huge chunk of the service/consulting budget toward an AI budget for pentest. This will not happen.
I have no stake in this market, but: human-in-the-loop AI-mediated pentesting will absolutely slaughter billable hours for offensive security talent. Hell, if Fortify and Burp Scanner were actually good, you wouldn't even need the last few years of LLM advancement to accomplish that; the problem is that current automation is not very good. LLM-augmented automation happens, as a weird quirk of fate, to be almost laser-guided at the weaknesses of that technology.
That market's been slaughtered for a while. Pretty much every big tech company has built up strong internal security teams and automated as much as possible. Look up what happened to NCC Group post-Matasano acquisition: I joined within a year of the iSEC/Matasano/Intrepidus acquisitions and saw a slow ride down. After 5 years the rate was still $2,500 a day, and everyone with real talent left for internal teams and much, much higher pay. NCC Group is now a scan shop operating out of the Philippines; I still have one friend who works there from the iSEC days! The exception being a few leet places like Trail of Bits.
Late-period NCC doesn't look great. But I've been a buyer of these services for the past 5 years (a seller, of course, for the 15 years leading up to that) and rates have not gone down; I was shocked at how much we ended up spending compared to what we would have billed out on comparable projects at Matasano.
I don't know enough about the low-end market to rebut you there (though: I saw what my muni paid for a bargain-basement assessment and was not OK with it), but the high end of the market definitely has not been slaughtered, and I definitely think that is coming.
Yes and no. It will kill the "I ran a Nessus scan and charged you 8k for it" kind of pentest, but not the core of the service market, IMO. Pentesters will be more efficient, so I guess this could be considered a slash in the hourly rate if they kept the same pace. LLMs are good at getting signals, but at actual hacking they're still meh.
Juniors will have a hard time, that I agree with. The current level of LLM findings is at their level.
I disagree with you about the first paragraph, but have to say that, specifically in the security and services markets, you can't say "juniors will have a hard time of it" without also saying "this is going to fundamentally disrupt services budgets". The two statements mean the same thing.
Do you think they could move toward other technologies if they show maturity in that sector that AI cannot provide?
I'm currently on the tail end of building out an agentic hacking framework; I wanted to learn the best practices of building agents (I have an SDK with memory (short/med/long), knowledge graph/RAG, tools, and plugins that makes it easy to develop new agents the orchestrator can coordinate).
I also wanted to capture what's in my head from doing bug bounties (my hobby) and 15+ years in appsec/devsecops to get it "on paper". If anyone would like to kick the tires, take a look, or tell me it's garbage feel free to email me (in my profile).
Do I read it right, that ARTEMIS required a not insignificant amount of hints in order to identify the same vulnerabilities that the human testers found? (P. 7 of the PDF.)
Pen testing and cyber security in general shares characteristics with some other fields in which AI performs well compared to humans: it requires mastery of a body of knowledge that's barely manageable by humans. Law, medicine, and other professions where we send people to graduate school to get good at unnatural mental tasks are similar.
I'd actually argue it's the opposite. Pentesting (especially at a basic level) is very repetitive and does not require much knowledge.
> Pentesting (especially at a basic level) is very repetitive and does not require much knowledge.
load framework, run scripts, copy-paste screenshots, give presentation.
the juniors aren't doing scoping calls and follow-ups, unless the top-kick needs explanations
I think we're sort of saying the same thing. A solo practice lawyer with a Westlaw form book would be the equivalent of your basic pen tester.
With this model, the 'Security researcher' becomes a middleman between AI agents, tech companies and hackers. We need a new term; 'Cybersecurity broker.'
Already taken, by people who broker exploits.
Taken both in name and role, more or less.
18 dollars an hour is quite steep considering LLMs are loss leaders.
I wouldn't be surprised if they get near cost parity. Maybe a 20% difference.
You can give an agent access to RevEng tools, spend $1k on API calls, and be no better off.
so how much of a factor is it that safety guardrails may be keeping the current models from achieving higher scores in whatever red teaming benchmarks exist?
Sounds like they need another agent to detect false positives (I joke, I joke)
You joke, but that's a very real approach that AI pentesting companies do take: an agent that creates reports, and an agent that 'validates' reports with 'fresh context' and a different system prompt that attempts to reproduce the vulnerability based on the report details.
*Edit: the paper seems to suggest they had a 'Triager' for vulnerability verification, and obviously that didn't catch all the false positives either, ha.
Can't be any worse than Fortify was!
At my first job, all the applications the data people developed were compulsorily evaluated through Fortify (I assume this is HP Fortify) and to this day I have no idea what the security team was actually doing with the product, or what the product does. All I know is that they never changed anything even though we were mostly fresh grads and were certainly shipping total garbage.
It's like, when you say agents will largely be relegated to "triage" --- well, a pretty surprising amount of nuts and bolts infosec work is basically just triage!