Autonomous Offensive Security

One Command. Full Pentest.

PentestPilot is an automated web application security testing platform built for professional penetration testers and bug bounty hunters. Point it at a target and it runs a full engagement autonomously — from authentication and reconnaissance through exploitation, AI-powered validation, and final reporting.

26 integrated security tools. 11-phase pipeline. Multi-agent AI validation that debates every finding before it reaches your report. No interaction required between start and finish.

Every finding ships with reproduction steps, evidence bundles, and OWASP WSTG coverage mapping. The system argues with itself so you don't have to argue with your client.


$ pentestpilot scan https://target.com --full

[00:00] Scan started — target: target.com

[00:14] AuthLadder: 3 sessions captured (admin, user, readonly)

[01:22] Recon: Katana + GAU + JS analysis → 247 endpoints

[02:15] Discovery: 89 injection vectors across 17 types

[03:41] SQLi confirmed: /api/users?id=1 (error-based, schema dumped)

[04:12] SSTI confirmed: /render?template= (Jinja2, RCE possible)

[06:30] Validation: 5-stage pipeline — 12/14 findings confirmed

[08:15] OWASP sweep: 109 WSTG test IDs evaluated, 3 gaps filled

[12:44] 29 specialists deployed, consensus reached on 8 criticals

[16:20] Expansion: admin session → +47 endpoints, 3 additional vulns

[18:42] Report complete — 15 findings, evidence bundles attached

[18:42] Scan finished in 18m 42s

26 Security Tools

11 Pipeline Phases

109 WSTG Test IDs

5 Validation Stages

Real Output From a Real Scan

These are actual screenshots from a completed scan. Not mockups.

Scan Overview

31 validated findings, 84 AI validation runs, severity breakdown with priority findings. One-click access to findings, endpoints, AI traces, and WSTG coverage.

Findings List

Every finding tagged with severity, AI VERIFIED status, and HTTP method. Filter by severity or validation state.

AI Validation Trace

84 total validations. 44 confirmed, 40 rejected. Every decision logged with confidence score and reasoning chain.


WSTG Coverage Matrix

Every WSTG test ID tracked across 12 categories. Green = tested. Coverage percentage per category. Gap analysis built in.

Attack Surface Graph

Visual exploit chain mapping. Critical attack paths, potential exposure scoring, remediation priority. Interactive full-screen view.

Agent Tool Registry

28 tools across 6 categories. Each specialist agent gets tools from its assigned categories. Toggle tools on/off per scan.

11-Phase Autonomous Pipeline

From credential exhaustion to recursive post-compromise expansion. No human interaction between start and report.

AuthLadder

8-step credential exhaustion, including seeded creds, a credential store, an LLM login agent (Playwright), OSINT harvesting, default creds, self-registration, and a full spray. Every attempt is followed by post-login verification.

Reconnaissance

Katana SPA crawl, GAU historical URLs, Gobuster brute-force, EyeWitness screenshots, tech fingerprinting, JS AST parsing with framework-specific route extraction.

URL Collection

Canonicalized URL mapping. Merges crawled, JS-extracted, SPA, and Wayback endpoints. Arjun hidden parameter discovery. Normalized deduplication.
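As a rough illustration of what canonicalization and normalized deduplication can look like, here is a minimal sketch using only the Python standard library. The function names `canonicalize` and `merge_sources` are hypothetical, and the normalization rules (lowercased host, default ports dropped, trailing slash stripped, query parameters sorted, fragment removed) are assumptions rather than PentestPilot's actual rules. Stripping trailing slashes in particular can merge endpoints that some servers treat as distinct.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def canonicalize(url: str) -> str:
    """Normalize a URL so equivalent endpoints compare equal (illustrative rules)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    # Sort query parameters so ordering differences do not create duplicates.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


def merge_sources(*sources):
    """Merge URL lists from several discovery sources into one deduplicated list."""
    seen = {}
    for source in sources:
        for url in source:
            seen.setdefault(canonicalize(url), url)
    return sorted(seen)


crawled = ["https://target.com/a?x=1&y=2"]
wayback = ["https://TARGET.com/a/?y=2&x=1", "https://target.com/b"]
endpoints = merge_sources(crawled, wayback)
```

Here the two spellings of `/a` collapse into a single entry, so `endpoints` holds two URLs instead of three.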

Injection Discovery

17+ injection vector types catalogued. Risk metadata per parameter. Reflected parameter detection with differential response analysis.

Classification

LLM + heuristic risk scoring, WSTG category mapping. Authenticated re-crawl discovers protected endpoints and feeds them back through discovery.

Active Testing

SQLMap, Dalfox, Nuclei, SSTImap, LFImap, SSRFmap, XXEinjector, ZAP, Playwright DOM XSS, BOLA replay, deserialization. Failed-tool fallback pass.

AI Validation

5-stage pipeline: protected findings gate, definitive evidence signals, LangGraph ReAct replay, adversarial review, confidence blending.

OWASP Sweep

12 WSTG categories, 109 test IDs. Dedicated agent per category. Coverage gap analysis identifies untested controls and schedules targeted tests.
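Coverage gap analysis can be thought of as a set difference between the catalog of test IDs and the IDs a scan has actually exercised. The sketch below assumes that framing; `coverage_gaps`, the category keys, and the tiny five-ID catalog are illustrative stand-ins, not the platform's real data model or the full 109-ID catalog.

```python
def coverage_gaps(catalog, tested):
    """Return per-category coverage percentage and the untested WSTG IDs."""
    report = {}
    for category, test_ids in catalog.items():
        missing = sorted(set(test_ids) - tested)
        covered = len(test_ids) - len(missing)
        report[category] = {
            "coverage_pct": round(100 * covered / len(test_ids), 1),
            "gaps": missing,  # candidates for targeted follow-up tests
        }
    return report


# Tiny illustrative subset of WSTG IDs, assumed for the example.
catalog = {
    "INPV": ["WSTG-INPV-01", "WSTG-INPV-02", "WSTG-INPV-05"],
    "ATHN": ["WSTG-ATHN-01", "WSTG-ATHN-02"],
}
tested = {"WSTG-INPV-01", "WSTG-INPV-05", "WSTG-ATHN-01", "WSTG-ATHN-02"}
report = coverage_gaps(catalog, tested)
```

Each entry in `gaps` is an untested control that a sweep of this kind would schedule for a targeted test.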

AI Orchestrator

29 specialist agents deployed. Multi-round debates, cross-specialist challenges, adversarial refutation, pentest judge loop. PBFT consensus on high-severity findings.

Expansion

Recursive post-compromise loop. Each captured session triggers authenticated recon, delta computation, and a fresh testing cycle. Up to 3 iterations.
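The loop described above can be sketched as a bounded iteration in which each session's authenticated recon is diffed against the known attack surface. This is a simplified sketch under stated assumptions: `expand`, `recon`, and `test` are hypothetical names, and the two callables stand in for the real recon and testing phases.

```python
from typing import Callable, Iterable, Set

MAX_ITERATIONS = 3  # matches the "up to 3 iterations" limit stated above


def expand(known: Set[str], sessions: Iterable[str],
           recon: Callable[[str], Set[str]],
           test: Callable[[Set[str]], Iterable[str]]) -> Set[str]:
    """Grow the known endpoint set from captured sessions, at most 3 rounds."""
    known = set(known)  # work on a copy so the caller's set is untouched
    pending = list(sessions)
    for _ in range(MAX_ITERATIONS):
        if not pending:
            break
        next_round = []
        for session in pending:
            delta = recon(session) - known  # only endpoints not seen before
            if not delta:
                continue
            known |= delta
            # Testing the delta may capture further sessions for the next round.
            next_round.extend(test(delta))
        pending = next_round
    return known


# Stand-in data: what each session can see, and a test phase that
# "captures" an admin session once /b has been tested.
surface = {"user": {"/a", "/b"}, "admin": {"/a", "/admin", "/admin/users"}}
result = expand({"/a"}, ["user"],
                recon=lambda s: surface.get(s, set()),
                test=lambda delta: ["admin"] if "/b" in delta else [])
```

Computing the delta first means only newly visible endpoints go through a fresh testing cycle, and the iteration cap keeps the recursion bounded.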

Reporting

Executive summary, reproduction steps, evidence bundles (curl scripts, payloads, screenshots), WSTG coverage matrix, severity-ranked aggregation.

Under the Hood

Not an LLM wrapper. A distributed AI system that debates, validates, and learns across scans.

Multi-Agent Consensus

Multiple specialist agents with different biases independently evaluate each finding. No single agent can confirm or reject a vulnerability unilaterally.

PBFT-inspired 4-phase protocol with 2f+1 quorum
Weighted voting based on agent track record accuracy
Adversarial agent actively tries to disprove every finding
Devil's advocate agent proposes alternative explanations
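To make the quorum arithmetic concrete: with n agents, a PBFT-style scheme tolerates f faulty voters where n >= 3f + 1, and a decision needs 2f + 1 matching votes. The sketch below assumes that rule and reports an accuracy-weighted confidence alongside the verdict. The `Vote` type, the `accuracy` weights, and the way the count-based quorum is combined with the weighted score are assumptions made for illustration, not the platform's protocol.

```python
from dataclasses import dataclass


@dataclass
class Vote:
    agent: str
    confirms: bool
    accuracy: float  # hypothetical track-record weight in [0, 1]


def quorum_size(n_agents: int) -> int:
    """Return 2f+1, where f is the largest fault count with n >= 3f+1."""
    f = (n_agents - 1) // 3
    return 2 * f + 1


def decide(votes):
    """Return (verdict, accuracy-weighted share of confirming votes)."""
    quorum = quorum_size(len(votes))
    yes = [v for v in votes if v.confirms]
    no = [v for v in votes if not v.confirms]
    total = sum(v.accuracy for v in votes)
    weighted_yes = sum(v.accuracy for v in yes) / total if total else 0.0
    if len(yes) >= quorum:
        verdict = "confirmed"
    elif len(no) >= quorum:
        verdict = "rejected"
    else:
        verdict = "undecided"  # no side reached quorum; debate would continue
    return verdict, round(weighted_yes, 3)


votes = [
    Vote("injection_specialist", True, 0.9),
    Vote("auth_specialist", True, 0.8),
    Vote("adversarial", False, 0.6),
    Vote("devils_advocate", True, 0.7),
]
verdict, confidence = decide(votes)
```

With four agents, f is 1 and the quorum is 3, so a single dissenting agent can neither confirm nor reject a finding on its own.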

Attack Chain Reasoning

Causal graph models how vulnerabilities chain together, with strength scores and conditions. Edge types: ENABLES, AMPLIFIES, WEAKENS_DEFENSE.

XSS + no HttpOnly → session hijacking (0.9 strength)
SQLi + auth query → authentication bypass (0.95 strength)
Counterfactual analysis: "If CSP is strict, XSS chain breaks"
Impact propagation across full exploit graph
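One way to picture the causal graph is as a list of typed, weighted edges whose activation depends on facts about the target. The sketch below makes several assumptions: path strength is the product of edge strengths, preconditions are modeled with hypothetical `requires` and `blocked_by` fact sets, and the `session_hijack` to `account_takeover` edge is invented purely to show a two-hop chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    kind: str                            # ENABLES, AMPLIFIES, or WEAKENS_DEFENSE
    strength: float
    requires: frozenset = frozenset()    # facts that must hold (assumed model)
    blocked_by: frozenset = frozenset()  # facts that break the edge (assumed model)


EDGES = [
    Edge("xss", "session_hijack", "ENABLES", 0.9,
         requires=frozenset({"no_httponly"}), blocked_by=frozenset({"strict_csp"})),
    Edge("sqli", "auth_bypass", "ENABLES", 0.95),
    # Hypothetical edge, added only to demonstrate propagation over two hops.
    Edge("session_hijack", "account_takeover", "ENABLES", 0.8),
]


def chain_strength(start, goal, facts, edges=EDGES):
    """Best product of edge strengths over active paths; 0.0 if none exists."""
    best = 0.0
    stack = [(start, 1.0, {start})]
    while stack:
        node, strength, seen = stack.pop()
        if node == goal:
            best = max(best, strength)
            continue
        for e in edges:
            active = e.requires <= facts and not (e.blocked_by & facts)
            if e.src == node and active and e.dst not in seen:
                stack.append((e.dst, strength * e.strength, seen | {e.dst}))
    return best


baseline = chain_strength("xss", "account_takeover", {"no_httponly"})
with_csp = chain_strength("xss", "account_takeover", {"no_httponly", "strict_csp"})
```

Re-running the same query with `strict_csp` added to the facts is the counterfactual: the XSS edge deactivates and the chain strength drops to zero.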

Knowledge Graph RAG

NetworkX graph (endpoints, parameters, vulnerabilities, sessions) fused with vector embeddings. Retrieval weights adapt per query type via Reciprocal Rank Fusion.

SITEMAP queries: graph-only (structural traversal)
PAYLOAD queries: vector-heavy (semantic similarity)
VULNERABILITY queries: 50/50 hybrid fusion
Graph edges: HAS_PARAM, VULNERABLE_TO, CHAINS_TO
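Weighted Reciprocal Rank Fusion scores each candidate as the sum, over rankers, of weight / (k + rank). The sketch below assumes that formula with the conventional k = 60. The function name `fuse` is hypothetical, and the 0.2/0.8 split used for "vector-heavy" PAYLOAD queries is an assumption, since only the graph-only and 50/50 cases are stated above.

```python
RRF_K = 60  # conventional RRF smoothing constant

# (graph weight, vector weight) per query type.
WEIGHTS = {
    "SITEMAP": (1.0, 0.0),        # graph-only
    "PAYLOAD": (0.2, 0.8),        # "vector-heavy": exact split is assumed
    "VULNERABILITY": (0.5, 0.5),  # 50/50 hybrid
}


def fuse(query_type, graph_ranked, vector_ranked, k=RRF_K):
    """Fuse two ranked lists with query-type-dependent RRF weights."""
    w_graph, w_vec = WEIGHTS[query_type]
    scores = {}
    for weight, ranking in ((w_graph, graph_ranked), (w_vec, vector_ranked)):
        if weight == 0:
            continue  # a zero-weight ranker contributes no candidates
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


graph_hits = ["/login", "/api/users", "/render"]
vector_hits = ["/render", "/api/users", "/search"]
sitemap = fuse("SITEMAP", graph_hits, vector_hits)
payload = fuse("PAYLOAD", graph_hits, vector_hits)
hybrid = fuse("VULNERABILITY", graph_hits, vector_hits)
```

Because RRF works on ranks rather than raw scores, the graph traversal and the embedding search do not need comparable scoring scales.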

Cross-Scan Learning

Confirmed findings become learning events. Successful payloads seed future tests. Debate outcomes feed back into the RAG system. The platform gets better with each scan.

Hypothesis store: verified + refuted patterns persist
Payload mutation: working payloads seed future fuzzing
Cross-domain: "jQuery XSS on site A, test on site B"
Stall detection: MD5 fingerprinting catches reasoning loops
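Stall detection by fingerprinting can be sketched as hashing each normalized reasoning step and counting repeats inside a sliding window. The `StallDetector` class, the window size, the repeat threshold, and the normalization (lowercasing and whitespace collapsing) are all assumptions for illustration. MD5 is used here only as a cheap fingerprint, not for security.

```python
import hashlib
from collections import deque


class StallDetector:
    """Flag reasoning loops by fingerprinting recent steps (illustrative)."""

    def __init__(self, window=5, max_repeats=2):
        self.recent = deque(maxlen=window)  # fingerprints of the last N steps
        self.max_repeats = max_repeats

    @staticmethod
    def fingerprint(step: str) -> str:
        # Collapse case and spacing so cosmetically different repeats match.
        normalized = " ".join(step.lower().split())
        return hashlib.md5(normalized.encode("utf-8"),
                           usedforsecurity=False).hexdigest()

    def observe(self, step: str) -> bool:
        """Record a step; return True if it already recurs too often."""
        digest = self.fingerprint(step)
        repeats = sum(1 for d in self.recent if d == digest)
        self.recent.append(digest)
        return repeats >= self.max_repeats


detector = StallDetector(window=5, max_repeats=2)
flags = [
    detector.observe("Retry SQLi on /api/users"),
    detector.observe("retry  sqli on /api/users"),
    detector.observe("Try XSS on /search"),
    detector.observe("RETRY SQLi on /api/users"),
]
```

The third occurrence of the same step within the window trips the detector, at which point an orchestrator could break out of the loop or change strategy.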

How Findings Are Validated

Every finding passes through a multi-stage gauntlet. Once confirmed, it can never be downgraded by later analysis.

Protected Findings Gate: WSTG-backed rules block invalid AI rejections at the boundary

Definitive Evidence Signals: Command output, OAST callbacks, status code differentials bypass debate

ReAct Agent Replay: LangGraph agent with tool access re-executes the finding independently

Direct Binary Verdict: Separate model instance delivers a clean true/false with justification

Adversarial Refutation: Red team agent proposes alternative explanations for every positive

Multi-Agent Consensus: High-severity findings require 2f+1 quorum from biased specialist agents

Monotonic Confirmation: Once ai_verified = true, no subsequent analysis can downgrade it
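The monotonic rule amounts to a sticky flag: later stages can add to the record but cannot flip a confirmed finding back. A minimal sketch, assuming a hypothetical `Finding` record whose `ai_verified` field mirrors the flag named above; the stage names and the `apply_verdict` method are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    title: str
    ai_verified: bool = False
    history: list = field(default_factory=list)

    def apply_verdict(self, stage: str, verdict: bool) -> bool:
        """Record a stage verdict; once confirmed, the flag never reverts."""
        self.history.append((stage, verdict))
        # Logical OR makes confirmation monotonic: True can never become False.
        self.ai_verified = self.ai_verified or verdict
        return self.ai_verified


finding = Finding("SQLi on /api/users?id=1")
finding.apply_verdict("react_replay", True)
finding.apply_verdict("adversarial_refutation", False)
```

The dissenting verdict is still kept in `history` for the audit trail, but `finding.ai_verified` stays true.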

26 Tools, Orchestrated

Go binaries are compiled from source (Go 1.25, CGO-enabled). Python and Ruby tools are git-cloned at build time. Every tool has a typed integration wrapper with structured output parsing.

Go Binaries (compiled from source)

Dalfox

Nuclei

Interactsh

ffuf

GAU

dnsx

Naabu

Katana

Webanalyze

Gobuster

Waybackurls

GoWitness

Python / Ruby Tools (git-cloned at build)

SQLMap

SSTImap

LFImap

SSRFmap

XXEinjector

Liffy

EyeWitness

Framework Integrations

OWASP ZAP

Playwright

Retire.js

Arjun

Nmap

Exploitation

John the Ripper

ysoserial

phpggc