Cloudflare published details on how it scaled an internal security-review skill into a full autonomous pipeline. The team started with a single skill that ran security scans against repositories, but quickly hit scaling limits — context window constraints and difficulty understanding cross-repository relationships.
They restructured the skill's seven steps into an autonomous pipeline split across two processes. The first handles discovery: three agents collaborate to read the codebase, map its architecture and relationships, and then try to find and break things. Findings are validated and deduplicated before a report is produced. The second process handles validation and remediation: a separate model (not just a separate session) attempts to confirm or disprove each finding, after which a fix is generated, tested, and opened as a PR for human review.
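To make the two-process split concrete, here is a minimal Python sketch of the control flow described above. It is not Cloudflare's implementation. Every name in it (`Finding`, `discover`, `validate_and_remediate`, and the injected callables `mapper`, `hunter`, `checker`, `adversary`, `fixer`, `tests_pass`, `open_pr`) is hypothetical. The division of the three discovery agents into a mapper, a hunter, and a checker is an assumption, as is the choice to generate fixes only for findings the second model confirms.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass(frozen=True)
class Finding:
    """One candidate issue. The fields are illustrative, not from the source."""
    repo: str
    location: str
    title: str


def dedupe(findings: Iterable[Finding]) -> List[Finding]:
    """Drop exact duplicates while preserving first-seen order."""
    seen = set()
    unique = []
    for finding in findings:
        if finding not in seen:  # frozen dataclasses are hashable
            seen.add(finding)
            unique.append(finding)
    return unique


def discover(
    repo: str,
    mapper: Callable[[str], Any],
    hunter: Callable[[str, Any], Iterable[Finding]],
    checker: Callable[[Finding], bool],
) -> List[Finding]:
    """Process 1 (assumed roles): map the codebase, hunt for issues,
    self-check the candidates, then deduplicate into a report."""
    architecture = mapper(repo)
    candidates = hunter(repo, architecture)
    plausible = [f for f in candidates if checker(f)]
    return dedupe(plausible)


def validate_and_remediate(
    findings: Iterable[Finding],
    adversary: Callable[[Finding], bool],
    fixer: Callable[[Finding], str],
    tests_pass: Callable[[str, str], bool],
    open_pr: Callable[[str, str], str],
) -> List[str]:
    """Process 2: a different model tries to disprove each finding.
    Confirmed findings get a patch, and a PR is opened only if tests pass.
    Nothing is merged here; the PR is the human review gate."""
    pr_refs = []
    for finding in findings:
        if not adversary(finding):  # disproved, so drop it
            continue
        patch = fixer(finding)
        if tests_pass(finding.repo, patch):
            pr_refs.append(open_pr(finding.repo, patch))
    return pr_refs
```

Passing the agents in as plain callables keeps the sketch independent of any particular model API, and it makes the key design choice visible: `adversary` is a separate function from `checker`, so it can be backed by a different model rather than a second session of the same one.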
Across more than 100 repositories, the pipeline produced over 20,000 raw findings. After validation, about 13,800 survived; roughly 5,000 were deduplicated and 1,100 were downgraded as low-risk, leaving about 7,200 findings routed to engineering and security teams. The final breakdown was 41 critical and 777 high, with the remainder rated medium or low.
The structure — separate discovery and validation stages, a different model for adversarial review, and a human-in-the-loop gate before any fix merges — is a useful reference architecture for any team trying to move AI-assisted security review beyond a single-shot skill.