Auditooor: An AI Security Audit Stack

Auditooor is my smart-contract security research stack. Its contract is simple: an audit workspace goes in; ranked detector output, mining briefs, proof artifacts, submission gates, operational memory and retrospectives come out. One operator, one harness, a graded evidence package at the end. This post follows a single run end to end - what each stage does, what it emits, and what it refuses to certify.

[ 01 / Pipeline ]

Three commands in, eight stages through

From cold checkout to a working audit is three commands:

python3 tools/workspace-bootstrap.py --engage-stubs ~/audits/<project>
bash tools/auditooor-session-start.sh ~/audits/<project>
make audit WS=~/audits/<project>

Bootstrap scaffolds the workspace, the session script loads memory recall, and make audit starts stage one. From there, a single driver carries the run all the way to a verdict:

make audit-pipeline-full WS=<workspace>
# 1 audit -> 2 audit-deep -> 3 audit-depth -> 4 hunt -> 5 chain-synth
# -> 6 prove-top-leads -> 7 exploit-conversion-loop -> 8 audit-complete STRICT=1

Eight numbered stages in a fixed order, split four and four. Stages 1 through 3 and stage 8 are deterministic: they always run, and they fail hard on error. Stages 4 through 7 are LLM stages, and they never block the pipeline - run without network consent, each one writes a typed obligation record, hunt_provider_obligation.json, and the run continues. The final gate still executes and reports exactly which signal is missing.

8 numbered stages in one end-to-end run: four deterministic and fail-hard, four consent-gated

~3,827 active detector modules across 20+ language and proof-system families

62 records in the machine-readable rule inventory, including 50 R-rules and 7 L-lessons

25 named completeness signals the final gate requires before it will say "genuinely audited"

All figures counted from the live repo with find, grep and wc - not estimated.

[ 02 / Orient ]

Stage 1: orient

make audit is a multi-stage orient pass - the operator playbook counts 38+ internal stages. It validates the scope files, runs the detector surface, then clusters raw hits by attack class and call site. That clustering turns a flat list of warnings into attacker hypotheses.

The detector surface is the wide end of the funnel: roughly 3,800 active modules. On the Solidity side, about 2,843 files define a Slither detector subclass, auto-compiled from 1,694 YAML pattern specs - one YAML in, one detector out. The Rust side and the other languages are hand-written tree-sitter scans, and a parity gate keeps the two construction styles aligned by canonical bug class. Coverage spans Solidity and EVM, Rust (about 690 modules), Go, Solana, Move and Vyper, plus a full zero-knowledge column: Circom, Halo2, Noir, Plonky2, Plonky3, Cairo, gnark, arkworks, bellperson, RISC Zero, PIL and a generic zkvm wave. The same pipeline audits EVM contracts, Go chains, Rust and Solana programs, Move, Vyper and ZK circuits.

Stage 1 emits the orient package: an engage report (JSON, with a markdown fallback), INTAKE_BASELINE.md, PRIOR_CONCERNS.md - the acknowledged and out-of-scope clusters the run must not refile - per-cluster dispatch briefs, a memory receipt, and an advisory hacker-logic bridge that writes action graphs plus a proof-obligation queue.

[ 03 / Deepen ]

Stage 2: deepen

make audit-deep auto-detects the workspace language and routes accordingly. Solidity gets Slither aggregation plus three real symbolic-execution and fuzzing engines run live: halmos, medusa and echidna. These are actual third-party tools the stack provisions for itself - halmos 0.3.3 in a pip venv, medusa 1.5.1 and echidna 2.3.2 from GitHub releases - and executes against the workspace. Go routes to a Go deep-engine, Rust to a proptest deep-engine plus a zkvm detection pass for proof-system code, and mixed-language workspaces run both.

The same stage builds an invariant ledger with per-function invariants, runs a depth-probe, and bridges its results into the hunt sidecar, so every later stage starts from machine-checked structure.

[ 04 / Depth ]

Stage 3: the depth layer

This is the stage I consider the flagship. make audit-depth enumerates every guard and validation in the in-scope source and asks the question static analysis cannot answer: what does this guard NOT check? Can an input pass it and still violate the invariant it protects?

A negative-space analyzer answers that per guard. A sibling-path differ sweeps for asymmetries between parallel code paths - the pattern where one path validates a condition its sibling forgot - anchored on a real asymmetry between two sibling validation paths in production code. Compact per-guard probe packets go out, and every verdict folds into a single depth_certificate.json, written by exactly one producer.

The certificate is honest by construction. It can only read depth-audited when the negative-space pass ran, the sibling diff ran, every enumerated guard has a probe verdict, and every candidate gap has a disposition: a drafted finding or a source-cited drop. Mechanical-only runs produce depth-pending; nothing run produces depth-not-run; and the producer never upgrades a pending workspace to audited, no matter how green everything else looks.

0 findings is a SMELL, not success.

That line is doctrine, straight from the certificate builder. A silent detector pass is not evidence of absence. The depth layer exists to measure whether the silence was earned.

[ 05 / Hunt ]

Stages 4 to 7: hunt and convert

The four consent-gated stages spend LLM budget only on the surface the deterministic passes left unresolved. The hunt routes per-function work to Haiku through the orchestrator: it generates batch plans, then writes obligation records telling the orchestrator to dispatch each batch as an agent. The scoped hunt defaults to ranked per-function questions, not the full 2,007-question generic corpus.

Chain-synth assembles multi-step exploit chains out of individual leads. Prove-top-leads mines source and builds a proof-task queue. The exploit-conversion loop turns the strongest leads into harness and proof work. And when any of these stages runs without network consent, it writes its typed obligation and steps aside; the pipeline finishes, and the final gate counts the missing signal against the verdict.

The machine records what it could not do rather than silently skipping it.

[ 06 / Evidence ]

The evidence standard

Downstream of every stage sits the doctrine "no runnable PoC, no finding," and it is an evidence standard, not an obstacle. The README's safety rules are explicit: a green run means "no verified finding from this run," not "no bugs." Advisory, planned, scaffolded and memory outputs are not submission evidence until the proof gate executes.

Before anything is filed, the system requires memory recall evidence, scope and out-of-scope clearance, an explicit impact contract, the strongest in-scope escalation attempt, a runnable PoC or fork-replay or live or source proof, a self-contained paste with the PoC code inline, duplicate checks, and a clean run of the pre-submit gate itself.

That gate, pre-submit-check.sh, is around 10,761 lines and runs roughly 111 numbered checks; the highest check number is 135. Each check name encodes a specific lesson from real audit competitions:

Rubric verbatim match or drop. A draft either matches a row in the program's severity rubric or it does not exist.
Non-self impact. Claimed fund loss must hit funds the attacker does not control.
In-process vs node-level. Production-grade claims need a real node surface, not a microbenchmark.
Production-profile PoC. A real database backend on disk, no in-memory stand-ins, no timing shims.
Missing-guard call-site enumeration and duplicate preflight, so one report covers the whole family and never collides with what is already filed.

[ 07 / Rules ]

Lessons, compiled into gates

Behind the checks sits the rule inventory: 62 machine-readable records, among them 50 R-rules and 7 L-lessons. Each record is a first-class row carrying a rule id, a mechanical check number, a tool path, an override marker, a trigger condition, an empirical anchor, a fail-closed flag and a severity floor.

Every rule is two things at once: a hard gate inside the pre-submit pipeline, and a standalone tool I can run on its own. And every rule carries an honest-walk-back escape hatch - a short inline rebuttal marker, capped at 200 characters. The gate blocks over-claims; it never blocks a truthful narrower claim. A rule can be overridden, but never silently.

Every rule is a past lesson compiled into a gate, and every gate has an honest escape hatch.

[ 08 / Verdict ]

Stage 8: the verdict

make audit-complete STRICT=1 is the binding definition of "genuinely audited." The completeness checker - around 6,946 lines on its own - evaluates 25 named pipeline signals: tier-6 commit mining, hunt-complete, live engines, engine-harness proof, the exploit queue, chain-synth, exploit conversion, prove-top-leads, originality, fork divergence, novel-vector work, the adversarial panel, coverage maps, rubric coverage, and function plus cross-function coverage. pass-audit-complete requires every signal present; each gap maps to a specific named fail verdict; STRICT mode fail-closes.

An anti-self-deception layer sits on top. The honesty checker was born from an audit-the-audit pass that caught workspaces reporting fake green: coverage reported as 100 percent that was really about 5.6 percent because budget-skipped units had been counted as covered, and engine runs that were really assert(true) stubs or silent engine errors. It recomputes the honest coverage number and flags hollow or mocked engine execution, with verdicts like pass-genuinely-audited, fail-fake-coverage, fail-hollow-engines and fail-mock-target.

This is the property I would point to first: the stack audits its own honesty, and it refuses to certify a workspace it cannot prove it audited.

[ 09 / Memory ]

Memory with receipts

Under all eight stages sits a queryable cross-session memory graph, not a notes folder: roughly 10,004 markdown notes across 22 categories - patterns, detectors, findings, workspaces, mining, calibration, anti-patterns, agent memory and more. It is served read-mostly over MCP by a server that registers 122 vault callables (6 of them deprecated, each with a named replacement and a cool-off period). Durable writes go through exactly one narrow path: a scoped session token, then a single remember call.

Agents query memory before they read source. Bounded resume, exploit, harness and knowledge-gap context packs are the first thing a human or a model sees on a workspace. And every recall returns a content-addressed context_pack_id and context_pack_hash that must be cited in every commit, worker brief and submission, so the memory the system actually used is traceable and verifiable rather than a black box. The boundary is explicit too: the MCP is not proof, not CI, and not a replacement for reading source. It answers "what should I know before I start or resume this audit?"

[ 10 / Models ]

Four models, measured roles

Multi-model orchestration is role-routed, not a free-for-all. Kimi handles bounded source and spec extraction and long-context comparison. MiniMax handles adversarial kill-review and false-positive pressure. Claude handles implementation, harnesses and docs. Codex sits at the end as the skeptical final promotion gate, and Haiku takes the scoped per-function hunt. No provider has severity or submission authority: everything any model produces is advisory until local source, scope, PoC and gate checks pass.

Routing is calibration-driven and empirical. Every task type carries a status - route-ok, advisory-only, do-not-route or no-data - based on measured true-positive rates over recorded verdicts. Source extraction routes to Kimi at 23 of 23 true. Adversarial kill-review routes to MiniMax at 45 of 53. Gap-finding is do-not-route, because both candidates fell below the 70 percent floor. Incident history lives inline in the manifest: a model once generated 91 fake passing fixtures that all smoke-passed, and that single incident is why fixture-repair task types are pinned no-data. Provider lessons are treated as model-specific - after a model upgrade, the system demands a fresh calibration slice before it trusts the old routing - and default model ids are resolved centrally, surfaced in a capacity report alongside budget ceilings and per-provider recommendation flags.

[ 11 / Output ]

What one run hands me

Two more capabilities thread through every run. Tier-6 commit mining works the target's git history in both directions around the audit-pin commit: forward, for post-pin fixes that might collide with a finding; backward, for pre-pin fix attempts - reverted, incomplete or symptom-only - which are live bug-class seeds. Every mined commit is classified landed-and-stuck, reverted, symptom-only or refactor-touched. And originality is checked before and after proof investment, with verdicts pass-novel, pass-distinguished, block-published-cve and block-published-disclosure, so I drop, pivot to a distinguishing variant, or cite the prior disclosure before spending days on a PoC that collides with a public one.

The output package of a full run, concretely:

The orient set. An engage report, an intake baseline, prior-concerns clusters and per-cluster hacker briefs.
The work queues. A proof-obligation queue, an exploit queue, and language-correct engine run artifacts.
The depth and synthesis layer. A depth certificate plus synthesized multi-step exploit chains.
The verdicts. An honesty report and the 25-signal completeness verdict.
The paste. Only if the proof gates pass: a platform-shaped paste-ready file, internal labels stripped, with a six-directive impact contract block.

The system never auto-submits; the operator pastes. That last manual step is deliberate. Every finding that leaves this machine has survived eight stages, 25 completeness signals, roughly 111 submission checks and 62 compiled lessons - and then a human still reads it before it goes anywhere. That is what "trusts nothing" means in practice, and it is exactly why I trust what comes out.