Engineering · 7 min read

The false positives were the hard part.

We benchmarked our proactive scan on 28 fresh cases: 20 planted defects, 8 clean controls. 20/20 found, 0/8 false alarms. Full methodology, including what it still misses.

benchmark, proactive-scan, ai-debugging

On Tuesday we shipped Proactive Scan: one command that ranks your riskiest files and reads each with cross-file context, before anything crashes. Today we're publishing the benchmark behind it. All of it: the numbers, the methodology, the part where our first version cried wolf, and the kinds of bugs it still misses.

The Result

28 test cases, written fresh for this benchmark. 20 files with planted defects, 8 clean files as controls.

| Category | Score | What was planted |
|---|---|---|
| Single-file defects | 8/8 | SQL injection, resource leak, division by zero, mutable default arg, bare except, missing await, off-by-one, hardcoded secret |
| Cross-file defects | 12/12 | wrong argument count, missing export, return-shape mismatch, type mismatch, zero-value config, interface drift, async race, renamed function, generic type misuse, optional and null flow, config modulo, callback arity |
| False alarms on clean files | 0/8 | error handling, executemany, type narrowing, async patterns, optional chaining, class design |

Run on July 2, 2026 against our staging API, which runs the same pipeline as production: Voyage embeddings for the codebase index, Claude Haiku 4.5 for analysis.

Note: This is our corpus, scored by us, with n=28. It says the scan does what we designed it to do. It does not say we beat anyone. We ran no competitors, so you will find no competitor numbers here.

Why We Wrote the Cases From Scratch

The lazy way to build this benchmark is to grab known bugs from popular open-source repos. We had exactly that lying around from an internal regression test. We didn't publish it, for one reason: models have read those repos. A detector that recognizes a famous React bug from training data is doing recall, not analysis.

So every one of the 28 cases was written fresh in late June. Original code, original planted defects, patterned after real-world failure classes but not copied from anything public. No model has seen them.

What a Cross-File Case Looks Like

The single-file cases are table stakes. Any decent linter flags a bare except. The 12 cross-file cases are the reason the scan exists, because they're invisible unless the tool knows the rest of your project. Here is XF-05, two files, trimmed:

python
# limits.py
# requests per window; 0 disables (but callers divide by it)
WINDOW_SLOTS = 0

python
# rate.py
from limits import WINDOW_SLOTS

def per_slot(total):
    return total / WINDOW_SLOTS

rate.py is flawless on its own. Every line is correct. The bug lives in the relationship: a config value in another file is zero, and this file divides by it. Point a single-file tool at rate.py and it shrugs. The scan indexes both files, pulls limits.py in as context when it reads rate.py, and reports the ZeroDivisionError waiting to happen.

That's the pattern for all 12 cross-file cases: the buggy file is locally clean and the defect only exists given a fact from elsewhere in the project.
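To make "locally clean, defect lives elsewhere" concrete, here is a minimal static sketch of the XF-05 relationship using Python's standard `ast` module. It is not how the scan works (the scan reads files with a language model plus retrieved context); it is a hand-written check for this one narrow pattern, and the helper names `zero_constants` and `risky_divisions` are hypothetical.

```python
import ast

# The two XF-05 files, inlined as strings so the sketch is self-contained.
LIMITS_PY = "# requests per window; 0 disables\nWINDOW_SLOTS = 0\n"
RATE_PY = (
    "from limits import WINDOW_SLOTS\n"
    "\n"
    "def per_slot(total):\n"
    "    return total / WINDOW_SLOTS\n"
)


def zero_constants(source: str) -> set[str]:
    """Return module-level names assigned a literal numeric zero."""
    names: set[str] = set()
    for node in ast.parse(source).body:
        if not (isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant)):
            continue
        value = node.value.value
        # bool is a subclass of int, so exclude False explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool) and value == 0:
            names.update(t.id for t in node.targets if isinstance(t, ast.Name))
    return names


def risky_divisions(user_src: str, config_src: str, config_module: str) -> list[tuple[int, str]]:
    """Return (line, name) for each division whose divisor is an imported zero constant."""
    zeros = zero_constants(config_src)
    tree = ast.parse(user_src)

    # Map each local name to the name it was imported under, so aliases resolve.
    origin: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module == config_module:
            for alias in node.names:
                origin[alias.asname or alias.name] = alias.name

    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Div, ast.FloorDiv, ast.Mod)):
            divisor = node.right
            if isinstance(divisor, ast.Name) and origin.get(divisor.id) in zeros:
                findings.append((node.lineno, divisor.id))
    return findings


# Reading rate.py alone cannot flag this; reading it together with limits.py can.
print(risky_divisions(RATE_PY, LIMITS_PY, "limits"))  # [(4, 'WINDOW_SLOTS')]
```

The point of the sketch is the function signature: the check needs both sources as input. Any tool that receives only `rate.py` has nothing to compare the divisor against.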

How Scoring Works

Each case declares an expected signal: keywords and an issue class the finding must mention (for XF-05: "division by zero" or the offending constant's name). A case counts as detected when the scan reports a finding on the planted file matching that signal. A clean file passes when the scan returns nothing.

Lenient keyword matching means we're grading "did it see the problem", not "did it phrase the problem the way we would". We wrote both the cases and the grader, which is a real conflict, and the mitigation is transparency: the corpus, the expected signals and the runner are published, so you can re-run it and disagree with our scoring.
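The two scoring rules can be sketched in a few lines. This is an illustration of the rules as described here, not the published runner: the `Case` and `Finding` shapes and the `case_passes` name are assumptions, and the sketch models only the keyword part of the expected signal, not the issue class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    case_id: str
    planted_file: Optional[str]      # None marks a clean control
    keywords: tuple = ()             # any one of these counts as a match


@dataclass
class Finding:
    file: str
    message: str


def case_passes(case: Case, findings: list) -> bool:
    """Apply the two scoring rules to the findings the scan returned for one case."""
    if case.planted_file is None:
        # Clean control: passes only when the scan returns nothing.
        return len(findings) == 0
    for finding in findings:
        if finding.file != case.planted_file:
            continue  # a finding on the wrong file does not count
        text = finding.message.lower()
        if any(keyword.lower() in text for keyword in case.keywords):
            return True
    return False


xf05 = Case("XF-05", "rate.py", ("division by zero", "WINDOW_SLOTS"))
clean06 = Case("CLEAN-06", None)

hit = Finding("rate.py", "per_slot divides by WINDOW_SLOTS, which limits.py sets to 0")
print(case_passes(xf05, [hit]))     # True
print(case_passes(clean06, []))     # True
print(case_passes(clean06, [hit]))  # False
```

Case-insensitive substring matching is what makes the grading lenient: the finding above never says "division by zero", but it names the offending constant, so it counts.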

The False Positive War

The headline number everyone asks about is detection. The number that decides whether anyone keeps the tool installed is false alarms. A scan that flags healthy code trains you to ignore it, and then it's worse than no scan.

Our git history tells the story honestly. Three shipped prompt versions in five days:

  1. scan-v3. First version with the benchmark in place. Found everything, and also "found" style preferences in clean files. A clean error-handling file got flagged for not re-raising. Wolf, cried.
  2. scan-v4. Carved out an entire false-positive class: code that handles external or unknown input defensively was being flagged for the input being unknown. The fix was prompt discipline: a finding must name a concrete failure, with concrete inputs, reachable from the code as written. "This could be risky" is not a finding.
  3. scan-v5. One control file (CLEAN-06, async patterns) still produced an occasional hedge-flag. Hardened the instruction: if the code is correct, say nothing. Zero findings is a valid, good answer.
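One way to picture the scan-v4 and scan-v5 rules is as a shape a finding has to fill in. The real fix was prompt wording, not code, so treat this as a hypothetical sketch: the `Finding` fields and the `reportable` helper are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    failure: str    # the concrete failure, e.g. "ZeroDivisionError"
    inputs: str     # concrete inputs that trigger it
    location: str   # where the failure is reached in the code as written


def reportable(findings: list) -> list:
    """Keep only findings that name a failure, its inputs, and where it is reached."""
    return [
        f for f in findings
        if all(part.strip() for part in (f.failure, f.inputs, f.location))
    ]


concrete = Finding(
    failure="ZeroDivisionError",
    inputs="any total while WINDOW_SLOTS is 0",
    location="rate.py, per_slot",
)
# "This could be risky": no failure named, no inputs given.
vague = Finding(failure="", inputs="", location="handlers.py")

print(len(reportable([vague, concrete])))  # 1
print(reportable([vague]))                 # []
```

An empty list is a normal return value here, which mirrors the scan-v5 rule that zero findings is a valid answer.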

That last sentence took the longest to make the model believe. Language models want to be helpful, and "I found nothing" doesn't feel helpful. Getting a model to stay quiet on 8 clean files was harder than getting it to talk about 20 buggy ones.

What It Misses

The benchmark tests what we planted. It says nothing about several bug classes we know are out there, and honesty about the boundary is the point of publishing:

  • Bugs that need runtime data. A race that only appears under load, a leak that needs a traffic pattern. Static reading won't catch these; that's what the debugger side of DebugAI is for, after they fire.
  • Spec bugs. Code that does exactly what it says, where what it says is not what the business needed. No tool without your intent can catch these.
  • Scale. Fixture files are small. Production files with 2,000 lines and 40 imports are a harder retrieval problem. We test that separately and it is not in this number.
  • Languages. The corpus is Python and JavaScript/TypeScript, which is what DebugAI supports. Nothing here generalizes beyond that.

Warning: 28 cases is a benchmark, not a proof. A perfect score on 28 cases means the mechanism works, not that the mechanism is perfect.
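To put a rough number on how much 28 cases can support, here is a small worked calculation using the Wilson score interval at 95% confidence. The choice of interval is ours for illustration and was not part of the benchmark; it also treats the cases as independent samples, which a hand-written corpus is not, so read the output as a sanity check rather than a guarantee.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


detect_lo, detect_hi = wilson_interval(20, 20)  # 20 of 20 planted defects found
alarm_lo, alarm_hi = wilson_interval(0, 8)      # 0 of 8 clean files flagged

print(f"detection rate:   {detect_lo:.2f} to {detect_hi:.2f}")
print(f"false-alarm rate: {alarm_lo:.2f} to {alarm_hi:.2f}")
```

Under those assumptions, 20/20 is consistent with a true detection rate as low as about 84%, and 0/8 is consistent with a true false-alarm rate as high as about 32%. Eight controls is a small sample, which is why the false-alarm number deserves the most scrutiny.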

Reproduce It

The corpus (all 28 cases with expected signals), the runner, and the raw results JSON are public: github.com/1shizaan/debugai-scan-benchmark. Point the runner at your own API key and re-run:

bash
python3 run.py --api-key dbg_your_key_here

If you find a case where our scoring flatters us, open an issue. We'll take the hit in public.

FAQ

Q: Why not compare against Copilot or Semgrep?

A: Because a fair head-to-head is its own project: same corpus, each tool run the way a real user runs it, documented versions, blind scoring, published losses. Rushing that for a launch week would produce exactly the kind of number we don't trust from other vendors. If we do it, it gets its own post.

Q: Doesn't writing your own test cases guarantee a good score?

A: It's a real risk, and the honest answer is: partially. We designed cases the scan should catch. The controls are the check on that: a scanner tuned to flag everything would ace detection and fail all 8 clean files. Zero false positives is the half of the score we couldn't game without it showing in the published evidence.

Q: Which model runs the scan?

A: Claude Haiku 4.5 with our scan-v5 prompt, over a Voyage-embedded index of the project. Same pipeline on the free tier.


Day 4 of Ship Week. The tracker with everything shipped so far: debugai.io/week

