
590x More Test Code Than Production: Is SQLite Insane or Genius?

SQLite has 92 million lines of tests for 155K lines of code. A Twitter debate made me rethink what balanced testing really means.

Last week, a tweet blew up in my feed: SQLite has 155,800 lines of code and 92 million lines of tests. That’s 590 times more test code than actual code.

The replies split into two camps. One group marveled at the dedication. The other, led by formal verification advocates, pushed back: “This crossed the threshold from ‘write more tests’ to ‘use formal verification’ tens of millions of lines ago.”

I’ve landed somewhere in the middle over the years. I believe in test-verified development and solid E2E API tests: enough to catch regressions, but not so much that you’re maintaining a second codebase. SQLite’s numbers, though, made me question where exactly that line should be.

Then I listened to the CoRecursive podcast episode with Richard Hipp, and everything clicked.

The Aviation Standard That Changed Everything

Richard Hipp didn’t set out to write 92 million lines of tests. The original SQLite was a weekend project. He built it during a government contract hiatus because the database on a Navy battleship kept crashing at the worst moments.

“Suppose a pipe ruptures,” Richard explained. “You need to isolate that damage by closing valves, opening others. In that situation, they don’t want a dialog box that says ‘Cannot connect to database server.’ That’s just not what they want to see.”

So he built SQLite: a database that didn’t need a server. No network call that could fail. If the computer ran at all, the database would work.

But here’s what I didn’t know: early SQLite had plenty of bugs. When Android started shipping it on millions of devices, the crashes piled up.

Then Richard discovered DO-178B, an aviation safety standard for flight-critical software.

What Airlines Know That We Don’t

DO-178B is the quality standard for safety-critical aviation software. The kind of software where a bug means a plane falls out of the sky.

One of its key requirements: 100% MCDC test coverage. That’s Modified Condition/Decision Coverage, and it’s stronger than plain branch coverage. Every decision in the compiled binary must be taken both ways, and every individual condition inside a decision must be shown to independently affect the outcome. Not just “this function was called,” but “every branch in the machine code was exercised, in both directions.”
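To make that concrete, here’s a toy sketch (my own illustration, not SQLite code) of why MCDC demands more test cases than branch coverage for a decision with two conditions:

```python
# Toy illustration of MCDC vs. branch coverage (hypothetical function,
# not from SQLite): the decision below combines two conditions.

def can_write(page_is_dirty: bool, journal_is_open: bool) -> bool:
    # A compound decision: both conditions must hold.
    return page_is_dirty and journal_is_open

# Branch coverage is satisfied by just two cases: one where the decision
# is True and one where it is False.
branch_cases = [(True, True), (False, False)]

# MCDC additionally requires case pairs where changing ONE condition,
# holding the other fixed, flips the outcome -- proving each condition
# independently affects the result.
mcdc_cases = [
    (True, True),   # baseline: decision is True
    (False, True),  # flipping only the first condition flips the outcome
    (True, False),  # flipping only the second condition flips the outcome
]

for dirty, journal in mcdc_cases:
    print(dirty, journal, "->", can_write(dirty, journal))
```

Notice that `(False, False)` satisfies branch coverage but tells you nothing about which condition caused the `False`; MCDC’s extra cases close that gap, and it compounds fast across a codebase full of compound conditions.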

Richard decided to apply this standard to SQLite. Not because he had to. Because he wanted the bugs to stop.

“It took a year of 60-hour weeks,” he said. “Getting to 95% coverage is pretty easy. That last 5%? Really, really hard.”

Here’s the kicker: once they hit 100%, the bug reports from Android stopped. Just… stopped.

“We just didn’t really have any bugs for the next eight or nine years.”

Eight. Years.

“You’re Just Writing Code to Test Code”

This is where the formal verification folks enter the chat.

Paul Snively’s response to the 590x stat was sharp: “This crossed the threshold from ‘write more tests’ to ‘use formal verification’ tens of millions of lines ago.”

His argument: at some point, you’re writing so much test code that you’re essentially building a second system to verify the first. Test code has bugs too. Who tests the tests?

Formal verification offers a different promise: mathematical proofs that your code is correct. Not “we ran a billion scenarios and nothing broke,” but “we proved this cannot break.”
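Here’s a miniature of what that promise looks like in practice (my own sketch in Lean, unrelated to seL4’s actual Isabelle/HOL proofs): rather than testing a property on sample inputs, you prove it for all inputs at once.

```lean
-- A tiny formal proof (Lean 4): reversing a list twice returns the
-- original list, for EVERY possible list -- no test inputs needed.
theorem double_reverse (xs : List Nat) :
    xs.reverse.reverse = xs :=
  List.reverse_reverse xs
```

If this compiles, the property holds universally. The catch, as the next section argues, is that the property you prove is only as good as the model of the world it lives in.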

Projects like seL4 (a formally verified microkernel) show it’s possible. They built proofs showing their implementation matches its specification perfectly.

But here’s the counterargument from the replies: “The problem with formal verification is that the world is messy.”

Databases talk to disks. Disks fail in weird ways. Storage layers have quirks that no formal model captures. You can prove your logic is correct all day, but can you prove the SSD won’t corrupt data in a way nobody’s seen before?

Where I’ve Landed (For Now)

My own thinking on testing has evolved a lot.

Early in my career, I barely tested. “It works on my machine” was good enough. Then I joined a team that practiced TDD religiously, and I saw the confidence that comes from a green test suite. I became a convert.

But I’ve also seen the other extreme: teams drowning in flaky tests, spending more time debugging CI than shipping features. Tests that test implementation details so tightly that any refactor breaks fifty specs.

These days, I believe in a balanced approach:

Test-verified development for critical logic. Not TDD in the strict “write test first for everything” sense, but ensuring the important paths are covered before shipping.

E2E API tests for integration points. These catch the real-world failures, the messy interactions between systems that unit tests miss.

And contextual depth. Every CRUD operation deserves test coverage. The question is what kind and how deep. A user authentication endpoint gets full edge-case coverage. A simple config getter gets a smoke test. Context determines the investment, not some blanket rule about “testing less.”
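Here’s what that contextual depth looks like in a test file (a hypothetical sketch with made-up app code, not from any real project): the auth check gets edge-case coverage, the config getter gets one smoke test.

```python
# Sketch of contextual test depth. Both functions below are hypothetical
# stand-ins for "critical path" vs. "trivial helper" code.

USERS = {"alice": "s3cret"}

def authenticate(username: str, password: str) -> bool:
    # Critical path: deserves full edge-case coverage.
    stored = USERS.get(username)
    return stored is not None and stored == password

def get_config(settings: dict, key: str, default=None):
    # Trivial helper: earns only a smoke test.
    return settings.get(key, default)

# Auth: happy path, wrong password, unknown user, empty inputs.
assert authenticate("alice", "s3cret") is True
assert authenticate("alice", "wrong") is False
assert authenticate("bob", "s3cret") is False
assert authenticate("", "") is False

# Config getter: a single smoke test plus a default check is enough.
assert get_config({"debug": True}, "debug") is True
assert get_config({}, "debug", default=False) is False

print("all contextual tests passed")
```

The investment mirrors the blast radius: four assertions where a bug locks users out, one or two where a bug is a wrong default.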

SQLite’s 590x ratio makes sense for SQLite. It runs on billions of devices, stores your messages, powers your browser. A bug affects millions instantly.

My web app? Different stakes, different strategy.

What Richard Actually Taught Me

The real insight from the podcast wasn’t about hitting some magic coverage number. It was this quote:

“Freedom means taking care of yourself.”

Richard builds his own tools. His own source control. His own bug tracker. Not because he has to, but because it gives him control. No dependency can fail him.

The 92 million lines of tests aren’t about testing for testing’s sake. They’re about confidence. The confidence to ship a new feature without breaking the billions of devices that depend on SQLite.

Formal verification advocates aren’t wrong that there might be a better way. Maybe, eventually, we’ll prove correctness instead of testing it.

But here’s what’s changed since Richard spent that year grinding to 100% coverage: AI tools have dramatically lowered the barrier to comprehensive testing. Writing tests used to be tedious. Now, tools like GitHub Copilot, Claude, and others can generate test cases, suggest edge cases you didn’t think of, and help you hit coverage targets in a fraction of the time they once took.

The 590x ratio might have seemed insane a decade ago. Today? It’s more achievable than ever. Not just for well-funded teams, but for solo developers and small startups.

Richard’s approach has a track record: essentially no bugs for eight or nine years.

That’s not a number. That’s trust. And now, that level of trust is within reach for more of us.

