Three diagnoses. All of them plausible. All of them wrong.
We’ve been building something internally, not because a client asked us to, not to fill a pipeline gap or prove a concept, but because we wanted to stay sharp. The kind of building where you’re using your own tools, in real conditions, with real stakes, and nobody’s waiting on you except you. It keeps your instincts honest in a way that client work, for all its pressure, sometimes can’t.
At some point during that process, we hit a bug. A multi-city itinerary was only rendering two cities instead of four, and a date was showing up wrong in a way that was obviously broken but frustratingly invisible. So we did what everyone does now. We handed it to the AI.
The first diagnosis was coherent: a save-path issue, probably a mount freeze overwriting data on load. Plausible, well-reasoned, and wrong. The second was a merge preference bug in a specific file at a specific line range, also plausible, also wrong. The third was a variation on the second with different line numbers. Each one arrived with confidence, each came with a proposed fix, and none were validated against anything observable before being offered as the answer. Because each failed attempt stayed in the conversation history, the next diagnosis was built on top of the previous wrong ones, the context window filling with the residue of bad theories, the AI reasoning on top of its own mistakes with no awareness that was happening.
The failure mode isn’t confusion. It’s fluency.
When AI doesn’t know something, it rarely says so in a way that registers. What it does instead is generate a confident, well-structured, internally consistent explanation of the wrong thing, and because the explanation is fluent, it reads like knowledge. It reads like the answer someone gives when they’ve seen this problem before.
This is where it gets complicated. AI doesn’t just fool the people watching it. It fools the people running it too, especially those who are new enough to the work that they haven’t developed a calibrated sense of what right actually feels like. One of the stranger side effects of this moment is that access to AI tools has given a lot of people the sensation of expertise without the underlying experience that makes expertise real. If you’ve never built enough things to develop a nose for when something confident is actually just fluent, you have no instrument for catching the difference. Experience isn’t only about knowing the right answers. It’s about having been wrong in enough specific ways that you’ve developed pattern recognition for when something is about to go sideways, even when it sounds completely certain.
What broke the loop in our case wasn’t better prompting. It was stopping entirely and going to look at the actual data, reading the stored values directly from the database, demanding observable evidence before trusting any further reasoning from a system that had already been wrong three times without knowing it. The stored data was correct. That single observation collapsed both theories simultaneously, narrowed the problem to a specific gap between reading and rendering, and the fix turned out to be one line.
Three confident diagnoses spanning hours. One line. That’s the ratio.
The last 20% has always been the hard part
Pareto figured this out over a century ago. The last twenty percent of any hard thing consumes a disproportionate share of everything you have, and AI hasn’t changed that ratio. It’s just made the first eighty percent radically more accessible. Faster, cheaper, available to more people than ever before. That’s genuinely remarkable.
But accessibility cuts both ways. The same tools that get a first-time founder to a working prototype in weeks also compress the distance between having built something and knowing how to finish it. The 80% is democratized. The 20% is not. That last stretch, where the product needs to stop being a collection of correct pieces and start being a coherent thing, where judgment calls determine whether it actually works, still requires the kind of pattern recognition that only comes from having been there before.
The ceiling is real, and it’s not a technical problem. It’s a judgment problem, and judgment is exactly what gets counterfeited by AI fluency in a way that fools people who haven’t yet earned the pattern recognition to catch it. Knowing when to stop trusting the output, when the conversation has gone stale, when confident explanation isn’t the same as confirmed fact: these aren’t skills you can prompt into existence.
That’s what we bring. Not just the speed, because anyone can get you the speed now. The judgment that makes the speed matter.
