Stop Blaming the Model: Why the Harness Is the Real Bottleneck in AI-Assisted Coding

Every week there’s a new model drop. A new leaderboard shakeup. A new Twitter thread declaring that this one finally writes code that doesn’t make you want to close your laptop and take a long walk. And every week, engineers jump ship to the latest thing, hoping this time their AI coding assistant will stop mangling indentation, forgetting imports, or hallucinating function signatures.

But what if the problem was never the model?

A post that recently blew up on Hacker News made a simple, almost uncomfortably obvious argument: the biggest bottleneck in AI-assisted coding isn’t the model itself, it’s the harness. The tooling that wraps the model, feeds it context, and translates its outputs into actual edits on your codebase. And a researcher named Can Bölük proved it by improving the performance of fifteen different LLMs in a single afternoon, without retraining any of them. He just changed how edits work.

This idea has been rattling around in my head ever since. Because honestly? I’ve felt this friction myself: cursing at Claude for botching a straightforward refactor, switching to Gemini mid-task, then switching back, all while assuming the model was being dumb. I never once stopped to consider that the thing between me and the model was the actual culprit. Turns out, I was blaming the pilot for the landing gear.

What we mean by “the harness”

When you use an AI coding tool, whether it’s Claude Code, Cursor, Aider, or some open-source agent you found on GitHub, there’s a layer between you and the language model that most people never think about. That layer handles a surprising number of things: how the model sees your code, what tools it has available to make edits, how errors are reported back, how state is managed across turns, and how the model’s suggested changes actually get applied to your files.

This is the harness. And it turns out, it’s where most failures actually happen.

Think about it this way. The model reads your code, understands the bug, knows exactly what line to change. Great. But now it has to express that change through whatever edit format the harness forces on it. And this is where things go sideways.

The edit format problem is worse than you think

There are roughly three mainstream approaches to how AI coding tools let models make edits, and none of them are great.

The first is the patch-style approach, where the model writes something like a diff. OpenAI’s Codex uses a variant of this. The problem? If the model hasn’t been specifically fine-tuned to produce that particular diff syntax, it fails constantly. Bölük’s benchmarks showed that Grok 4 had a patch failure rate of over 50%. Not because Grok 4 is a bad model; it just doesn’t speak that particular dialect.
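To make the failure mode concrete, here is a toy hunk applier in Python. This is my own illustration, not Codex’s actual patch format; the point is that any drift between the context the model remembers and the real file rejects the entire edit.

```python
def apply_hunk(lines: list[str], start: int, hunk: list[str]) -> list[str]:
    """Apply one diff-style hunk at 1-based line `start`.
    ' ' = context, '-' = delete, '+' = insert.
    Context and deleted lines must match the file verbatim."""
    out, i = lines[: start - 1], start - 1
    for h in hunk:
        op, body = h[0], h[1:]
        if op in " -":
            if i >= len(lines) or lines[i] != body:
                raise ValueError(f"hunk does not apply at line {i + 1}")
            if op == " ":
                out.append(body)
            i += 1
        elif op == "+":
            out.append(body)
    return out + lines[i:]

file_lines = ["def add(a, b):", "    return a + b"]

# The model mis-remembers one character of context ('a+b' vs 'a + b'),
# so the whole hunk is rejected even though the intent was correct:
try:
    apply_hunk(file_lines, 1, [" def add(a, b):",
                               "-    return a+b",
                               "+    return int(a) + int(b)"])
except ValueError as e:
    print(e)

# Reproduce the context exactly and the same edit applies cleanly:
print(apply_hunk(file_lines, 1, [" def add(a, b):",
                                 "-    return a + b",
                                 "+    return int(a) + int(b)"]))
```

A single wrong character anywhere in the context kills the edit, which is why models that weren’t trained on a given diff dialect fail at such high rates.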

The second is the string-replace approach, which is what Claude Code and many other tools use. Find the exact old text, swap in the new text. Simple in theory. Nightmarish in practice, because the model has to reproduce every character of the original code perfectly, including whitespace, indentation, and trailing spaces. If anything doesn’t match exactly, the edit fails. There’s literally a megathread of GitHub issues about the infamous “String to replace not found in file” error. Anyone who’s used an AI coding tool for more than a few hours has hit this wall.
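The mechanics are easy to sketch. This is a minimal stand-in for a string-replace edit tool (the `apply_edit` helper is hypothetical, not any vendor’s real implementation), showing how one whitespace mismatch produces the familiar error:

```python
def apply_edit(source: str, old: str, new: str) -> str:
    """Replace `old` with `new`, failing loudly on mismatch or ambiguity."""
    if source.count(old) == 0:
        raise ValueError("String to replace not found in file")
    if source.count(old) > 1:
        raise ValueError("Replacement target is ambiguous")
    return source.replace(old, new, 1)

file_text = "def greet():\n    return 'hello'\n"  # indented with four spaces

# The model "remembers" the line with a tab instead of spaces:
try:
    apply_edit(file_text, "\treturn 'hello'", "\treturn 'world'")
except ValueError as e:
    print(e)  # String to replace not found in file

# Reproduce the whitespace exactly and the same edit succeeds:
print(apply_edit(file_text, "    return 'hello'", "    return 'world'"))
```

The model understood the edit in both cases; only the character-perfect reproduction requirement separates success from failure.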

The third approach is what Cursor does: they trained an entirely separate 70B-parameter model just to handle applying edits. That’s right: the problem is so hard that a company with hundreds of millions in funding decided the answer was to throw another neural network at it. And even then, they’ve acknowledged that just rewriting the whole file outperforms their diff model for files under 400 lines.

The aider project’s own benchmarks tell a similar story. Just switching GPT-4 Turbo from one edit format to another more than doubled its success rate, from 26% to 59%. The same format applied to GPT-3.5 managed only 19%. The format matters as much as the model, sometimes more.

The hashline idea (and why it’s clever)

Bölük’s solution is elegantly simple. When the model reads a file, every line comes back tagged with a short content hash, a two- or three-character identifier tied to that line’s content:

1:a3|function hello() {
2:f1|  return "world";
3:0e|}

When the model wants to make an edit, it references those tags instead of reproducing the original text. “Replace line 2:f1 with this.” “Insert after 3:0e.” If the file has changed since the model last read it, the hashes won’t match and the edit gets rejected before anything gets corrupted.

It’s such a small shift in perspective, but it removes the entire class of problems where the model “knows” what to change but can’t express it mechanically. The model doesn’t need to reproduce old content character-by-character anymore. It just needs to recall a short tag and state what should replace it.
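A toy version of the scheme is easy to sketch. This is my own illustration, not Bölük’s implementation; the tag syntax, hash function, and two-character tag length are assumptions for demonstration:

```python
import hashlib

def tag_lines(text: str) -> list[str]:
    """Render a file the way the model would see it: each line
    prefixed with its number and a short content hash."""
    out = []
    for i, line in enumerate(text.splitlines(), start=1):
        tag = hashlib.sha1(line.encode()).hexdigest()[:2]  # 2-char tag
        out.append(f"{i}:{tag}|{line}")
    return out

def replace_line(text: str, ref: str, new_line: str) -> str:
    """Apply an edit like 'replace line 2:f1 with ...'. The edit is
    rejected if the line's content changed since the model read it."""
    num_s, want = ref.split(":")
    num = int(num_s)
    lines = text.splitlines()
    actual = hashlib.sha1(lines[num - 1].encode()).hexdigest()[:2]
    if actual != want:
        raise ValueError(f"stale edit: line {num} is :{actual}, not :{want}")
    lines[num - 1] = new_line
    return "\n".join(lines) + "\n"

src = 'function hello() {\n  return "world";\n}\n'
for tagged in tag_lines(src):
    print(tagged)

# The model references the tag instead of reproducing the old text:
ref = tag_lines(src)[1].split("|")[0]  # e.g. '2:ab'
print(replace_line(src, ref, '  return "hashline";'))
```

Note what disappears: there is no character-perfect reproduction of the old line, and a stale tag fails closed instead of corrupting the file.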

The benchmark results are striking. Across sixteen models and 180 tasks per run, hashline matched or beat string-replace for most models, and absolutely demolished the patch format. One model, Grok Code Fast 1, went from a 6.7% success rate to 68.3%, a tenfold improvement, because the old format was failing so catastrophically that the model’s actual coding ability was completely hidden behind mechanical formatting failures.

What this means for how we evaluate (and choose) AI tools

This finding should make all of us rethink how we evaluate AI coding assistants. When a tool gives you a garbage edit, the instinct is to say “this model sucks.” But the real question is: did the model fail to understand the task, or did it fail to express the answer through a clunky interface?

These are very different problems with very different solutions. One requires better models (expensive, slow, requires massive training runs). The other requires better engineering (comparatively cheap, fast, and entirely in our control).

I’ve personally wasted entire afternoons switching between models after a frustrating coding session, when the problem was almost certainly the tooling the whole time. It’s a humbling realization. We’re so conditioned to think about AI in terms of model capability that we forget the most basic principle of systems design: a system is only as good as its weakest link. And right now, for a lot of people, that weakest link is sitting right there in the harness, silently eating success rates.

I think there’s a broader lesson here about where engineering effort is best spent. The AI industry is pouring billions into making models marginally smarter, when in many real-world applications, the bottleneck is the last-mile tooling. The glue code. The harness. The boring stuff that nobody writes blog posts about, except apparently now.

It reminds me of something I’ve seen over and over in software engineering: the highest-leverage improvements often aren’t in the core algorithm. They’re in the interface between the algorithm and the real world. Database query performance isn’t just about the query planner but about how the ORM generates queries. Web performance isn’t just about the rendering engine but about how your build pipeline bundles assets. And AI coding performance isn’t just about the model but about how the harness lets the model do its job.

The open-source angle matters

There’s a subplot to this story that’s worth paying attention to. Bölük points out that no vendor will optimize their harness for a competitor’s model. Anthropic won’t tune for Grok. OpenAI won’t tune for Gemini. But open-source harnesses tune for everything, because contributors use different models and fix the failures they personally encounter.

This is a genuinely compelling argument for why AI coding tools need healthy open-source ecosystems. If the harness is as important as the model, and the evidence suggests it is, then locking down APIs and blocking third-party tools is actively counterproductive. You’re preventing the community from doing free R&D that makes your own model look better.

So what should you actually do with this information?

If you’re building with AI coding tools, a few practical takeaways. First, don’t just chase the latest model. The harness matters at least as much, probably more for day-to-day reliability. Second, if you’re hitting frequent edit failures, the problem might not be the model at all: experiment with different tools, edit formats, or wrappers before switching models. Third, if you’re building your own AI tooling, invest serious thought into the edit interface. It’s not a solved problem, and small changes here can dwarf the impact of entire model upgrades.

And finally, pay attention to open-source AI tooling projects. The aider leaderboard is a fantastic resource for understanding how different models perform with different edit formats, and projects like oh-my-pi are doing genuinely important work at the tool boundary.

The gap between “impressive demo” and “reliable daily tool” has never been about model intelligence. It’s about everything that sits between the model and your codebase. The harness problem is real, it’s measurable, and it’s the most underrated leverage point in AI-assisted engineering right now.

Next time your AI coding tool butchers an edit, take a breath before you blame the model. The smartest engineer in the room can still look incompetent if you hand them a broken keyboard.

Fix the harness.


References

  1. Can Bölük, “I Improved 15 LLMs at Coding in One Afternoon. Only the Harness Changed.” — blog.can.ac
  2. Aider Code Editing Leaderboard — aider.chat/docs/leaderboards
  3. oh-my-pi: Open-source model-agnostic coding agent — github.com/can1357/oh-my-pi