
Why 100% Agentic Solutions Suck (and Will Keep Sucking)

We all hear how developers will be obsolete. I tried 100% agentic and lost three days rolling back. Here is the human-in-the-loop AI autonomy gradient that actually ships.

By Vadim Sharapov · 8 min read
agentic-coding · ai-context-engineering · founder-essays

We all hear how developers will be obsolete. I tried to ship a feature 100% agentic and spent three days rolling back what got merged overnight. The lesson was not that agents are useless. The human-in-the-loop AI question is a gradient, not a switch, and I had picked the wrong rung.

The output looked fine on a per-file basis. Tests were green. Commit messages were clean. But the changes drifted from how the codebase was actually shaped and where it was actually going. I realized, after the third day of unwinding, that the failure was not about the model. It was about what the model could not see.

The terms, defined plainly

Five terms that need to mean the same thing for the rest of the piece to land.

Agent

A large language model that can call tools and take actions, not just answer questions in a chat box.

Tool-use

The model invoking a function or API the runtime exposed to it — for example, reading a file, running a test, or opening a pull request.

Autonomy gradient

A scale from "the human executes every action and the model only suggests" to "the model executes every action and the human only reviews after the fact." Each rung is a different working relationship.

Architectural model

The mental picture of how the codebase is structured today — which module owns what, which boundaries are load-bearing, which patterns are deliberate, which are accidents nobody has cleaned up yet.

Future-state model

The mental picture of where the codebase is going — the next refactor, the deprecation already scheduled, the abstraction that exists so the next feature is cheap.

The architectural model is "what the code is." The future-state model is "what the code is becoming." Both live in the human's head between sessions. Neither survives an agent's cold start on its own.

The autonomy gradient — three rungs, not a switch

Most "human-in-the-loop AI" debates collapse the gradient into a binary: the human types every line, or the agent runs unattended. That framing throws away the useful middle. There are three working rungs, and each is good at a different shape of task.

Tool-using assistant

Good at: generating a contained function, refactoring inside one file, writing a test the human will read line by line.

Fails at: anything that touches more than one boundary, anything that requires knowing why the boundary exists.

Task class: small, local, reviewable in one screen.

Supervised agent

Good at: running a multi-step plan the human approved, accepting or rejecting each step before the next.

Fails at: staying on the approved plan once that plan hits a wall. The agent drifts onto a different plan, because it does not model author intent: it follows the literal instruction and adds.

Task class: medium tasks where each step is a checkpoint.

Autonomous agent

Good at: mechanical, well-fenced work — formatting passes, dependency bumps with a green test suite, batch edits across files where the rule is uniform.

Fails at: anything that depends on the architectural or future-state model. The agent defaults to additive behavior: it adds files rather than deleting the file that should have been deleted.

Task class: repetitive, additive, low-stakes.

The mistake I made was using the autonomous-agent rung for a feature that lived on the architectural and future-state models. The agent did exactly what the prompt said. It added the new module, the new types, the new tests. What it did not do — because nothing in its context told it to — was delete the older module the new one was meant to replace, or align the new types with the deprecation scheduled for next month.

A human reviewer who had been in the codebase last week would have caught it in a five-minute glance. The agent could not glance — it did not have last week.

What agents cannot carry across cold sessions

An agent has no working memory between sessions. Every cold start is a new hire on day one. The model has its training and the documents you put in front of it — not the conversation from yesterday, the decision from the standup, or the architectural argument that produced the boundary it is now stepping over.

Two pieces of context matter most for code work, and a cold-start agent cannot reconstruct either on its own:

  1. The architectural model

    The agent can read the file tree. It cannot tell you which boundaries are load-bearing and which are leftovers. Pattern-matching on file names is not the same as knowing why the pattern exists.

  2. The future-state model

    The agent can read the code that is there. It cannot tell you what is on the way out, what is on the way in, or which abstraction was put in place last sprint because of work planned for next sprint.

Both models live, by default, in the heads of people who have been in the codebase. If they stay there, every agent session starts blind. The fix is not a smarter agent. The fix is to externalize both — write them down, in the repo, in files an agent reads on every cold start.

Externalizing the architectural model

The architectural model becomes a small set of repo files that agents read at the start of every session. The names matter less than the contract.

  • A top-level HANDOFF.md that names the current working state — what shipped, what is in flight, what is broken, what is being left alone.
  • A FILE_MAP.md that lists every load-bearing module with one sentence on what it owns and one on what it does not.
  • An AGENT-CONTEXT-MAP.md that names the boundaries — the contracts between modules — and the rules an agent must not cross without explicit permission.

The exact filenames are not sacred. The point is the architectural model has to exist as a document the agent reads on every cold start, instead of as folklore in a senior developer's head. Patterns in the Anthropic SDK Python repo make this concrete: agents work best when the system prompt and surrounding context carry the load-bearing constraints, not when the model is asked to infer them from code alone (anthropics/anthropic-sdk-python).
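As a minimal sketch of that contract, assuming the three filenames from the list above sit at the repo root: the hypothetical helper `build_cold_start_context` joins them into one preamble and refuses to proceed if any is missing. How the preamble reaches the agent (system prompt, first message, tool result) is left open here, since the article does not specify it.

```python
from pathlib import Path

# Filenames taken from the list above; they are a convention, not a requirement.
CONTEXT_FILES = ("HANDOFF.md", "FILE_MAP.md", "AGENT-CONTEXT-MAP.md")


def build_cold_start_context(repo_root: Path, filenames=CONTEXT_FILES) -> str:
    """Concatenate the architectural-model files into one preamble.

    Hypothetical helper. It raises FileNotFoundError when a file is
    missing, so a session never starts silently without the model.
    """
    missing = [name for name in filenames if not (repo_root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing context files: {', '.join(missing)}")

    sections = []
    for name in filenames:
        text = (repo_root / name).read_text(encoding="utf-8").strip()
        # Label each section with its filename so the agent can cite it.
        sections.append(f"## {name}\n\n{text}")
    return "\n\n".join(sections)
```

Failing loudly on a missing file is a design choice in this sketch: a session that starts without the architectural model is the failure the article describes, so it seems better to stop than to run blind.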

If the architectural model lives only in the senior developer's head, every agent session is a junior developer's first day.

A junior developer on day one can write a clean function. A junior developer on day one cannot tell you which clean function should not have been written because the module is on the deprecation list. Neither can an agent. Both need the document.

Externalizing the future-state model

The future-state model is harder, because most teams do not write it down at all. It lives in tickets, Slack threads, and the architect's gut. None of those survive a cold start.

The pattern that works is a small directory called docs/tasks/features/ — one short markdown file per planned feature, written before the feature ships. Each file names: the problem, the chosen direction, the boundaries the new code will respect, the modules it will replace, the modules it will leave alone, and the deprecation scheduled when the new code lands.
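One way to keep those files uniform is to generate them from a small record. The sketch below is an assumption, not the author's tooling: the class name `FeaturePlan`, its field names, and the section headings are hypothetical, chosen to mirror the six items listed above.

```python
from dataclasses import dataclass, field


@dataclass
class FeaturePlan:
    """One planned feature, mirroring the six items a feature file names."""

    name: str
    problem: str
    direction: str
    boundaries: list[str] = field(default_factory=list)
    replaces: list[str] = field(default_factory=list)
    leaves_alone: list[str] = field(default_factory=list)
    deprecation: str = "none scheduled"

    def to_markdown(self) -> str:
        def bullets(items: list[str]) -> str:
            # An empty list is rendered explicitly so the section is never blank.
            return "\n".join(f"- {item}" for item in items) or "- (none)"

        sections = [
            ("Problem", self.problem),
            ("Chosen direction", self.direction),
            ("Boundaries to respect", bullets(self.boundaries)),
            ("Modules replaced", bullets(self.replaces)),
            ("Modules left alone", bullets(self.leaves_alone)),
            ("Deprecation", self.deprecation),
        ]
        body = "\n\n".join(f"## {title}\n\n{text}" for title, text in sections)
        return f"# {self.name}\n\n{body}\n"
```

Writing `plan.to_markdown()` to `docs/tasks/features/<name>.md` would produce a file with every section present, including an explicit "(none)" where nothing is replaced, which makes an omitted decision visible instead of silent.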

When the agent is asked to ship a piece of that feature, the relevant docs/tasks/features/<name>.md is part of the context. The agent now has the same future-state picture the human has. Its outputs stop being additive-only — it stops adding the new module while leaving the old one untouched, because the document told it the old one is on the way out.

This is what I was missing in the three-day rollback. There was no future-state document. The autonomous agent built additively because the only thing it had was the present-state codebase. It produced clean code that pointed in the wrong direction. That is what additive behavior produces when nothing tells it what to subtract.

What the human is actually doing

Once both models are externalized, the human's job is not "write every line" or "review every diff." It becomes architectural drift review and accountability.

Architectural drift review means looking at the agent's output not for whether the code compiles, but for whether the code is moving the codebase toward the future state the documents describe. It is a different review than line-by-line — faster, and it catches the failure mode line-by-line keeps missing: every individual line can be correct and the whole change can still be in the wrong direction.
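Part of that review can be mechanized as a first pass. The sketch below is illustrative only: `ChangeSet` and `drift_findings` are hypothetical names, paths are compared by exact string match, and it covers just two signals taken from the feature file (modules that should be gone, modules that should be untouched). It does not replace the human judgment about direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeSet:
    """Paths touched by one proposed change, grouped by kind."""

    added: frozenset[str]
    modified: frozenset[str]
    deleted: frozenset[str]


def drift_findings(change: ChangeSet, replaces: list[str], leaves_alone: list[str]) -> list[str]:
    """Compare a change against the plan and list where it drifts."""
    findings = []

    # Additive-only drift: the plan says a module is replaced, yet it survives.
    for path in sorted(replaces):
        if path not in change.deleted:
            findings.append(f"not removed: {path} (plan lists it as replaced)")

    # Boundary drift: the change touches something the plan fenced off.
    touched = change.modified | change.deleted
    for path in sorted(leaves_alone):
        if path in touched:
            findings.append(f"touched: {path} (plan says leave it alone)")

    return findings
```

An empty result only means these two checks passed. Whether the new code actually points toward the future state is still the reviewer's call.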

Accountability is simpler. Somebody has to own the merge. The agent does not. The agent will not be in the post-mortem when something breaks in production. The human-in-the-loop AI pattern only works if a human's name is on the pull request and a human's calendar gets the incident page. Otherwise the gradient collapses into "nobody is responsible," which is the worst rung of all.

Picking the rung the task deserves

The actionable version is a short rule.

  1. Tool-using assistant

    If the task is contained and reviewable in one screen, the tool-using assistant rung is right. The human is in the loop on every keystroke.

  2. Supervised agent

    If the task is multi-step but every step is a checkpoint, the supervised agent rung is right. The human approves the plan and accepts each step before the next runs.

  3. Autonomous agent

    If the task is mechanical, additive-by-design, and well-fenced, the autonomous agent rung is right. The human reviews the result.

What you do not do is pick the autonomous rung for a task that lives on the architectural or future-state model unless both models are externalized in the repo. If they are not, the agent has no way to carry them across the cold start. The output will be clean code in the wrong direction, and the rollback cost will eat any speed you thought you were gaining.
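The rule above can be written down as a small decision function. This is a sketch under stated assumptions: the `Task` fields are hypothetical yes/no simplifications of the article's criteria, the order of the checks is a choice made here, and falling back to the supervised rung when nothing else fits is an assumption, not something the text prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Rung(Enum):
    TOOL_USING_ASSISTANT = "tool-using assistant"
    SUPERVISED_AGENT = "supervised agent"
    AUTONOMOUS_AGENT = "autonomous agent"


@dataclass(frozen=True)
class Task:
    fits_one_screen: bool                 # contained, reviewable in one screen
    mechanical_and_fenced: bool           # mechanical, additive by design, well-fenced
    depends_on_models: bool               # needs the architectural or future-state model
    models_externalized: bool             # both models are written down in the repo


def pick_rung(task: Task) -> Rung:
    """Pick the least autonomous rung that fits the task."""
    if task.fits_one_screen:
        return Rung.TOOL_USING_ASSISTANT

    # Autonomy is allowed only when the agent can see everything it needs.
    context_is_safe = (not task.depends_on_models) or task.models_externalized
    if task.mechanical_and_fenced and context_is_safe:
        return Rung.AUTONOMOUS_AGENT

    # Assumed default: everything else runs with per-step checkpoints.
    return Rung.SUPERVISED_AGENT
```

For example, a dependency bump with externalized models returns `Rung.AUTONOMOUS_AGENT`, while the same mechanical task that depends on an unwritten deprecation plan drops back to `Rung.SUPERVISED_AGENT`.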

"Developers are obsolete" misreads the gradient. The agent does not replace the developer. The agent shifts the developer's work from typing toward externalizing context, picking the rung, and owning the merge. The developers who get good at that get faster. The ones who keep typing every line — or hand the keys over and walk away — are the ones who lose three days to a rollback.


Vadim Sharapov is the founder of Loomaru, revenue recovery infrastructure for Shopify stores.