Skip to main content

Snowball of Optimization: A 3-Week Codebase Refactor for AI

I realyzed after two weeks I had shipped no features — only refactored. Here is why ai context engineering on a messy codebase weaponizes your agent against you.

By Vadim Sharapov9 min read
ai-context-engineeringagentic-codingdeveloper-workflow

Last two weeks I realyzed I had shipped no features — only refactored. 3 weeks of depression, in fact, spent moving files instead of writing new ones. The funny part is that this was the most important coding work I have done all year. The unfunny part is that I did not see it until day eighteen, by which point I had convinced myself I was procrastinating. I was not. I was doing ai context engineering on a codebase that had silently grown into a trap for the agent I was paying to help me ship. This is the story of why that happened, what I changed, and why scoped loading is the durable principle, not a bigger context window.

What I mean by "ai context engineering"

Before the war story, plain definitions, because half this conversation is people meaning different things with the same words.

Refactor

Rewriting code without changing what it does. Same inputs, same outputs, cleaner shape.

Tech debt

Shortcuts you took earlier that cost you later. Every "I will clean this up next week" deposit accrues interest.

Codebase

The full set of source files that make up a product — every file the agent might be asked to read or change.

CI

Continuous Integration. The automated checks that run every time you push, before code merges.

MVP

Minimum Viable Product. The smallest version of a thing that proves the core idea works in front of real users.

The phrase ai context engineering is the part I had to learn the hard way. It is not prompt engineering. It is not "use a longer context window". It is the boring work of organizing a repo so that when an agent reads a slice of it for a task, that slice is enough. Nothing extra. Nothing missing. The agent is a literal, additive pattern matcher — give it three contradictory copies of the same idea and you get an additive blend of three contradictory copies, not a synthesis. There is no author intent in the loop, only what the bytes say.

I had not done this work. I had a working MVP and a 12,000-line repo. The two facts looked unrelated. They were not.

Why "best practices" weren't enough

The story begins on a Sunday. The MVP was live. The pipeline was sending events. Meta was receiving them. Two months earlier I had wired up the Meta Conversions API param builder SDK so the payload carried every supported parameter cleanly. The result:

Events Manager's match-quality dial moved from yellow into green. The MVP worked. Then I looked at my codebase.

What I saw was the obviouse problem nobody warns you about when you pair with an agent every day for six months. Three slightly different copies of the param-mapping logic. Two router files that both thought they were the entry point. A "utils" folder that had become a graveyard of one-off helpers, each written by an agent in a different week, each solving a problem that already had a solution four files away. The codebase was not broken. It was just heavy. Every time I asked the agent to add a new event type, it had to read more and more before it could write less and less.

Week one I told myself I would clean it up "while shipping". This is the lie every founder tells themselves about tech debt. By Friday I had merged two router files into one, deleted forty helper functions, and shipped zero features. I felt productive. I was not. I was paying down interest.

Week two I admitted I was not shipping. I called it a refactor sprint. I gave myself seven days. I missed it. The codebase was tangled in a way that did not yield to surgical edits. Pull one thread, three others came loose. Best practices — extract a function, rename a variable — were not enough. The shape was wrong, not the contents.

Week three I stopped pretending. I opened the whole repo and did the only thing that worked: rewrote the architecture from the outside in.

Modules, a thin router, and tests — the discipline for ai context engineering

The discipline that actually unstuck the codebase was three rules, written on a sticky note, taped to my monitor.

  1. One module per concept

    Every domain idea — event normalization, signature signing, retry policy, dedupe — lives in exactly one file. If a sibling file does the same thing, the sibling file is wrong and gets deleted.

  2. A thin router that does no business logic

    The router reads the request, decides which module to call, and returns the result. If the router has an if that touches business data, that if belongs in a module.

  3. Every module has a test that runs in under a second

    A unit test that pins the contract on a tiny example. The test is the contract, written down, that the agent reads first.

Why three rules and not ten? Because the agent is not a teammate. It is a literal pattern matcher with no author intent. It reads what is there and produces more of what is there. If "what is there" is one canonical module per concept with a contract test next to it, the next thing it produces is also one canonical module with a contract test next to it. The discipline is the input distribution the agent extends, not a style preference.

This is where ai context engineering stops being a slogan and starts being a daily habit. Three consequences:

  • The agent's read budget collapses. The relevant slice for any task is one module + its test + the router line that calls it. Not five sibling files, not a "utils" graveyard.
  • Refactors stop fanning out. Each module owns its contract and the test pins it. Change the module, the test fails, you fix it. No silent ripple through four other files.
  • The router becomes legible in one sitting. Onboarding a new collaborator — or the agent on day one — is reading one file, top to bottom.

It took me 2 weeks to get the codebase into this shape. The third week was when the speed showed up.

Knowledge base outside the repo

Here is the part nobody told me. The codebase refactor is not enough. You also need a parallel knowledge base, and it has to live OUTSIDE the source tree.

The reason is mechanical. If you put architectural notes inside src/, the default file glob pulls them into every task, even tasks where they are noise. You burn tokens on prose the agent will not act on. Worse: the prose drifts. You change a function and forget to update the comment two folders away. Now the agent reads two contradictory truths and additively blends them.

The fix I landed on:

HANDOFF.md

One file at the repo root. The current state of the project in under thirty lines. What is shipped. What is in flight. Where the work resumes.

FILE_MAP

A list of every file in the repo with a one-line "this is what it does". The agent reads this BEFORE it reads any file. The map tells it which slice to pull.

AGENT-CONTEXT-MAP

Per-task-class context budgets. "If the task is adding an event type, load these five files. If the task is changing the signing logic, load these three." Externalized scoped-loading instructions, kept under thirty lines per class.

EMQ

Event Match Quality. Meta's 1-to-10 score for how well a server event maps to a known person. Higher input completeness, higher score. The dial in Events Manager is the visible read of it.

CAPI

Conversions API. Meta's server-side event ingestion endpoint. The thing the param-builder SDK helps you call cleanly.

Scoped loading

The practice of loading only the file slices an agent needs for a specific task. Not "the whole repo". Not "everything in src/". Exactly the slice the AGENT-CONTEXT-MAP tells you to load.

These docs do not live inside src/. They live at the repo root or in a docs/ peer folder, and they are not in the default glob. The agent loads them by name, when a task class warrants it.

The drift problem solves itself once you choose where to test:

Test the doc

The doc churns every time the code shape moves; the code never gets pinned.

Test the code

The code is pinned by a one-second test; the doc is a navigation aid, not a contract.

Test the code. Always. The doc is for finding things. The test is for pinning things. If you test the doc, you create a second source of truth that drifts the moment you ship anything.

Scoped loading beats context-window obsession

Everyone gets this wrong, including me for the first month. The live 1M-token window is sitting right there. Anthropic ships it. The Anthropic SDK Python makes calling it trivial. So why am I still going on about scoped loading?

Because a bigger window is not a smaller problem. It is a more expensive copy of the same problem. Three reasons.

  1. Cost. Tokens are not free. A million-token read on every task is a million tokens of latency and a million tokens of bill, on tasks where four hundred tokens of relevant slice would do.
  2. Signal-to-noise. Give the agent ten copies of a concept and it produces an additive blend of ten copies. Give it one canonical module and it produces a sibling of the one canonical module. Bigger windows pull more noise into the prompt; they do not improve discrimination.
  3. The pathology you are papering over. If you "need" a million-token window for a routine task, your codebase is the problem. The window treats the symptom. Scoped loading treats the cause.

The principle that lasted, after three weeks of beating my showlder against the wall: organize the codebase so the relevant slice for any task is small, then load only that slice. The window size is irrelevant once the slice is right. This is the durable definition of ai context engineering — slice discipline, not window size.

Luckuly for me, I learned this on a 12,000-line repo and not a 200,000-line one. The cost of doing it later would have been an order of magnitude higher.

What I would do differently

Three things, in order of cost-saved:

  1. Start the FILE_MAP on day one

    Not week three. The map is cheap when the repo has eight files and impossible when it has eight hundred. I did it backwards.

  2. Write the contract test before the agent writes the module

    The test is the input the agent extends. If you write the module first and the test after, the agent has already learned the wrong shape. Reverse the order.

  3. Refuse the second copy

    Every time the agent proposes a "helper" that looks like an existing module, you are at a fork. The cheap path accepts the helper. The expensive path deletes it and routes through the existing module. The cheap path becomes week three of nothing-shipping. Take the expensive path on day one.

The sentence I would tape over my monitor for the next project: the codebase is the prompt. Whatever the codebase looks like is what the agent extends. There is no second prompt that fixes a broken first one. There is only the shape of the repo and the slice you load from it.

That is ai context engineering, plainly. Three weeks of it cost me features I did not ship. Skipping it would have cost a year.

References

Vadim Sharapov is the founder of Loomaru — revenue recovery infrastructure for Shopify stores. If your ad platforms can't see 5–15% of your conversions, loomaru.com.

Want to know what your store's gap looks like, and what closing it would do to monthly revenue?