AI-assisted development breaks down after the first session. Context resets, lost decisions, and confidently wrong completions compound into real costs. The Stack Overflow 2025 survey found that 66% of developers are frustrated by AI solutions that are "almost right but not quite," and 45% say debugging AI-generated code takes longer than writing it themselves. After hundreds of sessions building an AI system as a non-programmer, I found that the fix isn't better prompting. It's building structured infrastructure around the AI: separated roles, layered memory, independent verification, and explicit handoff protocols.

This is Part 1 of the Building the Builder series. Part 2 covers how 9 documented failures turned into an autonomous pipeline. Part 3 covers what happened when I stopped celebrating speed and started measuring quality.

Everyone's talking about AI writing code. Very few people talk about what happens after the first session.

Here's what happened to me: I'd start a coding session with Claude Code, make great progress, hit context limits, start a new session, and spend the first 20 minutes re-explaining what we'd already decided. Then the AI would suggest undoing something we'd deliberately built two sessions ago. Then I'd copy-paste context between windows, lose track of what was implemented versus what was still planned, and at some point realize the AI had told me "done" when it wasn't.

I'm not a developer by background. 20 years of business development, partnerships, and go-to-market in Data & AI. Cognitive science by education. I started building intentic, an AI-powered GTM analysis platform, in September 2025 using AI as my development partner. And the system I'm about to describe didn't exist on day one. It grew out of frustration, broken deploys, and the realization that AI-assisted development doesn't scale by default. You have to build the scaffolding yourself.

I want to be clear upfront: I'm sure there are far more sophisticated setups out there. People with engineering backgrounds who've solved problems I haven't even encountered yet. This isn't a "best practices" guide. It's a snapshot of what one non-programmer built to keep AI useful across hundreds of sessions. If you've built something better, I genuinely want to hear about it.

The problem nobody warns you about

The pitch for AI-assisted development is compelling: describe what you want, the AI writes the code. And for isolated tasks, that's true. But building a real system means hundreds of connected decisions across weeks and months. Which architecture pattern did we choose and why? What did we explicitly decide NOT to do? Where are the known trade-offs we accepted?

None of that survives a context reset.

GitClear's analysis of 211 million lines of code found that code churn, the percentage of code reverted or substantially revised within two weeks of being written, nearly doubled from 3.1% in 2020 to 5.7% in 2024. Copy-pasted code now exceeds refactored code for the first time. More code gets written, but more of it gets thrown away.

My early experience was a loop: build something, lose context, rebuild understanding, discover the AI had confidently "finished" something that was actually half-done, fix it, lose context again.

The most expensive bugs didn't come from bad code. They came from lost decisions.

The turning point wasn't better prompting. It was accepting that I needed to build infrastructure around the AI, not just use it.

Two brains, one kanban board

The core of what I built is a separation between thinking and doing.

Claude Desktop handles strategy: planning features, writing specs, researching approaches. It doesn't touch the codebase directly. When it needs technical context, Claude Code analyzes the code and feeds the relevant information back (via commands like /enrich-briefing). Claude Code handles implementation: writing code, running tests, deploying. Vibe Kanban sits between them as the shared state.

The two AI instances never talk to each other directly. They communicate through structured artifacts: tasks that flow from [BRIEFING] (here's the idea) to [SOLUTION] (here's the spec) to [IMPL] (here's the implementation). Each transition has a defined handoff with context, constraints, and acceptance criteria.
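
The briefing-to-implementation flow behaves like a small state machine. Here's a minimal Python sketch of that idea; the stage names mirror the [BRIEFING] → [SOLUTION] → [IMPL] flow above, but the field names (context, constraints, acceptance_criteria) are illustrative, not Vibe Kanban's actual schema:

```python
# Sketch of the task handoff protocol: a task can only advance one stage
# at a time, and every transition must carry a complete handoff artifact.
ALLOWED_TRANSITIONS = {
    "BRIEFING": "SOLUTION",
    "SOLUTION": "IMPL",
}

REQUIRED_HANDOFF_FIELDS = {"context", "constraints", "acceptance_criteria"}

def advance(task: dict, handoff: dict) -> dict:
    """Move a task to its next stage, rejecting incomplete handoffs."""
    next_stage = ALLOWED_TRANSITIONS.get(task["stage"])
    if next_stage is None:
        raise ValueError(f"no transition defined from {task['stage']}")
    missing = REQUIRED_HANDOFF_FIELDS - handoff.keys()
    if missing:
        raise ValueError(f"handoff missing fields: {sorted(missing)}")
    return {**task, "stage": next_stage, "handoff": handoff}
```

The useful property isn't the code itself; it's that a transition without context, constraints, and acceptance criteria simply can't happen.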

This separation matters because strategic thinking and implementation need fundamentally different things. When I'm exploring whether to restructure the data pipeline, I need broad context, research, trade-off analysis. When the AI is implementing a specific schema change, it needs narrow focus, exact file paths, and clear acceptance criteria. Mixing both in one session is how you get an AI that starts philosophizing about architecture when it should be writing a migration.

In practice, a feature starts with me describing what I want to build in Claude Desktop. If technical context is needed, Claude Code runs an analysis of the relevant codebase area and feeds it back. Then /gap-analysis shows what's missing. /create-plan turns that into a structured plan. /task-breakdown creates the actual tasks in Vibe Kanban with acceptance criteria. Claude Code picks up those tasks and implements them.

Is this over-engineered for some tasks? Probably. For a quick bug fix, I skip most of this. But for anything that touches multiple parts of the system, this flow is what keeps things from going sideways.

The 2024 DORA report found that a 25% increase in AI adoption correlated with a 7.2% decrease in delivery stability. Speed without structure creates instability. The separation between strategic thinking and implementation is my attempt to avoid that trap.

Solving the "confident lie"

Here's something that surprised me:

AI doesn't just make mistakes. It makes mistakes while telling you everything is done.

The Stack Overflow 2025 survey backs this up: only 33% of developers trust AI accuracy, down from 40% in prior years. And Faros AI's research found that AI-driven development increased bug rates by 9% while code review time grew 91% as PR volume outpaced reviewer capacity.

I lost count of how many times Claude Code would report a task as complete, and I'd discover hours later that edge cases were unhandled, tests were missing, or the implementation subtly diverged from the spec. Not because the AI was bad at coding. Because it had no incentive or mechanism to doubt its own work.

My solution was blunt: don't let the same AI verify its own work.

Implementation tasks now run through two agents. An executor (using a cost-effective model) implements the task autonomously. Then an independent reviewer (using a stronger reasoning model) checks the work against the acceptance criteria. The reviewer doesn't see the executor's self-assessment. It runs its own verification: type checks, lint, test execution, actual inspection of whether the acceptance criteria are met.
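
The reviewer's verification step can be sketched as a harness that re-runs every check from scratch. The commands below are placeholders for whatever toolchain a project actually uses; the point is that the verdict comes from fresh evidence, never from the executor's self-report:

```python
import subprocess

def independent_review(checks, acceptance_criteria):
    """Re-run every check from scratch and collect its output as evidence.
    The executor's self-assessment is deliberately not an input here."""
    results = {}
    for name, cmd in checks:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = {
            "passed": proc.returncode == 0,
            # Keep the tail of the output as verifiable evidence.
            "evidence": (proc.stdout + proc.stderr)[-2000:],
        }
    return {
        "all_checks_passed": all(r["passed"] for r in results.values()),
        "checks": results,
        # ACs the reviewer must still inspect directly against the diff.
        "acceptance_criteria": acceptance_criteria,
    }

# Example wiring (placeholder commands, not my actual setup):
CHECKS = [
    ("types", ["npx", "tsc", "--noEmit"]),
    ("lint",  ["npx", "eslint", "."]),
    ("tests", ["npm", "test"]),
]
```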

On top of that, I built hooks that physically prevent a task from being marked "done" without evidence. The done-guard hook checks for a verification file that must contain actual test output, not just a timestamp. If the file is empty or missing, the status update gets rejected.
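
A minimal sketch of what such a done-guard can look like, assuming the verification-file convention described above. The evidence markers are illustrative; the real hook runs inside Claude Code's hook system:

```python
from pathlib import Path

# Strings that count as real test evidence; a bare timestamp doesn't qualify.
EVIDENCE_MARKERS = ("passed", "PASS", "ok")

def done_guard(verification_file: Path) -> tuple[bool, str]:
    """Allow a 'done' status only if the verification file exists and
    contains actual test output, not just metadata."""
    if not verification_file.exists():
        return False, "rejected: no verification file"
    content = verification_file.read_text().strip()
    if not content:
        return False, "rejected: verification file is empty"
    if not any(marker in content for marker in EVIDENCE_MARKERS):
        return False, "rejected: no test output found, only metadata"
    return True, "ok: evidence present"
```

The guard is dumb on purpose. It can't judge whether the tests are good, only whether evidence exists at all, and that alone blocks most confident lies.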

Is this bulletproof? No. The reviewer can miss things too. But the rate of "confident lies" making it to my review dropped dramatically. And the acceptance criteria in each task became much more precise over time, because I learned that vague ACs produce vague verification.

Context as infrastructure

The deeper lesson behind all of this:

The real challenge isn't that AI forgets. It's making sure the right knowledge is available at the right time.

Google's 2025 DORA report found that 90% of engineering teams now use AI, but the key insight was that AI amplifies what's already there. Strong systems get stronger, struggling systems get worse. The difference isn't the tool. It's the system around it.

I ended up with what amounts to a layered memory system, though I didn't plan it that way. It grew organically from specific problems.

Procedural memory is the foundation: 11 rule files that load with every Claude Code session. NEVER/ALWAYS constraints, workflow rules, architecture principles. Every single one exists because something went wrong without it. The constraint "NEVER set a task to done without explicit AC verification" is there because I spent a weekend debugging something the AI had declared finished.

Episodic memory lives in Vibe Kanban: tasks with timestamps, status transitions, descriptions of what was decided and why. When a new session starts, Claude Code can pull the current task and immediately has context for what it's working on.

Semantic memory uses Mem0 for cross-session learnings: "Exa fails for cross-industry searches," "group tasks by file when multiple ACs affect the same file." Things that aren't rules but patterns that improve future work. The critical lesson here: search before adding. Without duplicate detection, semantic memory fragments into hundreds of near-identical entries that dilute retrieval quality.

Session management handles the tactical level: what happens when context fills up mid-task. Save current state, clear the session, reload. Not glamorous, but it prevents the slow degradation where AI responses get worse as context fills with irrelevant earlier conversation.

| Layer | What | Where | Example |
| --- | --- | --- | --- |
| Procedural | Rules and constraints | 11 rule files, loaded every session | "NEVER set task to done without AC verification" |
| Episodic | Decisions and history | Vibe Kanban tasks with timestamps | "We chose BullMQ over Kafka because..." |
| Semantic | Cross-session patterns | Mem0 with search-before-add | "Exa fails for cross-industry searches" |
| Session | Active working state | Context management per session | Save state, clear, reload on overflow |

Beyond these memory layers, there are 19 slash commands that automate recurring workflows, 8 hooks that enforce guardrails, and a growing library of skills that provide domain-specific knowledge on demand. Skills use progressive disclosure: only metadata loads at startup (around 50 tokens per skill), full content loads when needed. This alone reduced context overhead by over 90%.
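
Progressive disclosure can be sketched as an index that exposes only metadata until a skill is actually requested. The structure and names here are illustrative, not the actual skill format:

```python
class SkillLibrary:
    """Load only skill descriptions at startup; fetch full content on demand.
    This keeps per-skill startup cost to a short metadata line."""

    def __init__(self, skills: dict[str, dict]):
        # skills: name -> {"description": short summary, "content": full body}
        self._store = skills
        # The startup context holds metadata only.
        self.index = {name: s["description"] for name, s in skills.items()}

    def startup_context(self) -> str:
        """What every session sees: one short line per skill."""
        return "\n".join(f"{name}: {desc}" for name, desc in self.index.items())

    def load(self, name: str) -> str:
        """Pull the full skill body only when the agent actually needs it."""
        return self._store[name]["content"]
```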

And then there's what I call context infrastructure at the code level: pipeline contracts that define what each component produces and consumes, schema validation that catches broken interfaces at startup, data flow documentation that's enforced in CI. (I wrote about the infrastructure decisions behind this, including data sovereignty trade-offs, in Data Sovereignty as a Solo Founder.) All of this is context that the AI can access when it needs to understand how a change in one place affects the rest of the system. Without it, every session would start with the AI re-discovering these dependencies the hard way.
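
A contract check of this kind can be sketched as a startup validation pass over the pipeline. The `Contract` shape below is a simplified stand-in for the real schemas; it only tracks field names, where a real version would also carry types:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Contract:
    """What a pipeline component consumes and produces (names only)."""
    name: str
    consumes: frozenset
    produces: frozenset

def validate_pipeline(stages: list[Contract]) -> list[str]:
    """Check at startup that every stage's inputs are produced upstream.
    Returns a list of broken interfaces; empty means the wiring is sound."""
    available: set = set()
    errors = []
    for stage in stages:
        missing = stage.consumes - available
        if missing:
            errors.append(
                f"{stage.name} expects {sorted(missing)} which nothing upstream produces"
            )
        available |= stage.produces
    return errors
```

Failing at startup, before any data moves, is the whole point: a broken interface surfaces as one clear error instead of a mysterious downstream bug.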

Where this is going

I want to be honest: I'm not fully utilizing what's possible here. Vibe Kanban has capabilities I haven't tapped yet. The handoff between Desktop and Code still requires me to manually trigger transitions. The review step, while better than nothing, is still a workaround for a deeper problem.

The vision I'm working toward: start with an idea in Claude Desktop, and everything flows automatically through analysis, planning, task creation, implementation on a branch. I test, I review, I approve. True human-in-the-loop where the human sets direction and gives final sign-off, but doesn't manually orchestrate every step.

We're still early. Cognition's Devin resolved 13.86% of real-world GitHub issues autonomously on SWE-bench, up from a previous best of 1.96%. Claude 3.5 Sonnet reached 49% on SWE-bench Verified using an agentic scaffold. The trajectory is clear, but even the best agents need structured environments to be reliable.

Projects like HumanLayer (and their newer CodeLayer IDE built on Claude Code) are a genuine inspiration here. They're tackling the same fundamental question from a more engineered angle: how do you give AI agents enough autonomy to be useful while keeping humans in control of what matters? Their approach to approval workflows and agent orchestration is closer to where I want to end up than where I am today.

This article is the overview. Part 2 goes deeper on what happened when the pipeline started running autonomously and broke nine times. Part 3 covers the pivot from speed to quality, and the rebuild into a 9-phase, 10-agent system.

The thing nobody talks about

When you build with AI, you're building two things simultaneously: the product, and the system that keeps the AI productive. The second one is at least as much work as the first. And almost nobody talks about it.

Every rule file, every hook, every slash command represents a problem I hit and solved. The system works for me, today, at the scale I'm operating at. It will probably look different in three months as tools improve and I learn more.

So here's my actual question: how do you handle this? If you're building with AI coding agents, what does your workflow look like? What have you solved that I'm still struggling with? What tools or patterns have made the biggest difference?

I'm genuinely here to learn. DMs open, comments welcome.

Frequently Asked Questions

Do you need to be a developer to use AI coding tools effectively?

No, but you need compensating systems. The Stack Overflow 2025 survey shows 84% of developers use AI tools, yet only 33% trust the output for accuracy. Even experienced developers need verification workflows. As a non-developer, I rely more heavily on structured acceptance criteria, automated tests, and independent review agents to catch what I can't spot myself.

How do you prevent AI coding agents from undoing previous decisions?

Layered context. Procedural memory (rule files loaded every session) encodes hard constraints. Episodic memory (task history with rationale) preserves decision context. Semantic memory (Mem0 with search-before-add) captures cross-session patterns. No single layer is sufficient. The combination ensures that critical decisions survive context resets between sessions.

Is the two-agent verification pattern worth the extra cost?

Yes. GitClear's analysis of 211 million lines found that AI-assisted code churn nearly doubled between 2020 and 2024, and Faros AI measured a 9% increase in bug rates with AI adoption. An independent reviewer catches errors that self-verification misses. The marginal cost of a second model call is far less than the debugging time saved from catching confident but incorrect completions.

What's the biggest mistake people make with AI-assisted development?

Treating it as a tool problem instead of a systems problem. Google's 2025 DORA report found that AI amplifies existing team dynamics: strong systems get stronger, weak systems get worse. Most people optimize prompts when they should be building infrastructure: memory layers, verification gates, structured handoffs, and explicit constraints that persist across sessions.


Pedram Shahlaifar comes from B2B go-to-market, not engineering. He's building intentic as a learning project, using AI to build an AI system, and writes about the trade-offs along the way. Connect on LinkedIn.

Sources

  1. Stack Overflow, 2025 Developer Survey (n=49,000, May-June 2025)
  2. GitClear, AI Copilot Code Quality: 2025 Research (211M lines, 2020-2024)
  3. Google DORA, 2025 State of AI-assisted Software Development Report
  4. Google DORA, 2024 Accelerate State of DevOps Report
  5. Faros AI, Rework Rate: 5th DORA Metric analysis (2025)
  6. Cognition AI, Devin SWE-bench Technical Report (2024)