Every rule in my autonomous AI development pipeline exists because something specific broke. After 9 documented failure versions, the system delivers full-stack features in 25 minutes with parallel AI agents, quality gates, and mechanical enforcement. Deloitte predicts that 25% of companies using generative AI will launch agentic AI pilots in 2025, growing to 50% by 2027. Most of those pilots will hit exactly the problems I've already solved the hard way.

In Part 1 of this series, I ended with a vision: start with an idea in Claude Desktop, and everything flows automatically through analysis, planning, task creation, and implementation on a branch. I test, I review, I approve.

Eight days later, it was running. Not because I designed it well. Because it broke nine times, and each break taught me something I couldn't have figured out in advance.

I call it the Vibe Machine. It's an autonomous development pipeline built on Vibe Kanban, Claude Code, and a collection of agent definitions, rules, and hooks that grew entirely out of real failures. 774 lines of PLANNER orchestration. 15 rule files. 11 hooks. 19 slash commands. 3 agent definitions. None of it planned on day one. All of it earned.

You can't design an autonomous system. You debug it into existence.

The instinct is to draw an architecture diagram and implement top-down. That doesn't work, not because any particular architecture is wrong, but because you can't predict how AI agents will misbehave until they actually do.

The 2025 Stack Overflow survey found that while 84% of developers use AI tools, only 33% trust their accuracy. That trust gap isn't about model quality. It's about the absence of systems that catch mistakes before they compound. Qodo's State of AI Code Quality research puts a finer point on it: 65% of developers cite missing context as the primary cause of poor AI-generated code, more than hallucinations.

What worked instead of upfront design: start small, run it, watch it break, fix the specific thing that broke, run it again. Nine times. Each version has a name in my documentation, a specific failure pattern, and a specific fix. The fixes compound. Version 3 broke because of build artifacts in commits. Version 9 broke because workers forgot to update their status. The distance between those two problems tells you how much the system matured.

You can't pre-design reliability into an autonomous system. You iterate it into existence.

Failure 1: The PLANNER that wouldn't delegate

The PLANNER's job is straightforward: analyze the codebase, cut tasks, start workers, monitor progress, merge results. It should never write a single line of code.

In version 4, it did exactly that. I told it "NEVER implement yourself" in the agent definition. It ignored the instruction and started writing code directly. Not maliciously. It saw a small change and thought it could handle it faster than spawning a worker.

The fix for v4 was a stronger text constraint. That held for about a week. In version 6, the same pattern returned. The Anthropic Economic Index found that 79% of Claude Code conversations involve some form of automation. When an AI agent is built to act autonomously, text-based constraints are suggestions it can override whenever its own reasoning says otherwise.

The real fix: I removed the Write and Edit tools from the PLANNER entirely. At the tool level, not the prompt level. The PLANNER physically cannot create or modify files. It can only read, analyze, and orchestrate.
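For illustration, Claude Code subagents are defined in files with YAML frontmatter, and the `tools` field controls what the agent can physically invoke. The sketch below shows the idea; the file name and tool list are my own illustrative guesses, not the actual PLANNER definition:

```yaml
# .claude/agents/planner.md frontmatter (illustrative sketch, not the real file)
name: planner
description: Orchestrates tasks. Must never write code itself.
# Write and Edit are simply absent from the list, so the agent cannot
# create or modify files regardless of what its prompt decides.
tools: Read, Grep, Glob, Bash, Task
```

The point is that the constraint lives in the tool grant, not in the prompt text.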

Prompts are wishes. Tool restrictions are laws. This became the foundational principle for everything that followed.

Failure 2: The step nobody cares about

Version 9 was the most frustrating failure because it was the most boring. Workers would implement a feature correctly, pass all tests, write clean code. Then they'd skip the final step: updating the task status in Vibe Kanban.

The PLANNER polls for status changes every 60 seconds. If a worker never updates, the PLANNER waits forever. Pipeline stalled. No error, no crash. Just silence.

According to GitHub's Octoverse report, monthly code pushes crossed 82 million in 2025, with roughly 41% being AI-assisted. That's an enormous volume of automated work where the boring operational steps (status updates, CI triggers, merge protocols) become the actual failure points. The bottleneck has moved from writing code to everything around it.

My fix required three layers. First, the FINAL STEP block in the worker's agent definition became the most prominent instruction in the entire file. Second, the profile's append_prompt explicitly says: "Set 'In review', NOT 'Done'. Run tsc/lint before status update." Third, 3-tier liveness detection: the PLANNER checks task status, workspace state, and git activity. No commits in 30 minutes means the worker is dead, and the PLANNER retries with a fresh instance.
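The liveness rule can be sketched as a small pure function. The 30-minute threshold follows the text; the type names, statuses, and signature are my own illustration, not the pipeline's actual code:

```typescript
// Sketch of the 3-tier liveness check: task status, workspace state, git activity.
// All names are illustrative; the threshold follows the text (30 min of git silence = dead).

interface WorkerSignals {
  taskStatus: "todo" | "in-progress" | "in-review" | "done";
  workspaceExists: boolean;  // tier 2: workspace state
  lastCommitAt: Date | null; // tier 3: git activity
}

const DEAD_AFTER_MS = 30 * 60 * 1000;

function isWorkerDead(signals: WorkerSignals, now: Date): boolean {
  // Tier 1: a finished or reviewing task is not dead, just waiting on the PLANNER.
  if (signals.taskStatus === "in-review" || signals.taskStatus === "done") return false;
  // Tier 2: a missing workspace means the worker never started or crashed hard.
  if (!signals.workspaceExists) return true;
  // Tier 3: no commits at all, or none within the window, means the worker stalled.
  if (signals.lastCommitAt === null) return true;
  return now.getTime() - signals.lastCommitAt.getTime() > DEAD_AFTER_MS;
}
```

A dead verdict triggers the retry with a fresh worker instance.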

The boring operational step is the single most critical point in the entire pipeline. Every autonomous system has one of these.

Failure 3: Quality gates born from trust erosion

I didn't plan to build a multi-gate quality system. I planned to trust the output and verify manually in the morning. That stopped working fast.

The numbers explain why. Faros AI's research found that AI-driven development increased bug rates by 9%, while code review time grew 91% as PR volume outpaced reviewer capacity. CodeRabbit reports that over 40% of AI-generated code still contains security flaws. When you're running an autonomous pipeline overnight, that's not an acceptable error rate.

My response was incremental. Each gate is the scar of a specific incident where something got through that shouldn't have.

Gate 1: Concept Review. Before the PLANNER even starts cutting tasks, a different model entirely reviews the solution specification. Gemini 3.1 Pro via OpenRouter checks for gaps, contradictions, and scope risks. Two-round consensus: feedback in round one, final verdict in round two. This catches the problems that happen before any code is written: vague acceptance criteria, contradictory constraints, scope that's too wide for atomic task-cutting.

Gate 2: Implementation Check. After the PLANNER cuts tasks but before workers start, GPT-5.3 Codex validates the plan. Does any task overlap another task's file scope? Are dependencies between tasks respected? Do the proposed file paths actually exist in the codebase? This gate exists because I once had two workers editing the same file simultaneously, producing a merge conflict that took longer to resolve than the feature itself.
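The file-overlap part of this gate is mechanical enough to sketch. The function below is my own hypothetical version of a disjoint-file-scope check, not the actual validation prompt:

```typescript
// Hypothetical check that no two planned tasks claim the same file.
// The task shape is an assumption for illustration.

interface PlannedTask {
  id: string;
  files: string[]; // the file scope this task is allowed to touch
}

function findScopeOverlaps(tasks: PlannedTask[]): string[] {
  const owner = new Map<string, string>(); // file -> first task that claimed it
  const conflicts: string[] = [];
  for (const task of tasks) {
    for (const file of task.files) {
      const existing = owner.get(file);
      if (existing !== undefined && existing !== task.id) {
        conflicts.push(`${file}: claimed by ${existing} and ${task.id}`);
      } else {
        owner.set(file, task.id);
      }
    }
  }
  return conflicts;
}
```

An empty result means the workers can safely run in parallel; any conflict means the plan goes back to the PLANNER before a merge conflict can happen.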

Gate 3: Independent Review. Workers can't mark their own work as done. They set their task to "In review," and the PLANNER spawns an independent Reviewer (Opus model, 309 lines of agent definition) on the worker's branch. The Reviewer runs its own tsc and lint checks. It doesn't trust the worker's self-assessment. It verifies each acceptance criterion independently and inspects the actual git diff.
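The "workers can't close their own work" rule amounts to a small transition table. A hedged sketch, with role names and statuses modeled on the flow described here rather than taken from the real system:

```typescript
// Illustrative status-flow enforcement: only a reviewer may move a task to "done".

type Status = "todo" | "in-progress" | "in-review" | "done";
type Role = "worker" | "reviewer" | "planner";

const ALLOWED: Record<Role, Array<[Status, Status]>> = {
  worker:   [["todo", "in-progress"], ["in-progress", "in-review"]],
  reviewer: [["in-review", "done"], ["in-review", "in-progress"]], // reject sends it back
  planner:  [["todo", "in-progress"]],
};

function canTransition(role: Role, from: Status, to: Status): boolean {
  return ALLOWED[role].some(([f, t]) => f === from && t === to);
}
```

Encoded this way, a worker declaring its own task "done" is not a disobeyed instruction but an impossible transition.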

| Gate | When | What | Model | Why it exists |
| --- | --- | --- | --- | --- |
| Gate 1: Concept Review | Before task-cutting | Spec gaps, contradictions, scope risks | Gemini 3.1 Pro | Vague specs produced vague implementations |
| Gate 2: Impl-Check | After task-cutting, before workers | File overlaps, dependencies, path existence | GPT-5.3 Codex | Two workers edited the same file |
| Gate 3: Independent Review | After worker completion | tsc/lint, AC verification, git diff | Opus | Workers reported "done" when they weren't |

Three gates, three different models, three different concerns. Jellyfish analyzed 1,000 code reviews across 400 companies and found that 36% of developer interactions with AI review agents were positive. My experience matches: once the review agents started catching real issues, I stopped resenting the extra step.

The key insight: each gate targets a different failure mode. Gate 1 catches bad specs. Gate 2 catches bad plans. Gate 3 catches bad implementations. No single review can cover all three because they happen at different stages with different information available.

What I can do now versus three months ago

Three months ago, my workflow looked like this: describe a feature in Claude Desktop. Copy context to Claude Code. Implement piece by piece. Lose track of what's done. Discover the AI declared something "finished" that wasn't. Re-explain decisions from two sessions ago. Spend weekends debugging.

Google's 2025 DORA report found that AI amplifies what's already there: strong systems get stronger, struggling systems get worse. The infrastructure I built in those frustrating months turned out to be exactly what the autonomous pipeline needed to amplify.

Now: I write a solution specification in Desktop. I say "Go." The PLANNER starts autonomously. It reads the spec, analyzes the codebase, runs the concept review, cuts atomic sub-tasks with disjoint file scopes, starts up to two parallel workers, polls their progress, spawns reviewers when workers finish, merges everything onto a feature branch, runs CI, and deploys to staging. I review in the morning, test on staging, and merge to production.

| Dimension | Before (3 months ago) | Now (Vibe Machine) |
| --- | --- | --- |
| Feature delivery | Manual, multi-session, days | 25 minutes, autonomous |
| Context management | Copy-paste between windows | 15 rules + 19 commands loaded per session |
| Quality assurance | Manual morning review | 3 quality gates + independent reviewer |
| Task tracking | Spreadsheet or memory | Vibe Kanban with structured handoffs |
| Verification | Trust the AI's self-report | Mechanical enforcement, tool-level restrictions |
| Failure detection | Discover problems hours later | 3-tier liveness monitoring, 60s polling |

The most recent run built an inline feedback UI: database migration, REST API endpoints, Zustand store, React components, and five report-section integrations. Two parallel workers, two independent reviewers, CI green on first attempt. 25 minutes end-to-end.

That's not magic. It's 15 rule files, 11 hooks, and 9 failure versions of accumulated knowledge about how AI agents actually behave when you let them run autonomously.

Three takeaways for builders

You can't pre-design reliability into an autonomous system. Every rule file, every hook, every quality gate in my system exists because something specific broke without it. Start with the simplest possible version, run it, and let the failures tell you what to build next. The system self-improves in a very literal sense: each run surfaces edge cases that become new constraints.

Prompts are wishes, mechanical enforcement is law. When a behavior is critical to pipeline integrity, text constraints in agent definitions are not enough. Remove the tools, enforce the status flow, build hooks that physically prevent the wrong outcome. This is consistent with what Google's DORA research found: speed without structure creates instability.

Multi-model review isn't overhead, it's infrastructure. Three different models catching three different failure modes at three different pipeline stages costs fractions of a cent per run. The cost of a single undetected bug cascading through an autonomous overnight pipeline is always higher than the cost of reviewing.

What's still not perfect

The system isn't bulletproof. Reviewers sometimes forget to update the task status from "In review" to "Done," which is the exact problem I already solved for workers. The 5-layer enforcement strategy works, but it's not at 100% yet. Each run makes it more robust.

Gate 3 (the independent worker review) functions through the Reviewer agent, but the formal scope-violation detection via git diff against a file whitelist is still a documented concept, not an automated check. The Reviewer catches most issues, but the structural enforcement isn't fully there yet.
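For what that automated check could look like: given the changed files on a worker's branch (e.g. from `git diff --name-only`) and the task's file whitelist, the comparison is a few lines. This is a sketch of the documented concept, not something running in my pipeline yet:

```typescript
// Hypothetical scope-violation check: compare a branch's changed files
// against the task's whitelist. The changed-file list would come from
// something like `git diff --name-only main...worker-branch`.

function scopeViolations(changedFiles: string[], whitelist: string[]): string[] {
  const allowed = new Set(whitelist);
  return changedFiles.filter((file) => !allowed.has(file));
}
```

Any non-empty result would fail the review mechanically, instead of relying on the Reviewer to notice the out-of-scope edit.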

Where this is going

The immediate next step is closing the remaining reliability gaps. The system genuinely self-improves: every failure pattern becomes a new rule, a new hook, or a new gate. Nine documented versions so far. There will be more.

After that: removing the last manual trigger. Right now, I define features in Desktop and set them to "To do" in Vibe Kanban, but I still manually say "Go" to start the pipeline. The next evolution is a pipeline that pulls the next feature from the backlog automatically. I define direction, I prioritize, I review results. The machine handles everything in between.

That's the actual end state of human-in-the-loop development. Not a human orchestrating every step. A human setting direction, a machine executing, and a human verifying the result.

Part 3 continues this story. Spoiler: 25 minutes per feature sounds great until you measure what actually comes out.

Frequently Asked Questions

Can you build an autonomous AI pipeline without being a developer?

Yes, but it takes longer and you'll rely on the AI to build the scaffolding. The 15 rule files and 3 agent definitions in my system were all built iteratively with AI assistance. Start with the simplest version, run it, fix what breaks.

How do you prevent AI agents from going off-script in an autonomous pipeline?

Mechanical enforcement. The PLANNER has Write and Edit tools removed entirely. Workers set "In review" instead of "Done." An independent Reviewer verifies before closing. Prompts get ignored under pressure. Tool restrictions don't.

Is multi-model review worth the cost for solo developers?

Yes. Three models (Gemini, Codex, Opus) catching different failure modes costs fractions of a cent per run. A single undetected bug in an overnight pipeline cascades before you catch it in the morning.

How long does it take to build a system like this?

Eight days from vision to working pipeline, but the foundation took months. The autonomous pipeline was the culmination, not the starting point. And it's still evolving.


Pedram Shahlaifar comes from B2B go-to-market, not engineering. He's building intentic as a learning project, using AI to build an AI system, and writes about the trade-offs along the way. Connect on LinkedIn.

Sources

  1. Deloitte. TMT Predictions 2025: Autonomous Generative AI Agents.
  2. Stack Overflow. 2025 Developer Survey (n=49,000).
  3. Qodo. State of AI Code Quality 2025.
  4. Anthropic. Economic Index: AI's Impact on Software Development (April 2025).
  5. GitHub. Octoverse 2025 Report.
  6. Faros AI. Rework Rate: 5th DORA Metric Analysis (2025).
  7. CodeRabbit. Agentic Code Validation (2025).
  8. Jellyfish. Impact of AI Code Review Agents (1,000 reviews, 400 companies, 2025).
  9. Google DORA. 2025 State of AI-assisted Software Development.
  10. Google DORA. 2024 Accelerate State of DevOps Report.