Planner-Worker Separation for Long-Running Agents

Cursor Team

Problem

Running multiple AI agents in parallel for complex, multi-week projects creates significant coordination challenges:

Flat structures lead to conflicts, duplicated work, and agents stepping on each other
Dynamic coordination through shared files with locking becomes a bottleneck - most agents spend time waiting rather than working
Equal-status agents become risk-averse, avoiding difficult tasks and making only small, safe changes instead of tackling end-to-end implementation
No agent takes ownership of hard problems or overall project direction

Solution

Separate agent roles into a hierarchical planner-worker structure:

Planners: Continuously explore the codebase and create tasks. They can spawn sub-planners for specific areas, making planning itself parallel and recursive.
Workers: Pick up tasks and focus entirely on completing them. They don't coordinate with other workers or worry about the big picture. They grind on their assigned task until done, then push changes.
Judge: At the end of each cycle, decides whether the goal has been achieved or another cycle should run.

This creates an iterative cycle where each iteration starts fresh, combating drift and tunnel vision.
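The cycle described above can be sketched as a small loop. This is a minimal illustration, not Cursor's implementation: the names run_cycles, Task, CycleResult, and the toy_* functions are all hypothetical, the agents are plain Python callables standing in for LLM-backed roles, and workers run sequentially here where a real system would run them in parallel.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    description: str
    done: bool = False


@dataclass
class CycleResult:
    cycles_run: int
    goal_reached: bool
    completed: List[str] = field(default_factory=list)


def run_cycles(
    plan: Callable[[List[str]], List[Task]],
    work: Callable[[Task], None],
    judge: Callable[[List[str]], bool],
    max_cycles: int = 10,
) -> CycleResult:
    """Run planner -> workers -> judge until the judge is satisfied."""
    completed: List[str] = []
    for cycle in range(1, max_cycles + 1):
        # Fresh start: the planner only sees the durable record of finished
        # work, not any context carried over from earlier cycles.
        tasks = plan(list(completed))
        for task in tasks:
            # Workers do not coordinate with each other; each handles one task.
            work(task)
            if task.done:
                completed.append(task.description)
        # The judge decides whether the goal is met or another cycle is needed.
        if judge(list(completed)):
            return CycleResult(cycle, True, completed)
    return CycleResult(max_cycles, False, completed)


# Toy stand-ins for the three roles (assumptions for illustration only).
BACKLOG = ["parse HTML", "build DOM", "layout", "paint"]


def toy_plan(completed: List[str]) -> List[Task]:
    remaining = [t for t in BACKLOG if t not in completed]
    return [Task(t) for t in remaining[:2]]  # hand out at most two tasks per cycle


def toy_work(task: Task) -> None:
    task.done = True


def toy_judge(completed: List[str]) -> bool:
    return set(completed) == set(BACKLOG)


result = run_cycles(toy_plan, toy_work, toy_judge)
print(result.cycles_run, result.goal_reached)
```

In this sketch the only state that survives between cycles is the list of completed task descriptions, which is one simple way to model a fresh start per iteration.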

Evidence

Evidence Grade: high (production-validated at scale)
Validated Findings: Cursor demonstrated hundreds of concurrent agents running for weeks on massive codebases (1M+ lines of code)
Academic Foundation: Decades of research in hierarchical RL (Feudal Networks, 2017; Options Framework, 1999) provide theoretical backing for planning-execution separation
Multi-Source Validation: Complementary implementations by Anthropic (initializer-maintainer), AMP (factory-over-assistant), and GitHub Agentic Workflows confirm pattern utility

How to use it

Use cases for planner-worker separation:

Massive codebases: Projects that would take human teams months (1M+ lines of code, 1000+ files)
Ambitious goals: Building complex systems from scratch (web browser, Windows emulator, Excel clone)
Large-scale migrations and implementations: In-place framework migrations (Solid to React) and large new implementations (Java LSP)
Performance optimization: Complete rewrites in different languages for speed (C++ to Rust)

Implementation considerations:

Model choice per role: Different models excel at different roles. Choose the model for each role on its own merits: a general-purpose model may plan better than a coding-specialized one, even when the latter is a good fit for workers.
Fresh starts: Each cycle should start fresh to combat drift and tunnel vision from long-running contexts.
Parallel planning: Planners can spawn sub-planners, making the planning process itself parallel and recursive.
Worker isolation: Workers should be task-focused and not worry about coordination with other workers.
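Two of the considerations above, per-role model choice and recursive sub-planning, can be sketched together. Everything here is an assumption for illustration: ROLE_MODELS uses placeholder model names, the codebase is represented as a hypothetical dictionary mapping an area to its sub-areas, and plan recurses in-process where a real system would spawn separate sub-planner agents.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical role-to-model mapping; the model names are placeholders.
ROLE_MODELS: Dict[str, str] = {
    "planner": "planning-model",
    "worker": "coding-model",
    "judge": "planning-model",
}


@dataclass(frozen=True)
class Task:
    area: str
    model: str  # model the worker picking up this task would use


def plan(area: str, tree: Dict[str, List[str]], max_depth: int = 3) -> List[Task]:
    """Recursively split an area into sub-areas and emit one task per leaf."""
    children = tree.get(area, [])
    if not children or max_depth == 0:
        # Small enough (or deep enough): hand the area to a worker as one task.
        return [Task(area, ROLE_MODELS["worker"])]
    tasks: List[Task] = []
    for child in children:
        # Each child stands in for a sub-planner that owns one area.
        tasks.extend(plan(child, tree, max_depth - 1))
    return tasks


# Hypothetical area breakdown for a browser project.
CODEBASE = {
    "browser": ["networking", "rendering"],
    "rendering": ["layout", "paint"],
}

tasks = plan("browser", CODEBASE)
print([t.area for t in tasks])  # ['networking', 'layout', 'paint']
```

The max_depth parameter is a guard added for this sketch so that recursive planning cannot fan out without bound.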

Prompting is critical: Getting agents to coordinate well, avoid pathological behaviors, and maintain focus over long periods requires extensive experimentation with prompts.

Trade-offs

Pros:

Scalability: Hundreds of agents can work concurrently on a single codebase for weeks
Clear ownership: Planners own the big picture; workers own task completion
Parallel planning: Planning itself scales through sub-planner spawning
Reduced coordination overhead: Workers don't need to coordinate with each other
Combats tunnel vision: Iterative cycles with fresh starts prevent drift

Cons:

System complexity: Requires orchestration infrastructure for role separation and task distribution
Prompt engineering difficulty: Coordination behavior requires extensive prompt experimentation
Cost: Running hundreds of concurrent agents for weeks is expensive
Not perfectly efficient: Token waste is significant, although the system was far more effective than expected
Still evolving: Open issues remain. Planners should wake up when their tasks complete so they can plan the next step, and agents sometimes run for too long
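One possible way to approach the last point is a per-worker step budget combined with a completion callback that wakes the planner. This is a hedged sketch of a mitigation, not something the source describes: run_with_budget, dispatch, Outcome, and toy_steps are hypothetical names, and a worker is modeled as an iterator that yields True once its task is done.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class Outcome:
    task: str
    finished: bool
    steps_used: int


def run_with_budget(task: str, steps: Iterator[bool], max_steps: int) -> Outcome:
    """Advance a worker step by step; stop when it reports done or the budget is spent."""
    used = 0
    for done in itertools.islice(steps, max_steps):
        used += 1
        if done:
            return Outcome(task, True, used)
    return Outcome(task, False, used)


def dispatch(
    tasks: List[str],
    make_steps: Callable[[str], Iterator[bool]],
    on_complete: Callable[[Outcome], None],
    max_steps: int,
) -> None:
    for task in tasks:
        outcome = run_with_budget(task, make_steps(task), max_steps)
        # The planner is notified as soon as each task ends, finished or not.
        on_complete(outcome)


def toy_steps(task: str) -> Iterator[bool]:
    if task == "runaway":
        return itertools.repeat(False)  # never reports completion
    return iter([False, True])  # done on the second step


replan_queue: List[str] = []
finished: List[str] = []


def planner_wakeup(outcome: Outcome) -> None:
    # A real planner would re-plan here; this toy version only sorts outcomes.
    (finished if outcome.finished else replan_queue).append(outcome.task)


dispatch(["quick", "runaway"], toy_steps, planner_wakeup, max_steps=5)
print(finished, replan_queue)  # ['quick'] ['runaway']
```

Counting steps rather than wall-clock time keeps the sketch deterministic; a production system might budget tokens or elapsed time instead.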

Key Insights

Model choice matters: In Cursor's experiments, GPT-5.2 models were better at extended autonomous work than Opus 4.5, which tended to stop early and take shortcuts. Different models excel at different roles - GPT-5.2 was a better planner than GPT-5.1-codex, even though the latter is coding-specific.
Remove complexity: Many improvements came from removing complexity rather than adding it. An initial "integrator" role for quality control created more bottlenecks than it solved - workers were already capable of handling conflicts.
Middle structure: The right amount of structure is in the middle. Too little structure and agents conflict, duplicate work, and drift. Too much structure creates fragility.
Distributed systems don't always translate: Initial attempts to model the agent system on distributed computing and organizational design did not work for agents.

Examples

Cursor's experiments:

Web browser from scratch: 1 million lines of code across 1,000 files, running for close to a week
Solid to React migration: 3 weeks with +266K/-193K edits in the Cursor codebase
Video rendering optimization: 25x speedup with an efficient Rust rewrite
Java LSP: 7.4K commits, 550K LoC
Windows 7 emulator: 14.6K commits, 1.2M LoC
Excel clone: 12K commits, 1.6M LoC

References

Scaling long-running autonomous coding - Cursor blog post on running hundreds of concurrent agents for weeks at a time
Browser source code on GitHub - 1M+ lines of agent-generated code
Feudal Networks (FuN) - ICML 2017 paper introducing manager-worker separation in hierarchical RL (Vezhnevets et al.)
The Options Framework - Seminal work on temporal abstraction creating planning-execution hierarchy (Sutton et al., 1999)
HIRO: Hierarchical RL with Off-Policy Correction - NeurIPS 2018 paper on high-level planners and low-level workers (Nachum et al.)

Source: https://cursor.com/blog/scaling-agents