The ReAct Loop Is a Prototype. This Is What Production Looks Like.

The agent calls a tool. Gets a result. Calls another tool. Gets another result. Somewhere around turn 40 it's still calling tools, the output has lost coherence, and you've burned 180,000 tokens on a question the first 15 turns had already answered.

This is the loop spiral. Not a bug in any specific implementation. A structural property of while-loop agents: they have no concept of what phase they're in, which tools are legal right now, how many turns they're allowed, or what it takes to move forward. Every one of those decisions gets delegated to the model on every turn.

LangChain's 2026 State of Agent Engineering survey reports that 57% of teams have agents in production, and that quality is the #1 reported problem. The root cause in most cases isn't the model. It's that nobody built the structure that tells the model when to stop exploring and start concluding.

A phase machine makes that structure explicit. This post shows what one looks like in a production analytics agent: 12 phases, per-phase turn budgets, tool allowlists enforced at the API layer, human-in-the-loop gates that are distinct from phase transitions, and a budget pressure mechanism that steers the agent toward completion rather than killing it.

What a loop can't tell you

A while-loop agent can tell you one thing: what the model's current output is. Ask it anything structural and the answer is "it depends on the model."

How far along is the agent? It depends. Which tools is it allowed to call right now? All of them. When does it stop? When the model decides. What does it take to advance to the next step? There is no next step.

This matters because agents fail in predictable ways when they have no phase concept. An agent that can call record_finding in the first turn has no reason to build a solid analytical foundation before recording conclusions. An agent that can call submit_analysis_plan during execution can rewrite its methodology after seeing the data. An agent with no per-phase budget cap will keep calling tools until it hits a hard token limit, at which point it produces truncated output that's often worse than nothing.

Posts like Oracle's overview of the agent loop do a good job describing what a loop gives you. The gap is what it doesn't give you: enforcement. Frameworks like LangGraph give you a graph abstraction over this, which is genuinely useful for complex cyclical workflows. What they don't give you is the design decisions inside a node: what budget does this phase get, which tools are legal here, how does the agent know it's done. Those are architectural decisions you make regardless of which orchestration layer you use.

There's another failure mode loops don't advertise. A loop that runs long enough accumulates so much context that earlier instructions become effectively invisible. The model at turn 40 is not attending to the same things it was at turn 1. System prompt instructions get pushed down by tool call results, observations, and intermediate reasoning. Phase transitions reset this: each new phase opens with a focused system prompt scoped to that phase's purpose rather than a 40-turn accumulation of everything that came before. The context window is an asset in a phase machine. In a loop, it becomes a liability the longer the session runs.

Phases, budgets, and legal moves

The full phase sequence for a multi-step analytical agent:

class DataAnalysisPhase(StrEnum):
    UNDERSTAND         = "UNDERSTAND"
    CLARIFY            = "CLARIFY"
    CLASSIFY           = "CLASSIFY"
    PLAN               = "PLAN"
    PLAN_ANALYSIS      = "PLAN_ANALYSIS"
    EXECUTE            = "EXECUTE"
    ROBUSTNESS_SWEEP   = "ROBUSTNESS_SWEEP"
    VALIDATE           = "VALIDATE"
    CRITIQUE           = "CRITIQUE"
    ADVERSARIAL_REVIEW = "ADVERSARIAL_REVIEW"
    REPORT             = "REPORT"
    FOLLOW_UP          = "FOLLOW_UP"

Each phase has a turn budget that resets on every phase transition. The budgets are not arbitrary:

CAP_ORCHESTRATOR_TURNS_PER_PHASE = {
    DataAnalysisPhase.UNDERSTAND:          4,
    DataAnalysisPhase.CLARIFY:             6,
    DataAnalysisPhase.CLASSIFY:            6,
    DataAnalysisPhase.PLAN:                8,
    DataAnalysisPhase.PLAN_ANALYSIS:      12,
    DataAnalysisPhase.EXECUTE:            30,
    DataAnalysisPhase.ROBUSTNESS_SWEEP:   15,
    DataAnalysisPhase.VALIDATE:            8,
    DataAnalysisPhase.CRITIQUE:            8,
    DataAnalysisPhase.ADVERSARIAL_REVIEW: 50,
    DataAnalysisPhase.REPORT:              4,
    DataAnalysisPhase.FOLLOW_UP:          40,
}
# 191 total orchestrator turns across all phases

UNDERSTAND gets 4 because schema profiling is mostly deterministic. EXECUTE gets 30 because real analysis requires room to iterate. ADVERSARIAL_REVIEW gets 50 because it runs an adversarial critic loop: the agent is specifically prompted to find flaws in prior findings, and that requires enough turns to exhaust the genuine possibilities before concluding there are none worth acting on.

The key design property: a phase cannot borrow turns from another. EXECUTE doesn't get 60 turns because the preceding phase only used 2. The consumed-turn counter resets to zero on every _advance_phase() call, and the new phase starts with exactly its own allocation. This is what makes a phase machine auditable in a way a loop isn't. You can look at a completed session and know exactly how many turns each phase consumed, which phases hit their budget, and where the work concentrated.
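The no-borrowing rule is easy to state and easy to get wrong, so here is a minimal sketch of one way to track it. PhaseTurnBudget, its fields, and the trimmed three-phase Phase enum are illustrative names, not the production code; the caps are copied from the table above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(str, Enum):
    """Trimmed stand-in for DataAnalysisPhase (three phases only)."""
    UNDERSTAND = "UNDERSTAND"
    EXECUTE = "EXECUTE"
    VALIDATE = "VALIDATE"


# Caps copied from the table above for the three phases used here.
TURN_CAPS = {Phase.UNDERSTAND: 4, Phase.EXECUTE: 30, Phase.VALIDATE: 8}


@dataclass
class PhaseTurnBudget:
    """Hypothetical tracker: one counter for the current phase, no carry-over."""
    phase: Phase
    used: int = 0
    consumed_by_phase: dict = field(default_factory=dict)  # audit trail

    @property
    def remaining(self) -> int:
        return TURN_CAPS[self.phase] - self.used

    def spend_turn(self) -> None:
        if self.remaining <= 0:
            raise RuntimeError(f"{self.phase.value} turn budget exhausted")
        self.used += 1

    def advance(self, target: Phase) -> None:
        # Record what the closing phase consumed, then start the new phase
        # at zero. Unused turns are dropped, not transferred.
        self.consumed_by_phase[self.phase] = self.used
        self.phase = target
        self.used = 0


budget = PhaseTurnBudget(Phase.UNDERSTAND)
budget.spend_turn()
budget.spend_turn()              # UNDERSTAND used 2 of its 4 turns
budget.advance(Phase.EXECUTE)
print(budget.remaining)          # 30: the 2 unused turns did not carry over
```

The consumed_by_phase dict is what makes the audit question ("which phases hit their budget?") answerable after the session ends.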

On top of per-phase turn budgets, each phase carries an advisory token cap:

PHASE_TOKEN_CAPS = {
    DataAnalysisPhase.UNDERSTAND:          5_000,
    DataAnalysisPhase.PLAN_ANALYSIS:      50_000,
    DataAnalysisPhase.EXECUTE:           200_000,
    DataAnalysisPhase.ROBUSTNESS_SWEEP:  100_000,
    DataAnalysisPhase.ADVERSARIAL_REVIEW: 30_000,
    DataAnalysisPhase.REPORT:             50_000,
    DataAnalysisPhase.FOLLOW_UP:         100_000,
}

At 75% of EXECUTE's 200k cap, the system can surface "Extend budget?" to the user. Advisory caps degrade gracefully. Hard turn budgets enforce the phase boundary. The two serve different purposes, and neither replaces the other.
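A rough sketch of the advisory check, assuming the 75% threshold applies to every capped phase and that phases missing from the table simply have no advisory cap. Both are assumptions; should_offer_extension and EXTEND_PROMPT_FRACTION are made-up names, and the dict is trimmed to two entries.

```python
# Threshold from the text above; applying it to every phase is an assumption.
EXTEND_PROMPT_FRACTION = 0.75

# Two entries copied from PHASE_TOKEN_CAPS, keyed by phase name for brevity.
PHASE_TOKEN_CAPS = {"UNDERSTAND": 5_000, "EXECUTE": 200_000}


def should_offer_extension(phase: str, tokens_used: int) -> bool:
    """Return True once a phase crosses the advisory threshold.

    Advisory only: the caller decides whether to prompt the user.
    Nothing here stops the phase.
    """
    cap = PHASE_TOKEN_CAPS.get(phase)
    if cap is None:          # assumption: no entry means no advisory cap
        return False
    return tokens_used >= EXTEND_PROMPT_FRACTION * cap


print(should_offer_extension("EXECUTE", 149_999))  # False
print(should_offer_extension("EXECUTE", 150_000))  # True
```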

The tool allowlist is where structural enforcement lives:

_PHASE_TOOL_ALLOWLIST = {
    "submit_objective_classification": frozenset({DataAnalysisPhase.CLASSIFY}),
    "submit_analysis_plan":            frozenset({DataAnalysisPhase.PLAN_ANALYSIS}),
    "seasonality_specialist":          frozenset({DataAnalysisPhase.EXECUTE, DataAnalysisPhase.FOLLOW_UP}),
    "correlation_specialist":          frozenset({DataAnalysisPhase.EXECUTE, DataAnalysisPhase.FOLLOW_UP}),
    "anomaly_specialist":              frozenset({DataAnalysisPhase.EXECUTE, DataAnalysisPhase.FOLLOW_UP}),
    "record_finding":                  frozenset({DataAnalysisPhase.EXECUTE, DataAnalysisPhase.VALIDATE, DataAnalysisPhase.FOLLOW_UP}),
    "finalize":                        frozenset({DataAnalysisPhase.EXECUTE, DataAnalysisPhase.VALIDATE, DataAnalysisPhase.REPORT}),
    "run_python":                      frozenset({DataAnalysisPhase.EXECUTE, DataAnalysisPhase.VALIDATE, DataAnalysisPhase.CRITIQUE, DataAnalysisPhase.FOLLOW_UP}),
}
# Tools absent from this dict are unrestricted

If the agent tries to call submit_analysis_plan during EXECUTE, the call is rejected before the model sees a result. This is not prompt-based enforcement ("please don't resubmit your plan during execution"). It's structural: the tool isn't available in that phase. The model cannot call what isn't there.

This distinction matters more than it sounds. Prompt-based constraints drift under load. An agent that has been running for 25 turns in EXECUTE is working with a different effective context than it was at turn 1. Long chains of tool calls and observations push earlier instructions out of effective range. Structural constraints don't drift. The allowlist check runs on every tool call regardless of what's in context.
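As a sketch of what "enforced at the API layer" can look like, the check below runs in two places: once when building the tool list sent to the model, and again at dispatch time as a backstop. The function names, the trimmed enum, and the list_columns tool are hypothetical; only the allowlist semantics (absent means unrestricted) come from the table above.

```python
from enum import Enum


class Phase(str, Enum):
    """Trimmed stand-in for DataAnalysisPhase."""
    PLAN_ANALYSIS = "PLAN_ANALYSIS"
    EXECUTE = "EXECUTE"
    VALIDATE = "VALIDATE"


PHASE_TOOL_ALLOWLIST = {
    "submit_analysis_plan": frozenset({Phase.PLAN_ANALYSIS}),
    "record_finding": frozenset({Phase.EXECUTE, Phase.VALIDATE}),
}


def is_tool_allowed(tool_name: str, phase: Phase) -> bool:
    allowed = PHASE_TOOL_ALLOWLIST.get(tool_name)
    # Tools absent from the dict are unrestricted.
    return allowed is None or phase in allowed


def tools_for_phase(all_tools: list, phase: Phase) -> list:
    """Filter the tool list before it is sent with the model request."""
    return [name for name in all_tools if is_tool_allowed(name, phase)]


def dispatch(tool_name: str, phase: Phase) -> dict:
    """Backstop check at call time, independent of what is in context."""
    if not is_tool_allowed(tool_name, phase):
        return {"error": f"{tool_name} is not available in {phase.value}"}
    return {"ok": tool_name}  # a real dispatcher would run the tool here


# list_columns is a made-up unrestricted tool.
ALL_TOOLS = ["submit_analysis_plan", "record_finding", "list_columns"]
print(tools_for_phase(ALL_TOOLS, Phase.EXECUTE))
# ['record_finding', 'list_columns']
```

Filtering the tool list is what makes the tool "not there" from the model's point of view; the dispatch check covers the case where a stale tool name still shows up in a response.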

Transitions and gates are not the same thing

This is the distinction that most human-in-the-loop discussions miss, and getting it wrong leads to systems that are either fully automatic (no user checkpoints) or frozen (every step requires user input).

A transition is code-driven. _advance_phase(EXECUTE) fires when the plan is approved and the workflow moves forward without pausing.

A gate is a blocking wait for human input. The workflow suspends until the user responds:

async def _open_gate(self, kind: str, ...) -> ResearchDecision:
    self._current_phase = f"gate_{kind}"
    self._awaiting_gate = True
    await workflow.wait_condition(
        lambda: self._pending_decision is not None or self._close_requested
    )
    # Returns ResearchDecision: approve | revise | cancel

Named gates in a full analytical session: schema fires during UNDERSTAND, objective during CLASSIFY, plan during PLAN_ANALYSIS, adversarial during ADVERSARIAL_REVIEW, and two interactive gates (clarify, ask_user) that can fire mid-EXECUTE when the agent needs user input to proceed.

The phase machine tracks these as distinct states at the string level:

self._current_phase = f"gate_{kind}"  # "gate_plan", "gate_objective", etc.

current_phase = "CLASSIFY" means the agent is running. current_phase = "gate_objective" means it has stopped and is waiting for a decision. These are genuinely different states and the frontend needs to surface them differently. An agent sitting at a gate is not stalled. It's at a deliberate checkpoint.

The practical consequence of the distinction: when a user revises at gate_plan, the orchestrator loops back within PLAN_ANALYSIS with the revision feedback. The plan is not locked yet. The agent can revise as many times as the user wants. Once the user approves and the code-driven transition to EXECUTE fires, the plan locks. No further changes to the methodology during analysis. The gate is the last moment revision is possible without triggering a new analysis run.

A system that only has transitions has no user checkpoints. A system that only has gates is unusable because every step requires human approval. The right design uses both, at the right moments.
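To make the gate/transition split concrete, here is a small asyncio stand-in for the plan gate. It is a sketch under assumptions: the real excerpt waits on a workflow condition, while this version polls a flag, and GatedPlanner, plan_until_approved, and plan_drafts are invented for the example.

```python
import asyncio
from typing import Optional


class GatedPlanner:
    """Hypothetical stand-in for the orchestrator; only gate logic is modeled."""

    def __init__(self) -> None:
        self.current_phase = "PLAN_ANALYSIS"
        self.plan_drafts = 0
        self._pending_decision: Optional[str] = None

    def submit_decision(self, decision: str) -> None:
        # Called from the UI side: "approve", "revise", or "cancel".
        self._pending_decision = decision

    async def open_gate(self, kind: str) -> str:
        resume_phase = self.current_phase
        self.current_phase = f"gate_{kind}"    # frontend sees a wait state
        while self._pending_decision is None:  # blocking wait for the human
            await asyncio.sleep(0.001)
        decision, self._pending_decision = self._pending_decision, None
        self.current_phase = resume_phase
        return decision

    async def plan_until_approved(self) -> int:
        while True:
            self.plan_drafts += 1               # stand-in for drafting a plan
            decision = await self.open_gate("plan")
            if decision == "approve":
                self.current_phase = "EXECUTE"  # code-driven transition
                return self.plan_drafts
            if decision == "cancel":
                raise RuntimeError("cancelled at gate_plan")
            # "revise": loop again, still inside PLAN_ANALYSIS


async def demo():
    wf = GatedPlanner()
    task = asyncio.create_task(wf.plan_until_approved())
    await asyncio.sleep(0.01)
    seen = [(wf.current_phase, wf.plan_drafts)]   # waiting on draft 1
    wf.submit_decision("revise")
    await asyncio.sleep(0.01)
    seen.append((wf.current_phase, wf.plan_drafts))  # waiting on draft 2
    wf.submit_decision("approve")
    drafts = await task
    seen.append((wf.current_phase, wf.plan_drafts))  # plan locked
    return seen, drafts


seen, drafts = asyncio.run(demo())
print(seen)  # [('gate_plan', 1), ('gate_plan', 2), ('EXECUTE', 2)]
```

The revise path never leaves PLAN_ANALYSIS; only approval triggers the transition, which is the moment the plan locks.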

Non-linear re-entry without arbitrary jumps

A strictly linear phase machine works well for first-pass analysis. It doesn't handle follow-up questions well, because not all follow-ups require the same re-entry point.

A user who asks "why did churn spike in Q3?" after reading a report about Q4 revenue trends needs new analysis: a fresh EXECUTE budget, a new set of findings. A user who asks "can you present the same findings in a table instead of prose?" needs nothing more than a fresh REPORT budget with a different formatting instruction. Sending both to EXECUTE hands a 30-turn budget to a question that needs 4.

The FOLLOW_UP re-entry set handles this:

FOLLOW_UP_REENTRY = frozenset({
    DataAnalysisPhase.EXECUTE,
    DataAnalysisPhase.ROBUSTNESS_SWEEP,
    DataAnalysisPhase.VALIDATE,
    DataAnalysisPhase.CRITIQUE,
    DataAnalysisPhase.ADVERSARIAL_REVIEW,
    DataAnalysisPhase.REPORT,
})

From FOLLOW_UP, the machine can jump to any phase in this set with a fresh budget for that phase. It cannot jump to UNDERSTAND or CLASSIFY. The graph is constrained. Illegal transitions raise ValueError, which is different from a loop where any tool call is legal at any time.

def _advance_phase(self, target: DataAnalysisPhase) -> None:
    """
    Legal moves:
      - Forward by exactly one step in PHASE_SEQUENCE.
      - From FOLLOW_UP to any phase in FOLLOW_UP_REENTRY (re-entry).
    Raises ValueError on illegal moves.
    Appends (target, iso_timestamp) to _phase_history.
    Resets orchestrator budget for the new phase.
    """

The re-entry point decision is made by the phase machine based on the follow-up question type, not by the model mid-execution. Non-linear doesn't mean the model can jump wherever it wants. It means the legal move set is wider at specific points, for specific reasons.
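The docstring above only lists the rules, so here is one possible body for it. This is a sketch, not the production method: it assumes PHASE_SEQUENCE is simply the enum's declaration order, and PhaseMachine, current, and turns_used are illustrative names.

```python
from datetime import datetime, timezone
from enum import Enum


class DataAnalysisPhase(str, Enum):
    UNDERSTAND = "UNDERSTAND"
    CLARIFY = "CLARIFY"
    CLASSIFY = "CLASSIFY"
    PLAN = "PLAN"
    PLAN_ANALYSIS = "PLAN_ANALYSIS"
    EXECUTE = "EXECUTE"
    ROBUSTNESS_SWEEP = "ROBUSTNESS_SWEEP"
    VALIDATE = "VALIDATE"
    CRITIQUE = "CRITIQUE"
    ADVERSARIAL_REVIEW = "ADVERSARIAL_REVIEW"
    REPORT = "REPORT"
    FOLLOW_UP = "FOLLOW_UP"


# Assumption: the sequence is the declaration order of the enum.
PHASE_SEQUENCE = tuple(DataAnalysisPhase)

FOLLOW_UP_REENTRY = frozenset({
    DataAnalysisPhase.EXECUTE,
    DataAnalysisPhase.ROBUSTNESS_SWEEP,
    DataAnalysisPhase.VALIDATE,
    DataAnalysisPhase.CRITIQUE,
    DataAnalysisPhase.ADVERSARIAL_REVIEW,
    DataAnalysisPhase.REPORT,
})


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


class PhaseMachine:
    def __init__(self) -> None:
        self.current = PHASE_SEQUENCE[0]
        self.turns_used = 0
        self._phase_history = [(self.current, _now())]

    def _advance_phase(self, target: DataAnalysisPhase) -> None:
        next_index = PHASE_SEQUENCE.index(self.current) + 1
        is_next_step = (
            next_index < len(PHASE_SEQUENCE)
            and PHASE_SEQUENCE[next_index] is target
        )
        is_reentry = (
            self.current is DataAnalysisPhase.FOLLOW_UP
            and target in FOLLOW_UP_REENTRY
        )
        if not (is_next_step or is_reentry):
            raise ValueError(
                f"illegal transition {self.current.value} -> {target.value}"
            )
        self.current = target
        self.turns_used = 0  # fresh budget for the new phase
        self._phase_history.append((target, _now()))


machine = PhaseMachine()
machine._advance_phase(DataAnalysisPhase.CLARIFY)      # legal: one step forward
try:
    machine._advance_phase(DataAnalysisPhase.EXECUTE)  # illegal: skips phases
except ValueError as exc:
    print(exc)  # illegal transition CLARIFY -> EXECUTE
```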

Budget pressure as instruction

The naive response to budget exhaustion is a hard kill. The agent hits its token limit, the activity terminates, and the output is whatever the model had generated up to that point. Usually mid-sentence. Always incomplete. Unusable in most production contexts.

A better approach steers the agent toward completion before the limit hits. Two pressure thresholds:

BUDGET_CONSOLIDATE_FRACTION = 0.70
BUDGET_FINALIZE_FRACTION    = 0.90

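def pressure(self) -> str | None:
    # Assumes the budget object also tracks self.turns_used and
    # self.turn_limit (illustrative names) alongside consumed_fraction.
    f = self.consumed_fraction
    counts = (self.turns_used, self.turn_limit)
    if f >= BUDGET_FINALIZE_FRACTION:
        return "[BUDGET %d/%d] Call finalize NOW: do not call more tools." % counts
    if f >= BUDGET_CONSOLIDATE_FRACTION:
        return "[BUDGET %d/%d] Approaching limit: consolidate and prepare to finalize." % counts
    return None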

At 70% of the phase budget consumed, the agent gets "consolidate and prepare to finalize." At 90%, "call finalize NOW." This is injected into the next turn's system prompt. The agent reads it and adjusts behavior. No special control flow. No new tool. The budget state drives the instruction.

In practice: an agent approaching its EXECUTE budget stops calling new analytical tools, consolidates findings it already has, and calls finalize to close the phase cleanly. The difference in output quality between a budget-pressured finalization and a hard-kill termination is significant. Hard kills produce fragments. Pressure-guided finalization produces a complete, if compressed, output.
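Here is a self-contained sketch of the injection step, assuming the note is appended to the end of the phase's base system prompt. The placement, the function names, and the base prompt text are all assumptions.

```python
from typing import Optional

BUDGET_CONSOLIDATE_FRACTION = 0.70
BUDGET_FINALIZE_FRACTION = 0.90


def pressure_note(turns_used: int, turn_limit: int) -> Optional[str]:
    fraction = turns_used / turn_limit
    prefix = f"[BUDGET {turns_used}/{turn_limit}]"
    if fraction >= BUDGET_FINALIZE_FRACTION:
        return f"{prefix} Call finalize NOW: do not call more tools."
    if fraction >= BUDGET_CONSOLIDATE_FRACTION:
        return f"{prefix} Approaching limit: consolidate and prepare to finalize."
    return None


def system_prompt_for_next_turn(base: str, turns_used: int, turn_limit: int) -> str:
    """Append the pressure note, if any, to the phase's base prompt."""
    note = pressure_note(turns_used, turn_limit)
    return base if note is None else f"{base}\n\n{note}"


# First turn counts at which each note appears for a 30-turn EXECUTE budget.
first_consolidate = next(t for t in range(31) if pressure_note(t, 30))
first_finalize = next(
    t for t in range(31) if "NOW" in (pressure_note(t, 30) or "")
)
print(first_consolidate, first_finalize)  # 21 27
```

Because the note is derived from budget state on every turn, there is no extra control flow to maintain: the prompt changes when the numbers do.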

Two budgets coexist in every session:

ANALYSIS_MAX_TURNS:           int   = 150    # total turns across entire session
BUDGET_CONSOLIDATE_FRACTION:  float = 0.70
BUDGET_FINALIZE_FRACTION:     float = 0.90
FINALIZE_HEADROOM_TURNS:      int   = 4

Per-phase budgets reset on every transition. The global session budget never resets. An agent can burn through EXECUTE's 30-turn budget, transition to VALIDATE, and the global budget continues tracking cumulative cost across every phase. The per-phase budgets sum to 191 turns, which is more than the global cap of 150, so the two limits bind in different situations. On a session where every phase runs to its maximum, the global cap binds first and the session never reaches 191. On a typical session, where most phases finish under budget, the per-phase caps do the work: they prevent any single phase from consuming the bulk of that 150-turn global allowance.
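The interaction between the two limits is simple arithmetic, and it is worth running once. The sketch below assumes a worst-case first pass where every phase spends its full allocation in sequence and the global cap is a hard stop. It ignores FINALIZE_HEADROOM_TURNS, and the helper name is made up.

```python
ANALYSIS_MAX_TURNS = 150

# Per-phase turn caps from the table earlier in the post, in phase order.
CAPS = {
    "UNDERSTAND": 4, "CLARIFY": 6, "CLASSIFY": 6, "PLAN": 8,
    "PLAN_ANALYSIS": 12, "EXECUTE": 30, "ROBUSTNESS_SWEEP": 15,
    "VALIDATE": 8, "CRITIQUE": 8, "ADVERSARIAL_REVIEW": 50,
    "REPORT": 4, "FOLLOW_UP": 40,
}


def phase_cut_short(caps: dict, global_cap: int):
    """Return (phase, turns_left) for the first phase the global cap cuts short,
    or None if every phase fits."""
    spent = 0
    for phase, cap in caps.items():
        if spent + cap > global_cap:
            return phase, global_cap - spent
        spent += cap
    return None


print(sum(CAPS.values()))                         # 191
print(phase_cut_short(CAPS, ANALYSIS_MAX_TURNS))  # ('REPORT', 3)
```

Under those assumptions the global cap first bites in REPORT, which gets 3 of its 4 turns, and FOLLOW_UP gets none. That is one argument for tracking the global budget and the pressure notes early rather than discovering the shortfall at the end.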

What a full session looks like

A single analysis session, with turn counts:

UNDERSTAND opens. Schema profiling runs. 4 turns. A gate fires: the user confirms the schema interpretation or corrects it before the agent builds a plan on top of a wrong assumption. Code-driven transition to CLARIFY (if needed) or directly to CLASSIFY.

CLASSIFY runs for up to 6 turns, determining the analysis type: inferential, diagnostic, predictive, or causal. Gate fires: the user confirms the objective classification. Transition to PLAN_ANALYSIS.

PLAN_ANALYSIS gets 12 turns to build a typed analysis plan with enumerated steps, a decision rule, and a falsifiability statement. Gate fires: the user reviews the plan. Revisions loop PLAN_ANALYSIS until the user approves. Approval locks the plan and transitions to EXECUTE.

EXECUTE gets 30 turns. At turn 21, the first pressure note fires. At turn 27, if the agent hasn't finalized, the second note fires. The phase closes with a complete set of labeled findings.

VALIDATE and CRITIQUE run for up to 8 turns each, checking findings against the data and the plan. ADVERSARIAL_REVIEW gets 50 turns for a full adversarial pass: the agent is prompted to find flaws in the findings, not to confirm them. Gate fires: the user sees the adversarial verdict before the report builds.

REPORT runs in 4 turns. User reads. A follow-up question arrives. The question requires new analysis: the machine re-enters EXECUTE with a fresh 30-turn budget. A question that's purely about report framing re-enters REPORT with a fresh 4-turn budget.

Total budget: 150 turns at the global session level, which is the ceiling for the whole session. The 191 per-phase total is a theoretical maximum that the global cap keeps any session from reaching. Every transition is logged to an append-only phase history with timestamps:

self._phase_history: list[tuple[DataAnalysisPhase, str]] = []  # (phase, iso_timestamp)

The full execution is auditable after the fact: which phase ran when, how long it took between transitions, whether any gates were opened and how long they sat waiting for user input. That kind of operational visibility doesn't exist in a loop because a loop has no phase boundaries to record.
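A minimal sketch of the kind of audit query the history supports, assuming entries are (phase, iso_timestamp) pairs in transition order. The timestamps below are invented for illustration, and phase_durations is a hypothetical helper.

```python
from datetime import datetime


def phase_durations(history: list) -> list:
    """Seconds spent in each phase, from consecutive transition timestamps.

    The last entry has no successor, so the still-open phase is omitted.
    """
    durations = []
    for (phase, entered), (_next_phase, left) in zip(history, history[1:]):
        elapsed = datetime.fromisoformat(left) - datetime.fromisoformat(entered)
        durations.append((phase, elapsed.total_seconds()))
    return durations


# Made-up history; phase names are plain strings here for brevity.
sample_history = [
    ("UNDERSTAND", "2025-01-01T10:00:00+00:00"),
    ("CLARIFY", "2025-01-01T10:00:40+00:00"),
    ("CLASSIFY", "2025-01-01T10:05:40+00:00"),
]

print(phase_durations(sample_history))
# [('UNDERSTAND', 40.0), ('CLARIFY', 300.0)]
```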

The design decision nobody publishes

The structural pattern here, named phases with budgets and legal move sets, is not new. Academic work on structured LLM workflows describes similar ideas in the context of SOPs. That work covers step sequencing and procedural correctness. It doesn't include a budget allocation table: concrete turn and token numbers per phase, and the reasoning behind setting them where they land. That's the part missing from what has been published, and it's the part that determines whether the design holds up under real production load rather than just on paper.

ADVERSARIAL_REVIEW at 50 turns sounds expensive until you've watched an unconstrained adversarial loop spend 150 turns finding the same three issues it found on turn 5. UNDERSTAND at 4 turns sounds tight until you realize that anything requiring more than 4 turns to profile a schema is probably doing something wrong in that phase.

How do you currently decide when a phase is done? If the answer is "the model decides," you have a loop, and the costs and quality failures that follow are structural rather than incidental. The interesting engineering question is what allocation you'd arrive at for your specific workflow, and whether the numbers look anything like these.