Reflection, Reflexion, and Self-Refine: Iterative Self-Correction
A family of patterns in which an agent critiques and revises its own output. Three named variants in the literature; one common structure underneath.
A family of agent patterns in which the system produces a candidate output, critiques it (using the same model, a different model, or an external signal), generates a revised output, and repeats until convergence or an iteration budget is exhausted.
Three named variants appear in the recent literature: Self-Refine (Madaan et al. 2023), single-model iterative improvement; Reflexion (Shinn et al. 2023), verbal reinforcement learning with explicit memory of past failures; and reflection (generic, framework-level), an outer critique loop wrapped around any inner agent.
The family
Self-Refine
Madaan et al. 2023 introduced Self-Refine as a single-model iterative improvement loop. The model generates output, then prompts itself with a critique prompt to identify weaknesses, then generates a revised output incorporating the critique. The loop continues until the model judges further refinement unnecessary or an iteration cap is hit.
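The shape of the loop is easy to pin down in code. A minimal sketch, assuming a generic `llm(prompt) -> str` callable; the prompts and the DONE convention here are illustrative, not the paper's exact wording:

```python
# Minimal Self-Refine sketch. `llm` is a stand-in for any text-in,
# text-out model call; prompts are illustrative placeholders.
def self_refine(llm, task: str, max_iters: int = 3) -> str:
    output = llm(f"Task: {task}\nProduce your best answer.")
    for _ in range(max_iters):
        critique = llm(
            f"Task: {task}\nDraft:\n{output}\n"
            "List concrete weaknesses. If none, reply DONE."
        )
        if "DONE" in critique:
            break  # model judges further refinement unnecessary
        output = llm(
            f"Task: {task}\nDraft:\n{output}\nCritique:\n{critique}\n"
            "Rewrite the draft to address every point in the critique."
        )
    return output
```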
The original paper reports gains across seven tasks (math, code optimisation, dialogue response, sentiment reversal, acronym generation, code readability, constrained generation), with the largest improvements on tasks where a clear constraint or rubric exists and the smallest on open-ended generation tasks.
Reflexion
Shinn et al. 2023 extended the idea to multi-episode learning. After each task attempt, the agent verbalises what went wrong (a self-reflection in natural language) and stores that reflection in an explicit memory. On the next attempt of the same or a similar task, the prior reflections are added to the prompt as context.
The framing in the paper is verbal reinforcement learning: the reflection memory plays the role of a learned policy, but encoded as natural-language text rather than gradient updates. The advantage is that the lesson is human-readable and immediately usable; the limitation is that the “policy” is whatever the model can articulate about its own failures, which is not always the same as the actual cause.
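A sketch of the episodic loop, assuming hypothetical `attempt` (one full task episode) and `evaluate` (an external pass/fail signal) callables:

```python
# Reflexion-style episodic loop (sketch). The reflection list is the
# "verbal policy": natural-language lessons carried across episodes.
def reflexion(llm, task: str, attempt, evaluate, max_episodes: int = 4) -> str:
    reflections = []  # lessons from past failures, in plain text
    result = ""
    for _ in range(max_episodes):
        result = attempt(task, "\n".join(reflections))  # one episode
        if evaluate(result):
            return result  # external signal says success
        # Verbalise the failure and store it for the next episode.
        reflections.append(llm(
            f"Task: {task}\nFailed attempt:\n{result}\n"
            "In one or two sentences, say what went wrong and what to try next."
        ))
    return result  # best effort after the episode budget
```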
Generic reflection (framework-level)
Beyond the named papers, framework documentation uses “reflection” loosely to mean any outer critique loop: a separate LLM call that judges the inner agent’s output, an automated test that checks the result, a human-review step that gates the next iteration. LangGraph’s reflection examples and CrewAI’s reviewer-role pattern both implement this generic shape.
The common structure
All three variants share four steps: produce a candidate output, critique it (same model, separate model, deterministic check), revise the output incorporating the critique, and iterate until convergence or budget exhausted.
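Written out once, the skeleton is small; `generate`, `critique`, and `revise` below are placeholders for whichever concrete implementations a variant chooses:

```python
from typing import Callable, Optional

# The four-step skeleton shared by all three variants (sketch).
# `critique` may prompt the same model, call a separate critic, or
# run a deterministic check; it returns None to signal convergence.
def reflect(
    generate: Callable[[], str],
    critique: Callable[[str], Optional[str]],
    revise: Callable[[str, str], str],
    budget: int = 3,
) -> str:
    candidate = generate()                       # 1. produce
    for _ in range(budget):                      # 4. iterate within budget
        feedback = critique(candidate)           # 2. critique
        if feedback is None:
            break                                # converged early
        candidate = revise(candidate, feedback)  # 3. revise
    return candidate
```

A deterministic critic drops straight in: `critique` can run the test suite and return the failure log, or None when everything passes.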
When it helps and when it does not
The empirical record across the source papers and follow-up work points to a clear pattern. Reflection helps most on tasks with three properties: a clear external signal exists (tests pass, a schema validates, a constraint is met or not met); the model is good at recognising the failure but struggles to produce the right output on the first attempt; and the iteration budget is generous enough for two or three rounds.
Reflection helps least on open-ended generation tasks (creative writing, persuasive argument), tasks where the model’s critique is as miscalibrated as its initial output (the “blind leading the blind” failure), and tasks where the cost of an additional iteration exceeds the value of the marginal quality improvement.
The cost problem
Each reflection iteration is one or more additional LLM calls. A loop with a separate critic model adds two calls per round (the critique and the revised generation) on top of the original generation, so a three-round loop can run seven LLM calls where a plain generation runs one. On long reasoning tasks this matters. Production deployments almost always cap iterations at two or three and add an early-exit condition that stops the loop when the critique reports no significant issues.
The cost-quality trade-off is explicit and worth measuring on the actual task: many benchmarks see most of the quality lift in the first revision, with diminishing returns after.
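Back-of-the-envelope call counting under those assumptions (one critique and one revision per round; actual token cost also depends on prompt and output lengths):

```python
# Call count for a reflection loop with a separate critic: one
# initial generation, then a critique and a revision per round.
def total_calls(rounds: int) -> int:
    return 1 + 2 * rounds

for rounds in range(4):
    print(f"{rounds} round(s): {total_calls(rounds)} LLM calls")
# 0 round(s): 1 ... 3 round(s): 7
```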
Composition with other patterns
Reflection composes naturally with both ReAct and plan-and-execute. The most common production architecture wraps a reflection loop around a ReAct inner loop: ReAct produces an answer through interleaved reasoning and acting; the reflection layer judges the final answer and either accepts it or sends the agent back. For plan-and-execute, reflection often sits between the plan and the execution, judging the plan before any action is taken; the agent re-plans if the critique flags issues.
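A sketch of the ReAct composition, with hypothetical `react_agent` (the inner loop, returning a final answer) and `judge` (the outer critic) callables:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    ok: bool
    notes: str = ""  # critique to feed back on rejection

# Outer reflection wrapped around a ReAct inner loop (sketch). The
# outer loop never sees the inner reasoning steps, only the answer.
def react_with_reflection(task: str, react_agent, judge, max_passes: int = 2) -> str:
    feedback = ""
    answer = ""
    for _ in range(max_passes):
        answer = react_agent(task, feedback)  # full inner ReAct episode
        verdict: Verdict = judge(task, answer)
        if verdict.ok:
            return answer                     # outer loop accepts
        feedback = verdict.notes              # send the agent back
    return answer                             # best effort after pass budget
```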
See the limitations reference for the autonomy implications: a reflection layer is one of the cheapest ways to insert a safety check before destructive actions.
Frequently asked questions
What is the difference between Self-Refine and Reflexion?
Self-Refine (Madaan et al. 2023) is single-model iterative improvement within a single task: produce an output, critique it, revise, repeat. Reflexion (Shinn et al. 2023) adds an explicit memory of past failures across episodes, so the agent learns from previous attempts at similar tasks. Both share the produce-critique-revise structure; Reflexion adds episodic memory.
Does reflection always improve agent quality?
No. The original Self-Refine paper reports gains on math, code optimisation, and dialogue tasks, but explicitly notes failure modes: when the model’s self-critique is as wrong as its initial output, iteration cannot help. Empirically the pattern works best when there is a clear external signal (a test that passes or fails, a constraint that can be checked) and worst on open-ended tasks where the critique step is just another opinion.
How many iterations are typical in production?
Two to three. Each iteration adds token cost, and the marginal quality gain falls off after a few rounds in most reported experiments. Production deployments hard-cap iterations, often paired with a token budget, rather than letting the loop run unbounded.
Is reflection compatible with ReAct?
Yes, and the most common composition is exactly that: a ReAct inner loop produces a candidate answer; an outer reflection loop critiques the answer and either accepts it or sends the agent back for another ReAct pass. The two patterns operate at different time scales (ReAct per-step, reflection per-output) and combine cleanly.
Related
- ReAct - the typical inner loop
- Evaluation tooling - the LLM-as-judge category
- Honest Limitations - reflection as a safety mechanism
Sources and Further Reading
- A. Madaan et al., Self-Refine: Iterative Refinement with Self-Feedback, arXiv:2303.17651 (2023).
- N. Shinn et al., Reflexion: Language Agents with Verbal Reinforcement Learning, arXiv:2303.11366 (2023).
- S. Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, arXiv:2210.03629 (2022).
- L. Wang et al., A Survey on Large Language Model based Autonomous Agents, arXiv:2308.11432 (2023).
- LangGraph, Reflection tutorial.
- CrewAI documentation, reviewer-role examples.