Self-Critique Evaluator Loop
Problem
Human preference labels are costly to collect and go stale quickly: annotations gathered on an older model's outputs stop discriminating once base models improve.
Solution
Train a self-taught evaluator that bootstraps from synthetic data (a runnable sketch follows the list):
- Generate multiple candidate outputs for an instruction.
- Ask the model to judge and explain which is better (reasoning trace).
- Fine-tune the judge on its own traces, keeping only traces whose verdict can be verified (e.g. the pair included a deliberately weakened candidate); iterate.
- Use the judge as a reward model or quality gate for the main agent.
- Periodically refresh with new synthetic debates to stay ahead of model drift.
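A minimal sketch of the loop, assuming a generic `complete(prompt)` model call and a `finetune(traces)` job, both stubbed here so the control flow runs as-is; the pairing of a normal response with a deliberately weakened one (so the correct verdict is known and wrong traces can be dropped) follows the filtering idea in Wang et al.

```python
import random

def complete(prompt: str) -> str:
    # Stub: replace with a real model/API call.
    if "Verdict" in prompt:
        return "...step-by-step reasoning...\nVerdict: " + random.choice("AB")
    return f"response to: {prompt[:40]}"

def finetune(traces: list[dict]) -> None:
    # Stub: replace with your fine-tuning job on the collected traces.
    print(f"fine-tuning judge on {len(traces)} traces")

JUDGE_PROMPT = (
    "Instruction:\n{instruction}\n\n"
    "Candidate A:\n{a}\n\nCandidate B:\n{b}\n\n"
    "Explain step by step which candidate better follows the instruction, "
    "then end with 'Verdict: A' or 'Verdict: B'."
)

def judge(instruction: str, a: str, b: str) -> tuple[str, str]:
    """Return (reasoning_trace, verdict) for one candidate pair."""
    trace = complete(JUDGE_PROMPT.format(instruction=instruction, a=a, b=b))
    verdict = "A" if trace.rstrip().endswith("Verdict: A") else "B"
    return trace, verdict

def bootstrap_round(instructions: list[str]) -> list[dict]:
    """One iteration: generate pairs, judge them, keep verifiable traces."""
    traces = []
    for instruction in instructions:
        # Pair a normal response with a deliberately weakened one so the
        # correct verdict ("A") is known in advance.
        a = complete(instruction)
        b = complete(instruction + " (answer carelessly)")
        trace, verdict = judge(instruction, a, b)
        if verdict == "A":  # keep only traces that reach the known verdict
            traces.append({"instruction": instruction, "trace": trace})
    return traces

if __name__ == "__main__":
    tasks = ["Summarize special relativity for a ten-year-old."]
    for _ in range(3):  # each round fine-tunes the judge on its kept traces
        finetune(bootstrap_round(tasks))
```

The key design choice is the known-better pairing: without some verifiable verdict to filter on, iterating on the judge's own traces can simply amplify its existing biases.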
Pros & Cons
- Pros: near-human evaluation accuracy without human labels; scales with compute rather than annotation budget.
- Cons: risk of evaluator-model collusion (judge and generator sharing blind spots); needs adversarial tests, sketched below.
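To make "needs adversarial tests" concrete, here is a hypothetical harness (names and checks are illustrative, not from the paper) with two cheap probes: an order-swap consistency check and a planted-error check. The `judge` stub has an intentional position bias so both probes fail visibly; swap in the real trained judge to run them for real.

```python
def judge(instruction: str, a: str, b: str) -> str:
    # Stub with a planted flaw: always prefers the first candidate.
    return "A"

def position_consistent(instruction: str, good: str, bad: str) -> bool:
    """A sound judge must flip its verdict when the pair is swapped."""
    return judge(instruction, good, bad) == "A" and judge(instruction, bad, good) == "B"

def catches_planted_error(instruction: str, good: str, corrupted: str) -> bool:
    """If judge and generator share blind spots, planted errors slip through."""
    return judge(instruction, corrupted, good) == "B"

if __name__ == "__main__":
    q = "What is 2 + 2?"
    print("position-consistent:", position_consistent(q, "4", "5"))      # False: bias exposed
    print("catches planted error:", catches_planted_error(q, "4", "22")) # False: collusion-style miss
```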
References
- Wang et al. (2024), "Self-Taught Evaluators".