Self-Critique Evaluator Loop
Problem
Human preference labels are costly to collect and go stale quickly: annotations gathered on an older model's outputs stop discriminating once base models improve.
Solution
Train a self-taught evaluator that bootstraps from synthetic data (a runnable sketch follows the list):
- Generate multiple candidate outputs for an instruction.
- Ask the model to judge and explain which is better (reasoning trace).
- Fine-tune the judge on its own traces, keeping only traces whose verdict can be verified (e.g. the pair included a deliberately weakened candidate); iterate.
- Use the judge as a reward model or quality gate for the main agent.
- Periodically refresh with new synthetic debates to stay ahead of model drift.
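A minimal sketch of the loop, assuming a generic `complete(prompt)` model call and a `finetune(traces)` job, both stubbed here so the control flow runs as-is; the pairing of a normal response with a deliberately weakened one (so the correct verdict is known and wrong traces can be dropped) follows the filtering idea in Wang et al.

```python
import random

def complete(prompt: str) -> str:
    # Stub: replace with a real model/API call.
    if "Verdict" in prompt:
        return "...step-by-step reasoning...\nVerdict: " + random.choice("AB")
    return f"response to: {prompt[:40]}"

def finetune(traces: list[dict]) -> None:
    # Stub: replace with your fine-tuning job on the collected traces.
    print(f"fine-tuning judge on {len(traces)} traces")

JUDGE_PROMPT = (
    "Instruction:\n{instruction}\n\n"
    "Candidate A:\n{a}\n\nCandidate B:\n{b}\n\n"
    "Explain step by step which candidate better follows the instruction, "
    "then end with 'Verdict: A' or 'Verdict: B'."
)

def judge(instruction: str, a: str, b: str) -> tuple[str, str]:
    """Return (reasoning_trace, verdict) for one candidate pair."""
    trace = complete(JUDGE_PROMPT.format(instruction=instruction, a=a, b=b))
    verdict = "A" if trace.rstrip().endswith("Verdict: A") else "B"
    return trace, verdict

def bootstrap_round(instructions: list[str]) -> list[dict]:
    """One iteration: generate pairs, judge them, keep verifiable traces."""
    traces = []
    for instruction in instructions:
        # Pair a normal response with a deliberately weakened one so the
        # correct verdict ("A") is known in advance.
        a = complete(instruction)
        b = complete(instruction + " (answer carelessly)")
        trace, verdict = judge(instruction, a, b)
        if verdict == "A":  # keep only traces that reach the known verdict
            traces.append({"instruction": instruction, "trace": trace})
    return traces

if __name__ == "__main__":
    tasks = ["Summarize special relativity for a ten-year-old."]
    for _ in range(3):  # each round fine-tunes the judge on its kept traces
        finetune(bootstrap_round(tasks))
```

The key design choice is the known-better pairing: without some verifiable verdict to filter on, iterating on the judge's own traces can simply amplify its existing biases.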
Pros & Cons
- Pros: near-human evaluation accuracy without human labels; scales with compute rather than annotation budget.
- Cons: risk of evaluator-model collusion (judge and generator sharing blind spots); needs adversarial tests, sketched below.
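To make "needs adversarial tests" concrete, here is a hypothetical harness (names and checks are illustrative, not from the paper) with two cheap probes: an order-swap consistency check and a planted-error check. The `judge` stub has an intentional position bias so both probes fail visibly; swap in the real trained judge to run them for real.

```python
def judge(instruction: str, a: str, b: str) -> str:
    # Stub with a planted flaw: always prefers the first candidate.
    return "A"

def position_consistent(instruction: str, good: str, bad: str) -> bool:
    """A sound judge must flip its verdict when the pair is swapped."""
    return judge(instruction, good, bad) == "A" and judge(instruction, bad, good) == "B"

def catches_planted_error(instruction: str, good: str, corrupted: str) -> bool:
    """If judge and generator share blind spots, planted errors slip through."""
    return judge(instruction, corrupted, good) == "B"

if __name__ == "__main__":
    q = "What is 2 + 2?"
    print("position-consistent:", position_consistent(q, "4", "5"))      # False: bias exposed
    print("catches planted error:", catches_planted_error(q, "4", "22")) # False: collusion-style miss
```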
References
- Wang et al. (2024), "Self-Taught Evaluators".