
Inference-Healed Code Review Reward

Problem

Simple reward functions that only check for "all tests passed" fail to capture nuanced code quality issues (e.g., performance regressions, style violations, missing edge-case handling). A single binary signal at the end cannot guide the agent to produce maintainable, high-quality code.

  • Verifying only final correctness misses suboptimal commits (e.g., a patch that removes error handling but still passes tests).
  • Reward models that produce a single scalar offer no explanation of which aspect of the code needs improvement.

Solution

Use an inference-healed reward model, a code-review critic that:

1. Decomposes Code Quality into Subcriteria
   • Correctness: Does the code pass all existing and newly added tests?
   • Style: Are linters (e.g., ESLint, pylint) satisfied, with zero or minimal warnings? (See the linter sketch after this list.)
   • Performance: Does the code avoid clear performance regressions, as gauged by simple benchmarks?
   • Security: Does a static analyzer (e.g., Bandit, SonarQube) flag no critical issues?

2. Runs Internal Chain-of-Thought (CoT) Reasoning
   • If uncertain about a subcriterion (e.g., performance), the critic runs a short CoT inside itself:

     ```text
     Step: performance check. Baseline runtime: 50ms. New code runtime: 65ms. Regression > 20%. Score: 0.4.
     ```

   • This "inference healing" allows the reward model to explain each sub-score. (A sketch of the performance check follows this list.)

3. Aggregates Subscores
   • Each subcriterion returns a float ∈ [0, 1].
   • A weighted sum (e.g., 0.4 × correctness + 0.2 × style + 0.2 × performance + 0.2 × security) yields the final code-review score. (A worked example follows this list.)

4. Generates Human-Readable Feedback
   • Alongside the numerical score, return a short analysis:

     ```json
     {
       "correctness": 1.0,
       "style": 0.8,
       "performance": 0.4,
       "security": 0.6,
       "comments": "Performance regression due to O(n²) loop."
     }
     ```
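To make item 1 concrete, here is a minimal sketch of one subcriterion scorer: a style critic that runs pylint on a file and maps the finding count to a score in [0, 1]. The `style_score` name, the subprocess invocation, and the ten-finding cutoff are illustrative assumptions, not a prescribed interface.

```python
import subprocess

def style_score(path: str) -> float:
    """Sketch: run pylint and map warning/convention findings to [0, 1]."""
    result = subprocess.run(
        ["pylint", "--score=n", path],  # --score=n suppresses pylint's own 0-10 score
        capture_output=True, text=True,
    )
    # pylint prints one "path:line:col: CODE: message" line per finding;
    # count warning (W...) and convention (C...) codes.
    n_findings = sum(
        1 for line in result.stdout.splitlines()
        if ": W" in line or ": C" in line
    )
    # Normalize: zero findings -> 1.0; ten or more -> 0.0 (arbitrary cutoff).
    return max(0.0, 1.0 - n_findings / 10)
```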
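Items 2 and 3 can likewise be grounded in a few lines. The sketch below reproduces the performance CoT above as an explicit check; the 20% threshold and the 0.4 floor mirror the example and are otherwise arbitrary. And with the sub-scores from the JSON snippet, the weighted sum of item 3 works out to 0.4 × 1.0 + 0.2 × 0.8 + 0.2 × 0.4 + 0.2 × 0.6 = 0.76.

```python
def performance_score(baseline_ms: float, new_ms: float) -> float:
    """Sketch of the performance sub-check described in the CoT example."""
    regression = (new_ms - baseline_ms) / baseline_ms
    if regression <= 0.0:      # as fast or faster: full credit
        return 1.0
    if regression <= 0.20:     # mild slowdown: linear partial credit
        return 1.0 - regression
    return 0.4                 # e.g., 50ms -> 65ms is a 30% regression

assert performance_score(50, 65) == 0.4
```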

Example

```python
# One code-review reward invocation. The critic objects are assumed to expose
# score(patch) -> float in [0, 1]; feedback_critic.comment is an assumed helper
# that produces the human-readable analysis shown above.
WEIGHTS = {"correctness": 0.4, "style": 0.2, "performance": 0.2, "security": 0.2}

def inference_healed_reward(patch):
    subscores = {
        "correctness": test_critic.score(patch),
        "style": linter_critic.score(patch),
        "performance": perf_critic.score(patch),
        "security": security_critic.score(patch),
    }
    # Weighted aggregation of the four subcriteria (weights from the Solution section).
    final_score = sum(WEIGHTS[k] * subscores[k] for k in subscores)
    comments = feedback_critic.comment(patch, subscores)
    return final_score, subscores, comments
```

How to use it

  • Critic Dataset Collection: Gather examples of good vs. bad code patches, labeled along each subcriterion.
  • Critic Training: Fine-tune a small LLM (e.g., 1–2B parameters) to produce sub-scores and CoT justifications.
  • Integration into RL Loop: Replace or augment the existing binary "tests-passed" reward with inference_healed_reward(patch); a sketch follows this list.
  • Human-in-the-Loop Checkpoints: If a patch is borderline (e.g., final_score ∈ [0.5, 0.7]), route it for manual code review to generate better labels for future training.
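A hedged sketch of the last two items, assuming generic policy, task, and review_queue interfaces (none of these names come from a specific library):

```python
def training_step(policy, task, review_queue, lo=0.5, hi=0.7):
    """One RL step: score a generated patch, gate borderline cases for review."""
    patch = policy.generate(task)
    final_score, subscores, comments = inference_healed_reward(patch)
    if lo <= final_score <= hi:
        # Borderline patch: queue for manual review to harvest better labels.
        review_queue.put((task, patch, subscores, comments))
    # Use the dense, decomposed reward in place of the binary tests-passed signal.
    policy.update(task, patch, reward=final_score)
    return final_score
```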

Trade-offs

  • Pros:
      • Explainable Feedback: The agent knows why a patch scored poorly, allowing targeted improvements.
      • Higher Code Quality: Incorporating non-functional criteria (performance, security) leads to more robust code.
  • Cons/Considerations:
      • Compute Overhead: Each reward invocation may involve running tests, linters, benchmarks, and static analysis, adding latency.
      • Critic Maintenance: As coding standards or security rules evolve, the critic models and rubrics must be retrained or updated.

References

  • Derived from "inference healing" in reward modeling, as discussed in the Open Source Agent RL talk (May 2025) and by Will Brown (Prime Intellect).
  • Similar principles in "Criterion-Led Reward Models" (DeepMind blog, April 2025).