Inference-Healed Code Review Reward
Problem
Simple reward functions that only check for "all tests passed" fail to capture nuanced code quality issues (e.g., performance regressions, style violations, missing edge-case handling). A single binary signal at the end cannot guide the agent to produce maintainable, high-quality code.
- Verifying only final correctness misses suboptimal commits (e.g., a patch that removes error handling but still passes tests).
- Reward models that produce a single scalar cannot tell the agent which aspect of the code needs improvement.
Solution
Use an inference-healed reward model, a code-review critic that:
1. Decomposes Code Quality into Subcriteria (see the sketch below for one such critic)
- Correctness: Does the code pass all existing and newly added tests?
- Style: Are linters (e.g., ESLint, pylint) satisfied (zero or minimal warnings)?
- Performance: Are there clear performance regressions, as gauged by simple benchmarks?
- Security: Does a static analyzer (e.g., Bandit, SonarQube) flag no critical issues?
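As a minimal sketch of one such subcriterion critic, here a style scorer that shells out to pylint; the function name and the 0.1-per-warning deduction are illustrative assumptions, and pylint must be installed:

```python
import json
import subprocess

def style_score(path: str) -> float:
    """Score a file on the style subcriterion, mapped into [0, 1]."""
    # pylint's JSON output is a list with one object per reported message.
    result = subprocess.run(
        ["pylint", "--output-format=json", path],
        capture_output=True, text=True,
    )
    messages = json.loads(result.stdout or "[]")
    # Illustrative mapping: zero warnings -> 1.0, each warning costs 0.1.
    return max(0.0, 1.0 - 0.1 * len(messages))
```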
2. Runs Internal Chain-of-Thought (CoT) Reasoning
- If uncertain about a subcriterion (e.g., performance), the critic runs a short CoT inside itself:
```text
"Step: performance check. Baseline runtime: 50ms. New code runtime: 65ms.
Regression > 20%. Score: 0.4."
```
- This "inference healing" allows the reward model to explain each sub-score.
3. Aggregates Subscores (a worked example follows the feedback JSON below)
- Each subcriterion returns a float ∈ [0, 1].
- A weighted sum (e.g., 0.4 × correctness + 0.2 × style + 0.2 × performance + 0.2 × security) yields the final code-review score.
4. Generates Human-Readable Feedback
- Alongside a numerical score, return a short analysis:
```json
{
  "correctness": 1.0,
  "style": 0.8,
  "performance": 0.4,
  "security": 0.6,
  "comments": "Performance regression due to O(n²) loop."
}
```
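Plugging these example sub-scores into the default weights from step 3 gives a final score of 0.4 × 1.0 + 0.2 × 0.8 + 0.2 × 0.4 + 0.2 × 0.6 = 0.76.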
Example
```python
# Pseudo-code for one code-review reward invocation; the critic objects
# stand in for the subcriterion scorers described in the Solution.
WEIGHTS = {"correctness": 0.4, "style": 0.2, "performance": 0.2, "security": 0.2}

def inference_healed_reward(patch):
    subscores = {
        "correctness": test_critic.score(patch),
        "style": linter_critic.score(patch),
        "performance": perf_critic.score(patch),
        "security": security_critic.score(patch),
    }
    final_score = sum(WEIGHTS[k] * subscores[k] for k in subscores)
    # The critic LLM condenses its CoT justifications into feedback (step 4).
    comments = review_critic.comment(patch, subscores)
    return final_score, subscores, comments
```
How to use it
- Critic Dataset Collection: Gather examples of good vs. bad code patches, labeled along each subcriterion.
- Critic Training: Fine-tune a small LLM (e.g., 1–2B parameters) to produce sub-scores and CoT justifications.
- Integration into RL Loop: Replace or augment the existing binary "tests-passed" reward with `inference_healed_reward(patch)` (see the sketch after this list).
- Human-in-the-Loop Checkpoints: If a patch is borderline (e.g., final_score ∈ [0.5, 0.7]), route it for manual code review to generate better labels for future training.
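A minimal sketch of that integration, assuming the inference_healed_reward function from the Example above and a hypothetical review_queue consumed by human reviewers:

```python
BORDERLINE = (0.5, 0.7)  # route these to manual review, per the checkpoint above
review_queue = []        # hypothetical queue consumed by human reviewers

def reward_fn(patch):
    """Drop-in replacement for the binary tests-passed reward in the RL loop."""
    final_score, subscores, comments = inference_healed_reward(patch)
    if BORDERLINE[0] <= final_score <= BORDERLINE[1]:
        # Borderline patches become labeled examples for future critic training.
        review_queue.append(
            {"patch": patch, "subscores": subscores, "comments": comments}
        )
    return final_score  # the RL algorithm consumes the scalar reward
```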
Trade-offs
- Pros:
- Explainable Feedback: The agent knows why a patch scored poorly, allowing targeted improvements.
- Higher Code Quality: Incorporates non-functional criteria (performance, security), leading to more robust code.
- Cons/Considerations:
- Compute Overhead: Each reward invocation may involve running tests, linters, benchmarks, and static analysis, adding latency.
- Critic Maintenance: As coding standards or security rules evolve, retrain or update the critic models and rubrics.
References
- Derived from "inference healing" in reward modeling, as discussed in the Open Source Agent RL talk (May 2025) and by Will Brown (Prime Intellect).
- Similar principles in "Criterion-Led Reward Models" (DeepMind blog, April 2025).