Tool Use Incentivization via Reward Shaping

Nikola Balic (@nibzard)

Problem

Coding agents often underutilize specialized tools (e.g., compilers, linters, test runners) when they are optimized only for final task success. They default to spending "thinking" tokens on internal chain-of-thought instead of invoking external tools, which can slow down development and lead to lower-quality code.

  • Models like R1 "use their think tokens" almost exclusively rather than calling tools unless explicitly rewarded for tool use.
  • Without intermediate rewards, the agent has no incentive to write code, compile, or run tests until the very end.
  • Sparse final rewards provide insufficient signal for learning optimal tool-use patterns across multi-step episodes.

Solution

Provide dense, shaped rewards for every intermediate tool invocation that contributes toward final code correctness. Key components:

1. Define Tool-Specific Reward Signals
  • Compile Reward: +1 if code compiles without errors.
  • Lint Reward: +0.5 if linter returns zero issues.
  • Test Reward: +2 if test suite passes a new test case.
  • Documentation Reward: +0.2 for adding or correcting docstrings.
  • Efficiency Reward: +0.1 for parallelizing independent tool calls; -0.05 for redundant invocations.
  • Format Reward: +0.2 for proper tool invocation schema compliance.
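The reward magnitudes listed above can be kept in one lookup table so they are easy to tune. The following is a minimal sketch, not a reference implementation: the signal names (`compile`, `lint`, and so on) and the `tool_reward` helper are hypothetical, and only the numeric values come from the list.

```python
# Reward magnitudes taken from the list above; the key names are hypothetical.
TOOL_REWARDS = {
    "compile": 1.0,   # code compiles without errors
    "lint": 0.5,      # linter returns zero issues
    "test": 2.0,      # test suite passes a new test case
    "docs": 0.2,      # docstring added or corrected
    "parallel": 0.1,  # independent tool calls were parallelized
    "format": 0.2,    # tool invocation matched the expected schema
}
REDUNDANCY_PENALTY = 0.05  # subtracted for a redundant invocation


def tool_reward(signal: str, success: bool, redundant: bool = False) -> float:
    """Return the shaped reward for a single tool invocation."""
    reward = TOOL_REWARDS[signal] if success else 0.0
    if redundant:
        reward -= REDUNDANCY_PENALTY
    return reward
```

Keeping the values in a single table makes it easier to rebalance them later if the agent starts favoring one tool.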

2. Episode-Level Aggregation
  • Sum intermediate rewards to form a cumulative "coding progress" score.
  • Combine with final reward (e.g., full test suite pass or PR merge) to guide policy updates.
  • Use turn-level credit assignment to attribute rewards correctly across multi-step tool sequences.
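One simple way to realize the aggregation is sketched below. It assumes the final reward is credited at the last turn and uses a discounted return-to-go as the per-turn credit; this is only one common form of turn-level credit assignment, not necessarily the method of the cited paper. The function names, `final_weight`, and `gamma` are illustrative assumptions.

```python
from typing import List, Sequence


def episode_return(local_rewards: Sequence[float], final_reward: float,
                   final_weight: float = 1.0) -> float:
    """Cumulative "coding progress" score plus the weighted final reward."""
    return sum(local_rewards) + final_weight * final_reward


def turn_level_returns(local_rewards: Sequence[float], final_reward: float,
                       gamma: float = 0.95) -> List[float]:
    """Discounted return-to-go per turn, with the final reward added at the last turn."""
    rewards = list(local_rewards)
    if not rewards:
        return []
    rewards[-1] += final_reward
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each turn sees the discounted sum of everything after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With `gamma` below 1, turns closer to the final outcome receive more of its credit, while earlier turns still see a discounted share.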

3. Policy Update Mechanism
  • Use Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C), or GRPO with these shaped rewards.
  • During each RL rollout, track (state, action, tool_result, local_reward) tuples.

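# Pseudo-code: at each RL step, after a tool call
local_reward = 0.0  # default for actions that carry no shaped reward

if action == "compile":
    # +1 on success; the -0.5 penalty on failure is an extra design choice
    local_reward = 1.0 if compile_success else -0.5
elif action == "run_tests":
    # +2 only when a previously failing test now passes
    local_reward = 2.0 if new_tests_passed else 0.0
elif action == "parallel_tool_batch":
    local_reward = 0.1  # efficiency bonus

# Penalize redundant calls (e.g., re-running a tool on unchanged code)
if is_redundant_call(action, history):
    local_reward -= 0.05

trajectory.append((state, action, tool_output, local_reward))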
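To connect the tracked tuples to a policy update, the sketch below sums each trajectory's local rewards and normalizes the results within a group of rollouts for the same task, in the spirit of GRPO's group-relative advantage. It is an illustration under assumptions: the helper names are hypothetical, the population standard deviation is one possible choice, and the actual policy-gradient step is omitted.

```python
import statistics
from typing import List, Sequence, Tuple

# (state, action, tool_output, local_reward), as tracked during a rollout
Step = Tuple[object, str, str, float]


def trajectory_return(trajectory: Sequence[Step], final_reward: float = 0.0) -> float:
    """Sum the shaped local rewards of one rollout and add its final reward."""
    return sum(step[3] for step in trajectory) + final_reward


def group_relative_advantages(returns: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize returns within a group: (R - mean) / (std + eps)."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    return [(r - mean) / (std + eps) for r in returns]
```

Rollouts that used tools productively earn a higher return than their group peers and therefore a positive advantage; when every rollout in a group scores the same, all advantages are zero and the group contributes no gradient signal.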

How to use it

  • Instrumentation: Wrap tool calls (e.g., compile(), run_linter(), pytest) with functions that return a binary or graded success signal.
  • Hyperparameter Tuning: Adjust reward magnitudes so that the agent does not "overfit" to one tool (e.g., getting lint rewards repeatedly without actual functionality).
  • Curriculum Design: Start with simpler tasks (e.g., "fix one failing test") to collect early positive signals and gradually scale to multi-file refactors.
  • Multi-Criteria Grading: Use weighted combinations of correctness, format, tool-use quality, and efficiency to prevent reward hacking.
  • RLAIF for Scalability: Consider AI-generated feedback (vs. human labels) for cost-effective reward signal generation at scale.
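The instrumentation and multi-criteria grading points above might look like the following sketch. It assumes each tool is a command-line program whose exit code signals success; `ToolResult`, `run_tool`, and `weighted_grade` are hypothetical names, and any weights passed in are placeholders to be tuned.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class ToolResult:
    success: bool  # binary success signal used by the reward function
    output: str    # combined stdout and stderr, returned to the agent


def run_tool(cmd: Sequence[str], timeout: float = 60.0) -> ToolResult:
    """Run a command-line tool and report success as exit code 0."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary or timeout counts as a failed invocation.
        return ToolResult(False, str(exc))
    return ToolResult(proc.returncode == 0, proc.stdout + proc.stderr)


def weighted_grade(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-criterion scores (each assumed to lie in [0, 1])."""
    total_weight = sum(weights[name] for name in scores)
    return sum(weights[name] * score for name, score in scores.items()) / total_weight
```

For example, `run_tool([sys.executable, "-m", "pytest", "-q"])` would wrap a test run, assuming pytest is installed in the agent's environment.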

Trade-offs

  • Pros:
      • Denser Feedback: Guides the agent step by step, reducing reliance on sparse, final success signals.
      • Tool Adoption: Encourages the agent to learn how and when to invoke compilers and test runners.
  • Cons/Considerations:
      • Reward Engineering Overhead: Requires careful design and maintenance of reward functions for each tool.
      • Potential Overfitting: The agent may game intermediate rewards (e.g., repeatedly running lint without changing code).
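One possible guard against the lint-gaming example is to pay a tool's reward at most once per distinct version of the code. The sketch below assumes the relevant source can be passed in as a string; the `ChangeGatedReward` class is hypothetical and is one mitigation among several.

```python
import hashlib
from typing import Set, Tuple


class ChangeGatedReward:
    """Pay a tool's reward only the first time it succeeds on a given code version."""

    def __init__(self) -> None:
        self._paid: Set[Tuple[str, str]] = set()

    def reward(self, tool: str, source: str, base_reward: float) -> float:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        key = (tool, digest)
        if key in self._paid:
            return 0.0  # same tool on unchanged code earns nothing
        self._paid.add(key)
        return base_reward
```

Because previously seen versions are remembered, toggling the code back and forth between two states does not earn repeated rewards either.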

References

  • Will Brown's discussion on how "if you set these models up to use tools, they just won't" unless incentivized.
  • Concepts from "Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment" (Prime Intellect paper previewed in talk).
  • Lightman et al. (2023). "Let's Verify Step by Step." Trains process-supervised reward models that score intermediate reasoning steps.
  • Shao et al. (2024). "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." Introduces Group Relative Policy Optimization (GRPO).
  • Yao et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023.

  • Primary source: https://www.youtube.com/watch?v=Xkwok_XXQgw