Explicit Posterior-Sampling Planner

Problem

Agents that rely on ad-hoc heuristics explore poorly, wasting tokens and API calls on dead ends.

Solution

Embed a fully specified RL algorithm—Posterior Sampling for Reinforcement Learning (PSRL)—inside the LLM's reasoning:

  • Maintain a Bayesian posterior over task models.
  • Sample a model, compute an optimal plan/policy, execute, observe reward, update posterior.
  • Express each step in natural language so the core LLM can carry it out with tool calls.
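The loop above can be sketched in code. This is a minimal illustration, not the paper's implementation: it instantiates PSRL in the simplest setting, a Bernoulli bandit with independent Beta posteriors per arm, where "compute an optimal plan" reduces to picking the best arm of the sampled model. All class and function names are hypothetical.

```python
import random

class PSRLBandit:
    """Minimal PSRL loop: Beta(1, 1) prior over each arm's success rate."""

    def __init__(self, n_arms):
        self.alpha = [1.0] * n_arms  # posterior successes + 1
        self.beta = [1.0] * n_arms   # posterior failures + 1

    def sample_model(self):
        # Draw one plausible task model from the current posterior.
        return [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]

    def plan(self, model):
        # Optimal policy for the sampled model: its highest-reward arm.
        return max(range(len(model)), key=model.__getitem__)

    def update(self, arm, reward):
        # Bayesian update after observing a binary reward.
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

def run(agent, true_probs, steps):
    """Sample a model, act on its optimal plan, observe, update, repeat."""
    total = 0
    for _ in range(steps):
        arm = agent.plan(agent.sample_model())
        reward = 1 if random.random() < true_probs[arm] else 0
        agent.update(arm, reward)
        total += reward
    return total
```

In an LLM agent each of these methods would be a natural-language instruction rather than numeric code (e.g. "state one concrete hypothesis consistent with your observations"), but the control flow is the same.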

How to use it

Wrap the algorithm in a reusable prompt template or code skeleton the LLM can fill.
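One way such a template might look (a hypothetical sketch, not the paper's prompt): each PSRL step is spelled out as a numbered natural-language instruction, and the agent's interaction history is interpolated into the prompt on every turn.

```python
# Hypothetical prompt skeleton: each PSRL step written out so the core
# LLM can execute it with tool calls. Names and wording are illustrative.
PSRL_PROMPT = """\
You are an agent solving: {task}

Follow Posterior Sampling for RL explicitly:
1. POSTERIOR: Summarize your current beliefs about the task's dynamics
   and rewards, given the history below.
2. SAMPLE: Commit to ONE concrete hypothesis drawn from those beliefs.
3. PLAN: Derive the best plan assuming that hypothesis is true.
4. ACT: Execute the first step of the plan via a tool call.
5. UPDATE: After observing the result, revise your beliefs and repeat.

History of (action, observation, reward):
{history}
"""

def render_prompt(task, history):
    """Fill the skeleton with the task and the (action, obs, reward) log."""
    lines = "\n".join(f"- {a} -> {o} (r={r})" for a, o, r in history)
    return PSRL_PROMPT.format(task=task, history=lines or "- (none yet)")
```

The skeleton is re-rendered after every tool call, so the UPDATE step of one turn feeds the POSTERIOR step of the next.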

References

  • Arumugam & Griffiths, Toward Efficient Exploration by Large Language Model Agents