Reliability Problem Map Checklist for RAG and Agents
Problem
RAG pipelines and agent systems often fail in ways that are hard to diagnose: missing context, unstable retrieval, brittle tool contracts, and flaky behavior after data updates.
Teams frequently address these failures by iterating on prompts or tuning model settings first, which makes incidents feel random and expensive to fix.
This pattern addresses the need for a shared, repeatable triage routine that turns vague failures into actionable repair paths. Research shows structured incident data correlates with better reliability outcomes (ACM SIGOPS 2022, IEEE ISSRE 2019).
Solution
Use a fixed reliability checklist (the Problem Map) as the first step in every incident response. The checklist groups recurring failure classes across these four areas:
- Retrieval behavior
- Vector / index behavior
- Prompt and tool contracts
- Deployment and operational state
For each failing case, the team classifies the incident by answering the checklist items. Confirmed failures are then mapped to a small set of repair actions and re-tested against the same reproduction case.
This creates a consistent vocabulary for incidents and keeps responses structured instead of ad hoc.
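The mapping from confirmed failure classes to repair actions can be sketched as a small lookup. This is a minimal illustration, not the canonical Problem Map: the area keys and action strings here are hypothetical placeholders standing in for a team's real repair catalog.

```python
# Illustrative mapping from the four checklist areas to repair actions.
# Keys and action names are examples, not the official Problem Map vocabulary.
REPAIR_ACTIONS = {
    "retrieval": ["adjust chunking", "re-rank results"],
    "vector_index": ["rebuild embeddings", "rebuild index"],
    "prompt_tool_contract": ["correct prompt schema", "tighten tool signatures"],
    "deployment": ["repair ingestion ordering", "check service boot order"],
}

def repairs_for(confirmed_areas):
    """Collect the repair actions for every confirmed failure area."""
    actions = []
    for area in confirmed_areas:
        actions.extend(REPAIR_ACTIONS.get(area, []))
    return actions
```

Keeping this table explicit is what turns triage into a shared vocabulary: two engineers classifying the same incident should land on the same repair list.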
How to use it
- Capture one failing trace, query, or conversation.
- Run the 16 Problem Map checks against that case.
- Mark the active failure mode(s).
- Execute the repair actions associated with those modes (for example: adjust chunking, rebuild embeddings/indexes, correct prompt/tool contracts, or repair ingestion ordering).
- Re-run the same failure case and record which checks are now resolved.
Over time this becomes a lightweight incident memory bank for the team.
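The steps above can be sketched as a small incident record plus an append-only log. This is a minimal sketch under the assumption that each triage run is stored per failing case; field names are illustrative, not prescribed by the Problem Map.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Incident:
    """One failing trace plus its checklist outcome (illustrative fields)."""
    case_id: str
    failing_query: str
    active_modes: List[int] = field(default_factory=list)  # Problem Map mode numbers
    repairs_applied: List[str] = field(default_factory=list)
    resolved: bool = False

class IncidentLog:
    """Lightweight incident memory bank: append one record per triage run."""
    def __init__(self):
        self.records: List[Incident] = []

    def record(self, incident: Incident) -> None:
        self.records.append(incident)

    def history_for_mode(self, mode: int) -> List[Incident]:
        """All past incidents in which a given failure mode was active."""
        return [r for r in self.records if mode in r.active_modes]
```

Querying the log by mode number is what makes the memory bank useful: recurring modes point at systemic fixes rather than one-off patches.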
Common pitfalls to avoid
- Treating the checklist as a one-off instead of part of operations.
- Bundling multiple unrelated failures into a single fix.
- Changing prompts only while ignoring retrieval or data-layer issues.
Trade-offs
Pros:
- Gives teams a shared language for common agent/RAG failures.
- Works across LLM providers, orchestration stacks, and vector stores.
- Encourages targeted fixes instead of blind prompt changes.
- Produces a repeatable incident history.
Cons/Considerations:
- Requires discipline to run the full triage sequence first.
- Not a replacement for automated evaluations or metrics.
- Repair actions still need to be maintained for each tech stack.
Example
Consider a RAG system where answers cite the wrong snippet despite high vector similarity scores. Traditional debugging would involve tuning retriever weights or adjusting prompts.
Using the WFGY Problem Map:
- Run through the checklist and identify Problem #1 (Hallucination & Chunk Drift) and Problem #5 (Semantic != Embedding).
- For Problem #1: apply the Delta S meter to measure semantic tension (>0.60 indicates failure) and use lambda_observe to flag divergent logic flow.
- For Problem #5: apply BBMC residue minimization to compute semantic residue and reject chunks with high tension.
- Re-run the same query and verify Delta S <= 0.45 and coverage >= 0.70.
Result: The system now explicitly rejects misleading chunks before generation, creating a "semantic firewall" rather than patching after output.
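The acceptance step of the example can be expressed as a single gate. This is a sketch of the threshold check only; how Delta S and coverage are actually computed is assumed to come from the WFGY instruments and is not reproduced here.

```python
def passes_semantic_firewall(delta_s: float, coverage: float,
                             max_delta_s: float = 0.45,
                             min_coverage: float = 0.70) -> bool:
    """Acceptance gate from the example: an answer passes only when
    semantic tension is at or below 0.45 and retrieval coverage is
    at or above 0.70. Threshold values follow the document."""
    return delta_s <= max_delta_s and coverage >= min_coverage
```

A failing gate blocks generation and sends the case back into triage instead of letting a misleading answer through.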
Technical Details
The 16 Problem Map Modes
| # | Problem Domain | What Breaks |
|---|---|---|
| 1 | [IN] hallucination & chunk drift | retrieval returns wrong/irrelevant content |
| 2 | [RE] interpretation collapse | chunk is right, logic is wrong |
| 3 | [RE] long reasoning chains | drifts across multi-step tasks |
| 4 | [RE] bluffing / overconfidence | confident but unfounded answers |
| 5 | [IN] semantic != embedding | cosine match != true meaning |
| 6 | [RE] logic collapse & recovery | dead-ends, needs controlled reset |
| 7 | [ST] memory breaks across sessions | lost threads, no continuity |
| 8 | [IN] debugging is a black box | no visibility into failure path |
| 9 | [ST] entropy collapse | attention melts, incoherent output |
| 10 | [RE] creative freeze | flat, literal outputs |
| 11 | [RE] symbolic collapse | abstract/logical prompts break |
| 12 | [RE] philosophical recursion | self-reference loops, paradox traps |
| 13 | [ST] multi-agent chaos | agents overwrite or misalign logic |
| 14 | [OP] bootstrap ordering | services fire before deps ready |
| 15 | [OP] deployment deadlock | circular waits in infra |
| 16 | [OP] pre-deploy collapse | version skew / missing secret on first call |
Layer Codes: [IN] Input & Retrieval, [RE] Reasoning & Planning, [ST] State & Context, [OP] Infra & Deployment
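The table above can be encoded directly as a mode-to-layer lookup, which is handy when filtering an incident log by layer. The data below is transcribed from the table; only the helper function is new.

```python
# Layer code for each of the 16 Problem Map modes, taken from the table above.
MODE_LAYER = {
    1: "IN", 2: "RE", 3: "RE", 4: "RE", 5: "IN", 6: "RE",
    7: "ST", 8: "IN", 9: "ST", 10: "RE", 11: "RE", 12: "RE",
    13: "ST", 14: "OP", 15: "OP", 16: "OP",
}

def modes_in_layer(layer: str):
    """Return the sorted Problem Map mode numbers for one layer code."""
    return sorted(m for m, code in MODE_LAYER.items() if code == layer)
```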
Core WFGY Instruments
- Delta S (ΔS): Measures semantic tension (threshold: ≤0.45 good, >0.60 failure)
- lambda_observe: Monitors logic directionality (convergent, divergent, chaotic)
- epsilon_resonance: Domain-level harmony tuning
- BBMC: Minimizes semantic residue
- BBCR: Rollback and branch spawn for logic recovery
- BBPF: Maintains divergent branches
- BBAM: Suppresses noisy tokens
- Semantic Tree: Hierarchical memory structure with Delta S-tagged nodes
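The chunk-rejection behavior attributed to BBMC above can be sketched as a pre-generation filter. This is an assumed shape, not the WFGY implementation: `delta_s_of` stands in for whatever instrument scores a chunk's semantic tension, and the 0.60 default is the failure threshold stated for Delta S.

```python
def filter_chunks(chunks, delta_s_of, threshold=0.60):
    """Pre-generation filter (sketch): drop retrieved chunks whose
    semantic tension exceeds the failure threshold, instead of
    patching the answer after output. delta_s_of is an assumed
    callable that returns the tension score for one chunk."""
    kept, rejected = [], []
    for chunk in chunks:
        (rejected if delta_s_of(chunk) > threshold else kept).append(chunk)
    return kept, rejected
```

Running this before generation is what makes the approach a firewall: high-tension chunks never reach the prompt.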
Key Insight
WFGY implements a "semantic firewall" that validates semantic stability before generation rather than patching after output. Once a failure mode is clearly mapped and monitored, it tends to stay fixed for that configuration. This checklist-based triage approach appears to be a novel contribution: we are not aware of direct academic or industry research on RAG/agent-specific debugging that uses this four-area taxonomy.
References
- WFGY Problem Map README
- WFGY Problem Map Repository
- Technical Deep Dive Report
- Semantic Clinic Index
- Grandma's Clinic (Beginner-Friendly)
- "Agentic Retrieval-Augmented Generation: A Survey" (arXiv:2501.09136, 2025)