Semantic Context Filtering Pattern

Nikola Balic (@nibzard)

Problem

Raw data sources are too verbose and noisy for effective LLM consumption. Full representations include invisible elements, implementation details, and irrelevant information that bloat context and confuse reasoning.

Research on boilerplate detection shows that 40-80% of web page content is typically navigation, footers, ads, and other boilerplate that should be filtered before semantic processing (Kohlschütter et al., WSDM 2010).

This creates several problems:

Token explosion: Raw data exceeds context limits or becomes prohibitively expensive
Poor signal-to-noise: LLM wastes reasoning capacity on irrelevant details
Slower inference: More tokens = slower generation and higher costs
Confused reasoning: Noise leads to hallucinations or wrong conclusions

The issue appears across domains:

Web scraping: Full HTML DOM includes scripts, styles, tracking iframes
API responses: JSON with nested metadata, internal fields, debug info
Document processing: Headers, footers, navigation, boilerplate text
Code analysis: Comments, whitespace, boilerplate code

Solution

Extract only the semantic, interactive, or relevant elements from raw data. Filter out noise and provide the LLM with a clean representation that captures what matters for reasoning.

Core Principle

Don't send raw data to the LLM. Send semantic abstractions.

This approach is validated across production systems including browser automation tools (Puppeteer/Playwright accessibility trees), RAG frameworks (LangChain, LlamaIndex semantic chunking), and code analysis tools (Aider's AST-based repo-map).

Example 1: Browser Accessibility Tree

Instead of full HTML DOM:

<!-- Raw HTML (10,000+ tokens) -->
<html>
  <head>
    <script src="analytics.js"></script>
    <style>body { margin: 0; }</style>
  </head>
  <body>
    <div class="tracking-pixel" style="display:none"></div>
    <iframe src="ad-server.com"></iframe>
    <nav aria-label="Navigation">
      <a href="/">Home</a>
      <a href="/about">About</a>
    </nav>
    <main>
      <button id="login-button">Login</button>
      <input type="email" id="email" name="email" placeholder="Email" />
    </main>
    <footer>Copyright 2024</footer>
  </body>
</html>

Extract the accessibility tree (100-200 tokens):

{
  "interactiveElements": [
    {
      "role": "link",
      "name": "Home",
      "xpath": "/html/body/nav/a[1]"
    },
    {
      "role": "link",
      "name": "About",
      "xpath": "/html/body/nav/a[2]"
    },
    {
      "role": "button",
      "name": "Login",
      "id": "login-button",
      "xpath": "/html/body/main/button"
    },
    {
      "role": "textbox",
      "name": "Email",
      "id": "email",
      "xpath": "/html/body/main/input"
    }
  ]
}

Implementation:

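// Use the browser's built-in accessibility tree (Puppeteer API shown)
const tree = await page.accessibility.snapshot({
  interestingOnly: true  // Prune nodes the browser considers uninteresting
});

// The accessibility tree already leaves out:
// - Elements with aria-hidden="true"
// - Elements with display:none
// - Non-semantic divs and spans
//
// Ad/tracking iframes are not recognized by domain here; drop them in a
// separate step, for example by checking frame URLs against a blocklist.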
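The snapshot comes back as a nested tree, while the JSON above is a flat list. A minimal Python sketch of that flattening step, assuming each node is a dict with `role`, `name`, and optional `children` keys. The `INTERACTIVE_ROLES` set and the `flatten_interactive` helper are illustrative names, not part of any library:

```python
from typing import Any

# Hypothetical role whitelist; tune it per task.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "radio", "combobox", "slider"}


def flatten_interactive(node: dict[str, Any]) -> list[dict[str, str]]:
    """Walk a nested accessibility snapshot and keep only interactive nodes."""
    found = []
    if node.get("role") in INTERACTIVE_ROLES:
        found.append({"role": node["role"], "name": node.get("name", "")})
    for child in node.get("children", []):
        found.extend(flatten_interactive(child))
    return found


# Assumed snapshot shape, mirroring the HTML example above.
snapshot = {
    "role": "WebArea",
    "name": "Example",
    "children": [
        {"role": "navigation", "name": "Navigation", "children": [
            {"role": "link", "name": "Home"},
            {"role": "link", "name": "About"},
        ]},
        {"role": "button", "name": "Login"},
        {"role": "textbox", "name": "Email"},
        {"role": "text", "name": "Copyright 2024"},
    ],
}

elements = flatten_interactive(snapshot)
```

Container nodes such as the navigation landmark are traversed but not emitted, so the links inside them survive while the wrapper itself is dropped.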

Example 2: API Response Filtering

Raw API responses often include internal metadata:

// Raw API response (2,000 tokens)
{
  "data": {
    "users": [
      {
        "id": "123",
        "name": "Alice",
        "email": "alice@example.com",
        "internalFlags": ["vip", "beta_tester"],
        "metadata": {
          "created_at": "2024-01-01",
          "updated_at": "2024-01-15",
          "version": 42
        }
      }
    ]
  },
  "_internal": {
    "requestId": "req-abc123",
    "latency": 45,
    "cache": "HIT",
    "debug": []
  },
  "_links": {
    "self": "/users",
    "next": "/users?page=2"
  }
}

Filter to semantic fields only:

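// FieldSchema is an assumed shape: which collection to read and which
// fields of each record to keep, e.g. { collection: "users", relevantFields: ["name", "email"] }
interface FieldSchema {
  collection: string;
  relevantFields: string[];
}

function filterAPIResponse(response: any, schema: FieldSchema): any {
  const records: any[] = response.data?.[schema.collection] ?? [];

  // Project every record onto the whitelisted fields
  const filtered = records.map(record => {
    const kept: Record<string, unknown> = {};
    for (const field of schema.relevantFields) {
      // Test for presence, not truthiness, so 0, "" and false survive
      if (field in record) kept[field] = record[field];
    }
    return kept;
  });

  return { [schema.collection]: filtered };
}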

// Result (200 tokens):
{
  "users": [
    {
      "name": "Alice",
      "email": "alice@example.com"
    }
  ]
}
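A whitelist works when the schema is known in advance. When it is not, a complementary deny-list can strip keys that are internal by convention. A minimal Python sketch, assuming that underscore-prefixed keys (like `_internal` and `_links` above) and an explicit deny-list are noise; `strip_internal` and `DENY_KEYS` are hypothetical names:

```python
from typing import Any

# Assumed deny-list, taken from the example response above.
DENY_KEYS = {"internalFlags", "metadata"}


def strip_internal(value: Any) -> Any:
    """Recursively drop underscore-prefixed and deny-listed keys."""
    if isinstance(value, dict):
        return {
            key: strip_internal(child)
            for key, child in value.items()
            if not key.startswith("_") and key not in DENY_KEYS
        }
    if isinstance(value, list):
        return [strip_internal(item) for item in value]
    return value  # Scalars pass through unchanged


raw = {
    "data": {"users": [{
        "id": "123", "name": "Alice", "email": "alice@example.com",
        "internalFlags": ["vip", "beta_tester"],
        "metadata": {"created_at": "2024-01-01", "version": 42},
    }]},
    "_internal": {"requestId": "req-abc123", "latency": 45},
    "_links": {"self": "/users", "next": "/users?page=2"},
}

clean = strip_internal(raw)
```

Unlike the whitelist, this version keeps `id`, because nothing marks it as internal. A deny-list errs toward keeping data, which suits the "start conservative" advice later in this pattern.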

Example 3: Document Section Extraction

Full documents include boilerplate:

FULL DOCUMENT:
====================
COMPANY CONFIDENTIAL [Header repeated on every page]
Copyright 2024 Acme Corp. All rights reserved.

[Legal disclaimer spanning 3 pages]

EXECUTIVE SUMMARY
====================
The Q4 revenue increased by 15%...

[Navigation menu]
- Table of Contents
- Index
- Glossary

ACTUAL CONTENT STARTS HERE
====================
Analysis of market trends shows...

[50 more pages]

APPENDIX
========
[Technical specifications]
[Legal disclaimers]
[Contact information - repeated]

Extract semantic sections:

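def extract_semantic_content(document: str) -> dict[str, str]:
    # extract_section, remove_legal_disclaimers, and remove_navigation are
    # helpers assumed to be defined elsewhere.
    # Select only the content sections; headers, footers, and navigation
    # are never picked up.
    sections = {
        "executive_summary": extract_section(document, "EXECUTIVE SUMMARY"),
        "analysis": extract_section(document, "ANALYSIS"),
        "conclusions": extract_section(document, "CONCLUSIONS"),
    }

    # Remove boilerplate. Build a new dict: rebinding the loop variable
    # inside a for loop would leave `sections` unchanged.
    return {
        name: remove_navigation(remove_legal_disclaimers(text))
        for name, text in sections.items()
    }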

# Result: Only the actual content, ~20% of original size
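`extract_section` is left undefined above. One possible implementation, assuming sections are introduced by a heading line followed by a rule of `=` characters, as in the sample document. The regular expression and function body are an illustrative sketch, not the pattern author's code:

```python
import re

# A heading is a non-empty line followed by a line made only of "=" characters.
HEADING = re.compile(r"^(?P<title>[^\n=][^\n]*)\n=+[ \t]*$", re.MULTILINE)


def extract_section(document: str, title: str) -> str:
    """Return the body under `title`, up to the next heading or end of text."""
    headings = list(HEADING.finditer(document))
    for index, match in enumerate(headings):
        if match.group("title").strip().upper() == title.upper():
            start = match.end()
            if index + 1 < len(headings):
                end = headings[index + 1].start()
            else:
                end = len(document)
            return document[start:end].strip()
    return ""  # Section not present


sample = """EXECUTIVE SUMMARY
====================
The Q4 revenue increased by 15%...

APPENDIX
========
[Technical specifications]
"""

summary = extract_section(sample, "executive summary")
```

Returning an empty string for a missing section keeps the caller simple, but it also hides layout changes in the source documents, so logging the miss is worth considering.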

Architecture

graph LR
    A[Raw Data Source] --> B[Semantic Filter]
    B --> C[Clean Context]

    subgraph "Filter Layer"
        B --> D[Interactive Elements]
        B --> E[Relevant Fields]
        B --> F[Semantic Sections]
    end

    C --> G[LLM Processing]

    A -. "Noise removed" .-> B
    B -. "10-100x reduction" .-> C

    style C fill:#9f9,stroke:#333
    style A fill:#f99,stroke:#333

Key Benefits

Aspect	Raw Data	Semantic Filter	Improvement
Token count	10,000	100-1,000	10-100x reduction
LLM reasoning	Confused by noise	Focused on signal	Better decisions
Cost	High	Low	10-100x cheaper
Latency	Slow	Fast	2-5x faster
Accuracy	Prone to errors	More reliable	Higher success rate
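The figures in the table are order-of-magnitude estimates. To measure the reduction on your own data, here is a rough Python sketch that uses a four-characters-per-token heuristic (an assumption; use your model's tokenizer for real numbers). `estimate_tokens` and `reduction_ratio` are hypothetical helpers:

```python
import json


def estimate_tokens(text: str) -> int:
    """Crude estimate: about four characters per token (assumed heuristic)."""
    return max(1, len(text) // 4)


def reduction_ratio(raw: str, filtered: str) -> float:
    """How many times smaller the filtered context is than the raw one."""
    return estimate_tokens(raw) / estimate_tokens(filtered)


# Synthetic stand-ins for a noisy page and its filtered representation.
raw_html = "<div class='wrapper'><span>noise</span></div>" * 200
filtered = json.dumps([{"role": "button", "name": "Login"}])

ratio = reduction_ratio(raw_html, filtered)
print(f"{estimate_tokens(raw_html)} -> {estimate_tokens(filtered)} tokens ({ratio:.0f}x smaller)")
```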

How to use it

1. Identify Semantic Elements

For your domain, determine what actually matters:

// Web scraping: interactive elements only
const semanticElements = [
  'button', 'link', 'textbox', 'checkbox',
  'radio', 'combobox', 'slider'
];

// API responses: business data only
const relevantFields = [
  'name', 'email', 'status', 'amount'
];

// Documents: content sections only
const contentSections = [
  'executive_summary', 'analysis', 'conclusions'
];

2. Build Filter Layer

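class SemanticFilter {
  filter(data: any, domain: string): any {
    switch (domain) {
      case 'web':
        return this.filterAccessibilityTree(data);
      case 'api':
        return this.filterAPIResponse(data);
      case 'document':
        return this.filterDocumentSections(data);
      default:
        // Fail loudly instead of silently returning undefined
        throw new Error(`Unknown domain: ${domain}`);
    }
  }

  // `nodes` is a flat array of accessibility nodes, not an HTML string
  private filterAccessibilityTree(nodes: any[]): any[] {
    // Only visible, interactive elements with ARIA roles
    return nodes
      .filter(el => el.interactive)
      .filter(el => !el.isHidden)
      .filter(el => !this.isAdIframe(el))
      .map(el => ({
        role: el.role,
        name: el.name,
        xpath: el.xpath
      }));
  }

  // filterAPIResponse, filterDocumentSections, and isAdIframe are
  // omitted here for brevity.
}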
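The same dispatch idea can be sketched in Python with a registry: filters register themselves per domain, so adding a domain does not touch the dispatcher. All names here (`FILTERS`, `register`, `semantic_filter`) and the node keys are illustrative assumptions:

```python
from typing import Any, Callable

# Registry of domain name -> filter function (illustrative).
FILTERS: dict[str, Callable[[Any], Any]] = {}


def register(domain: str):
    """Decorator that records a filter function under a domain name."""
    def decorator(func):
        FILTERS[domain] = func
        return func
    return decorator


@register("web")
def filter_web(nodes: list[dict]) -> list[dict]:
    # Keep visible, interactive nodes and project them to role/name/xpath.
    return [
        {"role": n["role"], "name": n["name"], "xpath": n["xpath"]}
        for n in nodes
        if n.get("interactive") and not n.get("hidden")
    ]


@register("api")
def filter_api(response: dict) -> dict:
    # Drop top-level keys that are internal by convention.
    return {k: v for k, v in response.items() if not k.startswith("_")}


def semantic_filter(data: Any, domain: str) -> Any:
    """Dispatch to the registered filter; unknown domains fail loudly."""
    if domain not in FILTERS:
        raise ValueError(f"No filter registered for domain {domain!r}")
    return FILTERS[domain](data)
```

Raising on an unknown domain is a deliberate choice in this sketch: silently passing raw data through would defeat the purpose of the filter layer.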

3. Apply Before LLM Call

// Wrong: send raw data
const rawResponse = await llm.generate({
  prompt: `Analyze this page: ${rawHTML}`
});

// Right: filter first. The web filter expects accessibility nodes
// (here `accessibilityNodes`, an assumed variable), not the HTML string.
const filtered = semanticFilter.filter(accessibilityNodes, 'web');
const response = await llm.generate({
  prompt: `Analyze this page: ${JSON.stringify(filtered)}`
});
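A Python sketch of the same step, with a token budget check and the filtering hint listed under the mitigation strategies. `build_prompt`, the four-characters-per-token estimate, and the budget value are assumptions; the LLM call itself is left out:

```python
import json


def build_prompt(filtered: list[dict], task: str, max_tokens: int = 2000) -> str:
    """Serialize filtered context into a prompt, refusing oversized input."""
    context = json.dumps(filtered, separators=(",", ":"))  # compact JSON
    approx_tokens = len(context) // 4  # assumed rough heuristic
    if approx_tokens > max_tokens:
        raise ValueError(f"Filtered context still too large: ~{approx_tokens} tokens")
    return (
        f"{task}\n"
        "Note: the context below has been filtered for relevance.\n"
        f"Context: {context}"
    )


prompt = build_prompt(
    [{"role": "button", "name": "Login"}],
    task="Analyze this page and pick the element to click.",
)
```

The budget check catches the case where a filter regresses and starts letting noise through, before the oversized prompt reaches the model.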

4. Maintain Reference Mapping

Keep track of filtered-to-original mappings for execution:

interface FilteredElement {
  semanticId: string;    // "btn-1"
  originalRef: string;   // "frameIndex-backendNodeId"
  xpath: string;         // "/html/body/main/button"
}

// Filtered context uses semantic IDs
const filteredContext = [
  { id: "btn-1", name: "Login", role: "button" }
];

// Execution layer maps back to original references.
// mapToOriginal is a lookup you maintain (semantic ID -> FilteredElement).
const element = mapToOriginal(filteredContext[0].id);
// XPath selectors usually need an explicit prefix; `xpath=` is Playwright's.
await page.click(`xpath=${element.xpath}`);
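One way to implement the lookup, sketched in Python. The `ReferenceMap` class and its method names are hypothetical; the point is that the LLM only ever sees the short semantic ID, while the execution layer resolves it back to the original reference:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OriginalRef:
    original_ref: str  # e.g. "frameIndex-backendNodeId"
    xpath: str


class ReferenceMap:
    """Assigns short semantic IDs and resolves them back to original refs."""

    def __init__(self) -> None:
        self._refs: dict[str, OriginalRef] = {}
        self._counters: dict[str, int] = {}

    def add(self, role: str, original_ref: str, xpath: str) -> str:
        # IDs look like "button-1", "link-2"; they are only meaningful
        # within the page snapshot this map was built from.
        self._counters[role] = self._counters.get(role, 0) + 1
        semantic_id = f"{role}-{self._counters[role]}"
        self._refs[semantic_id] = OriginalRef(original_ref, xpath)
        return semantic_id

    def resolve(self, semantic_id: str) -> OriginalRef:
        # A KeyError here means the LLM referenced an ID it was never given.
        return self._refs[semantic_id]


refs = ReferenceMap()
login_id = refs.add("button", "0-42", "/html/body/main/button")
filtered_context = [{"id": login_id, "name": "Login", "role": "button"}]

target = refs.resolve(filtered_context[0]["id"])
```

Rebuilding the map on every snapshot avoids stale references after navigation or a DOM update.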

Trade-offs

Pros:

Dramatic token reduction: 10-100x smaller context
Better LLM reasoning: Focus on signal, not noise
Lower costs: Fewer tokens = cheaper
Faster inference: Smaller context = faster generation
Higher reliability: Less confusion and hallucination

Cons:

Filter complexity: Need to build and maintain filter logic
Information loss: May remove context that matters
Domain-specific: Filters need to be tailored per use case
Mapping overhead: Need to track filtered-to-original references
Potential bugs: A faulty filter can silently drop elements the task depends on

Edge cases to handle:

Hidden but content-rich: Accordions, tab panels, and collapsed content may be excluded from the accessibility tree
Dynamic content: AJAX-loaded content, infinite scroll, and lazy-loaded elements require wait/scroll strategies
Canvas/SVG: Charts and custom-rendered content may need OCR or fallback HTML

Mitigation strategies:

Start conservative: Filter obvious noise, include borderline cases
Add filter bypass for debugging
Monitor LLM performance: Loosen the filter to include more context if accuracy drops
Version filters alongside data schemas
Provide hints to LLM: "Context has been filtered for relevance"
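Several of these mitigations fit in a thin wrapper around the filter. A sketch in Python, assuming a `FILTER_BYPASS` environment variable and a `FILTER_VERSION` constant (both hypothetical names) for the debugging bypass and the filter versioning:

```python
import os
from typing import Any, Callable

FILTER_VERSION = "2024-01-v1"  # hypothetical; bump together with the data schema


def apply_filter(data: Any, filter_fn: Callable[[Any], Any]) -> dict[str, Any]:
    """Run filter_fn unless bypassed, and tag the result for later debugging."""
    bypass = os.environ.get("FILTER_BYPASS") == "1"
    return {
        "filter_version": FILTER_VERSION,
        "filtered": not bypass,
        # With the bypass on, the raw data passes through untouched.
        "context": data if bypass else filter_fn(data),
    }


def drop_internal(record: dict) -> dict:
    """Example filter: drop keys that are internal by convention."""
    return {k: v for k, v in record.items() if not k.startswith("_")}


result = apply_filter({"name": "Alice", "_debug": []}, drop_internal)
```

Logging `filter_version` and `filtered` next to each LLM response makes it possible to correlate accuracy drops with a specific filter release.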

Security note: Semantic extraction can also provide security benefits. By removing untrusted content after extracting safe intermediate representations, agents gain resistance to prompt injection (see: Context-Minimization Pattern).

References

HyperAgent GitHub Repository - Original accessibility tree implementation
Kohlschütter et al., "Boilerplate Detection using Shallow Text Features", WSDM 2010 - Foundational research showing 40-80% of web content is boilerplate
Beurer-Kellner et al., "Design Patterns for Securing LLM Agents", arXiv 2025 - Context-Minimization Pattern (security framework)
WAI-ARIA Accessibility Tree - Browser accessibility API
Related patterns: Context Window Anxiety Management, Curated Context Windows

Source: https://github.com/hyperbrowserai/HyperAgent