Skip to content

Semantic Context Filtering Pattern NEW

Nikola Balic (@nibzard)

Problem

Raw data sources are too verbose and noisy for effective LLM consumption. Full representations include invisible elements, implementation details, and irrelevant information that bloat context and confuse reasoning.

This creates several problems:

  • Token explosion: Raw data exceeds context limits or becomes prohibitively expensive
  • Poor signal-to-noise: LLM wastes reasoning capacity on irrelevant details
  • Slower inference: More tokens = slower generation and higher costs
  • Confused reasoning: Noise leads to hallucinations or wrong conclusions

The issue appears across domains:

  • Web scraping: Full HTML DOM includes scripts, styles, tracking iframes
  • API responses: JSON with nested metadata, internal fields, debug info
  • Document processing: Headers, footers, navigation, boilerplate text
  • Code analysis: Comments, whitespace, boilerplate code

Solution

Extract only the semantic, interactive, or relevant elements from raw data. Filter out noise and provide the LLM with a clean representation that captures what matters for reasoning.

Core Principle

Don't send raw data to the LLM. Send semantic abstractions.

Example 1: Browser Accessibility Tree

Instead of full HTML DOM:

<!-- Raw HTML (10,000+ tokens) -->
<html>
  <head>
    <script src="analytics.js"></script>
    <style>body { margin: 0; }</style>
  </head>
  <body>
    <div class="tracking-pixel" style="display:none"></div>
    <iframe src="ad-server.com"></iframe>
    <nav aria-label="Navigation">
      <a href="/">Home</a>
      <a href="/about">About</a>
    </nav>
    <main>
      <button id="login-button">Login</button>
      <input type="email" name="email" placeholder="Email" />
    </main>
    <footer>Copyright 2024</footer>
  </body>
</html>

Extract the accessibility tree (100-200 tokens):

{
  "interactiveElements": [
    {
      "role": "link",
      "name": "Home",
      "xpath": "/html/body/nav/a[1]"
    },
    {
      "role": "link",
      "name": "About",
      "xpath": "/html/body/nav/a[2]"
    },
    {
      "role": "button",
      "name": "Login",
      "id": "login-button",
      "xpath": "/html/body/main/button"
    },
    {
      "role": "textbox",
      "name": "Email",
      "id": "email",
      "xpath": "/html/body/main/input"
    }
  ]
}

Implementation:

// Use browser's built-in accessibility tree
const tree = await page.accessibility.snapshot({
  interestingOnly: true  // Only interactive elements
});

// Automatically filters:
// - Elements with aria-hidden="true"
// - Elements with display:none
// - Ad/tracking iframes by domain
// - Non-semantic divs and spans

Example 2: API Response Filtering

Raw API responses often include internal metadata:

// Raw API response (2,000 tokens)
{
  "data": {
    "users": [
      {
        "id": "123",
        "name": "Alice",
        "email": "alice@example.com",
        "internalFlags": ["vip", "beta_tester"],
        "metadata": {
          "created_at": "2024-01-01",
          "updated_at": "2024-01-15",
          "version": 42
        }
      }
    ]
  },
  "_internal": {
    "requestId": "req-abc123",
    "latency": 45,
    "cache": "HIT",
    "debug": []
  },
  "_links": {
    "self": "/users",
    "next": "/users?page=2"
  }
}

Filter to semantic fields only:

function filterAPIResponse(response: any, schema: FieldSchema): any {
  const filtered = {};

  for (const field of schema.relevantFields) {
    if (response.data?.[field]) {
      filtered[field] = response.data[field];
    }
  }

  return filtered;
}

// Result (200 tokens):
{
  "users": [
    {
      "name": "Alice",
      "email": "alice@example.com"
    }
  ]
}

Example 3: Document Section Extraction

Full documents include boilerplate:

FULL DOCUMENT:
====================
COMPANY CONFIDENTIAL [Header repeated on every page]
Copyright 2024 Acme Corp. All rights reserved.

[Legal disclaimer spanning 3 pages]

EXECUTIVE SUMMARY
====================
The Q4 revenue increased by 15%...

[Navigation menu]
- Table of Contents
- Index
- Glossary

ACTUAL CONTENT STARTS HERE
====================
Analysis of market trends shows...

[50 more pages]

APPENDIX
========
[Technical specifications]
[Legal disclaimers]
[Contact information - repeated]

Extract semantic sections:

def extract_semantic_content(document: str) -> dict:
    # Skip headers, footers, navigation
    sections = {
        "executive_summary": extract_section(document, "EXECUTIVE SUMMARY"),
        "analysis": extract_section(document, "ANALYSIS"),
        "conclusions": extract_section(document, "CONCLUSIONS"),
    }

    # Remove boilerplate
    for section in sections.values():
        section = remove_legal_disclaimers(section)
        section = remove_navigation(section)

    return sections

# Result: Only the actual content, ~20% of original size

Architecture

graph LR A[Raw Data Source] --> B[Semantic Filter] B --> C[Clean Context] subgraph "Filter Layer" B --> D[Interactive Elements] B --> E[Relevant Fields] B --> F[Semantic Sections] end C --> G[LLM Processing] A -. "Noise removed" .-> B B -. "10-100x reduction" .-> C style C fill:#9f9,stroke:#333 style A fill:#f99,stroke:#333

Key Benefits

Aspect Raw Data Semantic Filter Improvement
Token count 10,000 100-1,000 10-100x reduction
LLM reasoning Confused by noise Focused on signal Better decisions
Cost High Low 10-100x cheaper
Latency Slow Fast 2-5x faster
Accuracy Prone to errors More reliable Higher success rate

How to use it

1. Identify Semantic Elements

For your domain, determine what actually matters:

// Web scraping: interactive elements only
const semanticElements = [
  'button', 'link', 'textbox', 'checkbox',
  'radio', 'combobox', 'slider'
];

// API responses: business data only
const relevantFields = [
  'name', 'email', 'status', 'amount'
];

// Documents: content sections only
const contentSections = [
  'executive_summary', 'analysis', 'conclusions'
];

2. Build Filter Layer

class SemanticFilter {
  filter(data: any, domain: string): any {
    switch (domain) {
      case 'web':
        return this.filterAccessibilityTree(data);
      case 'api':
        return this.filterAPIResponse(data);
      case 'document':
        return this.filterDocumentSections(data);
    }
  }

  private filterAccessibilityTree(dom: any): any {
    // Only interactive elements with ARIA roles
    return dom
      .filter(el => el.interactive)
      .filter(el => !el.isHidden)
      .filter(el => !this.isAdIframe(el))
      .map(el => ({
        role: el.role,
        name: el.name,
        xpath: el.xpath
      }));
  }
}

3. Apply Before LLM Call

// Wrong: Send raw data
const response = await llm.generate({
  prompt: `Analyze this page: ${rawHTML}`
});

// Right: Filter first
const filtered = semanticFilter.filter(rawHTML, 'web');
const response = await llm.generate({
  prompt: `Analyze this page: ${JSON.stringify(filtered)}`
});

4. Maintain Reference Mapping

Keep track of filtered-to-original mappings for execution:

interface FilteredElement {
  semanticId: string;    // "login-button"
  originalRef: string;   // "frameIndex-backendNodeId"
  xpath: string;         // "/html/body/main/button"
}

// Filtered context uses semantic IDs
const filteredContext = [
  { id: "btn-1", name: "Login", role: "button" }
];

// Execution layer maps back to original references
const element = mapToOriginal(filteredContext[0].id);
await page.click(element.xpath);

Trade-offs

Pros:

  • Dramatic token reduction: 10-100x smaller context
  • Better LLM reasoning: Focus on signal, not noise
  • Lower costs: Fewer tokens = cheaper
  • Faster inference: Smaller context = faster generation
  • Higher reliability: Less confusion and hallucination

Cons:

  • Filter complexity: Need to build and maintain filter logic
  • Information loss: May remove context that matters
  • Domain-specific: Filters need to be tailored per use case
  • Mapping overhead: Need to track filtered-to-original references
  • Potential bugs: Filter might remove important elements

Mitigation strategies:

  • Start conservative: Filter obvious noise, include borderline cases
  • Add filter bypass for debugging
  • Monitor LLM performance: Expand filter if accuracy drops
  • Version filters alongside data schemas
  • Provide hints to LLM: "Context has been filtered for relevance"

References