Jan Hilgard

How to Build a Self-Healing Web Scraper with LLMs

A technical deep-dive into building an autonomous, self-repairing HTML parser pipeline using three parallel LLM witnesses, arbitration, and real-time semantic verification.

June 13, 2026 · 22 min read

[Screenshot: the result card from a live extraction on data.hilgard.cz, showing overall confidence 1.00 (Verified), a per-field table where product_name, price, currency, in_stock and image_url each score 1.00 alongside the controls that passed (2 sources, JSON-LD, AI context, Vision), and the extracted product fields below.]
A real first-run on the live demo at data.hilgard.cz, parsing an Alza product page: Cloudflare's WAF detected and bypassed, every field cross-checked through several independent controls, and an overall confidence of 1.00. The same engine runs headless behind the production feeds.

The system, in one paragraph

You hand it a single URL — a product page, a real-estate listing — and it gives you back structured fields: name, price, currency, stock, barcode, image. The first time it sees a domain, it spends a minute studying the page: it reads the page three independent ways — the raw HTML, a clean markdown rendering, and a vision model looking at a screenshot — reconciles those three accounts field by field, then writes a small JavaScript parser for that site and repairs that parser in a loop until it reproduces the agreed-upon values exactly. It caches that parser per domain. Every later run skips the AI entirely — plain HTTP plus a DOM query, about a second — unless the cached parser's output fails validation, at which point the system notices the site has changed underneath it and re-learns from scratch. The whole thing is built around one rule: no silent errors. Every field comes back either verified, or null with an honest confidence score. It never returns a confidently-wrong value. You can watch it work, live, on a real URL of your choosing, at data.hilgard.cz.

That's the system. The rest of this post is about the parts that turned out to be more interesting than I expected — chiefly, that the hard problem in 2026 isn't extracting data from a page with an LLM. It's knowing whether the value you extracted is actually correct, and noticing when it quietly stops being correct three weeks later.


What this solves

Anyone who has shipped a scraper knows its real lifecycle. You write selectors against today's HTML, it works beautifully, you move on. Six weeks later the target site ships a redesign — a new wrapper div, a renamed price class, a JSON-LD block that now nests offers one level deeper — and your scraper doesn't crash. That would be merciful. It keeps running and returns the wrong number. A strikethrough original price instead of the sale price. A per-unit price instead of the pack price. A "from €X" range floor instead of the configured variant. The pipeline downstream trusts it, because why wouldn't it, and the bad value propagates into whatever the feed powers.

The traditional fix is a person: someone notices the numbers look off, opens the page, re-derives the selectors, ships a patch. This is slow, it's reactive, and it scales linearly with the number of sites you cover. It is also, in my experience, the single largest hidden cost in any data-feed product — not the initial parser, the gardening.

So the design goal here was never "use an LLM to read pages." LLMs read pages fine. The goal was to make wrongness loud. Every value the system emits has survived several independent checks that each had a real chance to reject it, and every value carries a confidence the consumer can act on. When the site changes, the cached parser's output stops passing those checks, and that failure is the trigger to re-learn — automatically, before a human notices. The "self-healing" in the title is just this: validation failure is wired directly to regeneration.

I want to be careful about the claim, though. This is not magic, and it is not free. It trades a large up-front cost — a minute of AI, a browser, a vision model — for a cheap, self-correcting steady state. For a handful of pages scraped once, that trade makes no sense; just write the selectors. The system earns its complexity only when you're covering many sites, for a long time, and the cost of a silently-wrong value is higher than the cost of the machinery that prevents it.


Why this exists

I build AI products that consume large amounts of public web data. The previous post in this series was about the layer underneath this one: a pool of LTE modems that puts requests on residential IPs so the pages load at all. This post is the layer on top — turning the raw HTML those requests retrieve into structured, trustworthy signal.

The two layers solve symmetric problems. The proxy pool exists because anti-bot systems are good at refusing datacenter IPs; the parser exists because the pages you finally reach are built for human eyes, not feeds, and they change without warning. Getting the bytes is half the job. Believing the bytes is the other half, and it's the half nobody budgets for.

The concrete first vertical is Czech e-commerce and real estate — grocery and electronics retailers, property portals — but nothing in the design is specific to those. The whole point is that the system learns each site rather than being told about it.


Three independent witnesses

The first real idea in the pipeline is to never trust a single reading of the page.

When a new page comes in, it's loaded once in a stealth browser and then read three completely independent ways, in parallel:

  • The HTML witness. The page is condensed — JSON-LD, meta tags, an inventory of the embedded JSON-state blobs a modern site hydrates from, and the visible body — and handed to a language model that extracts the fields as text.
  • The markdown witness. The same page is rendered down to clean markdown — the markup noise stripped away, leaving a human-readable linearization — and a language model reads that. Besides its own field values, this pass reports where it found each one ("price: shown as '2 990 Kč' next to 'Koupit'"), and those locator notes are kept to guide the parser later.
  • The vision witness. A screenshot of the rendered page is handed to a vision-language model, which reads the fields the way a person would: the price is the big number near the buy button, not the crossed-out one above it.

These three see different things and fail differently. The HTML model can be fooled by a price buried in a priceSpecification for a different variant; the markdown reading can lose a value that only lived in an attribute or a script blob; the vision model can't read a barcode that isn't drawn on screen, and can't read a URL off an image at all. The value of having three is precisely that their mistakes don't correlate — and that two of them can outvote the third.

Before any reading is allowed to count, it passes through what I think of as the system's conscience: an anti-hallucination grounding step. Every scalar a model returns must be traceable to the page source: the price digits must be present in the text or a JSON blob, the barcode digits must actually occur, and enough of the product-name tokens must be found. If a value can't be located in the source, it's dropped to null, no matter how confident the model sounded. A model that invents a plausible price is worse than useless; this step means an invented value that appears nowhere on the page cannot get through.
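The grounding rule above can be sketched in a few lines. This is a minimal illustration, not the production code: the function names, the whitespace normalization for prices, and the 0.6 token threshold are all assumptions made for the example.

```javascript
// Sketch of a grounding check. Function names and the 0.6 token threshold
// are assumptions for illustration, not the production values.

// Join digit groups split by whitespace so "2 990 Kč" can ground 2990.
function normalizeNumbers(source) {
  return source.replace(/(\d)\s(?=\d{3}(?!\d))/g, "$1");
}

// True if the digits occur in the source as a whole number,
// not as a fragment of a longer number.
function isNumberGrounded(value, source) {
  const needle = String(value).replace(/[^\d.]/g, "");
  if (needle === "") return false;
  const escaped = needle.replace(/\./g, "\\.");
  return new RegExp(`(?<!\\d)${escaped}(?!\\d)`).test(normalizeNumbers(source));
}

// True if enough of the name's tokens occur somewhere in the source.
function isNameGrounded(name, source, minFraction = 0.6) {
  const haystack = source.toLowerCase();
  const tokens = name
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter((t) => t.length > 1);
  if (tokens.length === 0) return false;
  const found = tokens.filter((t) => haystack.includes(t)).length;
  return found / tokens.length >= minFraction;
}

const GROUNDING_CHECKS = {
  price: isNumberGrounded,
  ean: isNumberGrounded,
  product_name: isNameGrounded,
};

// Returns a copy of the reading with every ungrounded value set to null.
// Fields without a check (in_stock, for example) pass through unchanged here.
function groundReading(reading, source) {
  const grounded = {};
  for (const [field, value] of Object.entries(reading)) {
    const check = GROUNDING_CHECKS[field];
    const keep = value == null || !check || check(value, source);
    grounded[field] = keep ? value : null;
  }
  return grounded;
}
```

Note that a check like this only proves the value exists somewhere in the source. It says nothing about whether it is the right occurrence, which is why the later checks still matter.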


Arbitration: reconciling the three accounts

Now you have three grounded witnesses that mostly agree and sometimes don't. Reconciling them, field by field, is the core of the whole system.

The easy cases are easy — it's a majority vote:

Situation | Outcome | Label
any value shared by ≥2 of the 3 witnesses | that value wins outright | confirmed
only one witness has a value at all | that value | single
all present witnesses disagree | → conflict resolution | (set by conflict resolution)
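As a rough sketch, the vote for one field might look like the following. The function name voteField, the "missing" label for a field no witness reported, and comparing values by their JSON form are assumptions for the example; a real implementation would presumably normalize values (whitespace, number formats) before comparing.

```javascript
// Majority vote for a single field across the three witness readings.
// `values` holds one entry per witness; null or undefined means "no value".
function voteField(values) {
  const present = values.filter((v) => v !== null && v !== undefined);
  if (present.length === 0) return { value: null, label: "missing" };

  // Count identical values, keyed by their JSON form.
  const counts = new Map();
  for (const v of present) {
    const key = JSON.stringify(v);
    counts.set(key, (counts.get(key) || 0) + 1);
  }

  // Any value shared by at least two witnesses wins outright.
  for (const [key, n] of counts) {
    if (n >= 2) return { value: JSON.parse(key), label: "confirmed" };
  }

  // A lone witness is accepted, but labeled as such.
  if (present.length === 1) return { value: present[0], label: "single" };

  // Everyone disagrees: hand the candidates to conflict resolution.
  return { value: null, label: "conflict", candidates: present };
}
```

For example, voteField([2990, 2990, 1990]) yields the confirmed value 2990, while voteField([2990, 1990, null]) returns a conflict carrying both candidates.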

The interesting case is disagreement, and the resolution order matters because it's a deliberate hierarchy of how much each kind of evidence deserves to be trusted:

  1. Structured re-read (deterministic, no AI). Go back to the page's own machine-readable declarations — JSON-LD, microdata, OpenGraph — and read the field canonically. If that canonical value matches one of the witnesses, that witness wins. There's a subtlety here I had to learn the hard way: for product_name, currency, ean, and in_stock, the site's structured data is authoritative — if JSON-LD says it's in stock, it's in stock, regardless of a stale "Skladem" badge in the markup. But for price, JSON-LD is not authoritative. Shops routinely list a regular price in structured data and then sell at a discount, or per variant, in the visible DOM. So for price, a grounded witness is allowed to beat the structured value.
  2. AI judge (only if structured data didn't decide). The model is shown the page and the conflicting candidates and asked to pick one — or none — with an explicit instruction to choose the canonical selling value: not a per-unit price, not a crossed-out original, not a different variant, not a related product. Its pick is labeled arbitrated. This is the kind of judgment that's genuinely ambiguous from markup alone, and it's exactly where a model with the whole page in context earns its place.
  3. Unresolved → ambiguous. If nothing resolves it, the field is kept for display but excluded from the set the parser is required to reproduce. This matters more than it looks: an uncertain value is never promoted to "ground truth," so it can never be enforced as the thing the generated parser must match. Doubt is preserved honestly rather than papered over.

Only the confirmed, single, and arbitrated fields become required — the validation target. The whole arbitration step exists to produce a set of values the system is willing to stake its reputation on, and to be honest about the ones it isn't.
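One possible reading of that hierarchy, as a sketch. Several things here are assumptions: the function and field names, the judge being passed in as a callback, the choice to let an authoritative structured value win even when no witness matched it, and the label given to a structured win (the text above does not name one, so this sketch reuses "confirmed").

```javascript
// Fields where the site's structured data is treated as authoritative.
// Price is deliberately absent.
const STRUCTURED_AUTHORITATIVE = new Set(["product_name", "currency", "ean", "in_stock"]);

// candidates: the disagreeing witness values for one field.
// structuredValue: the canonical re-read from JSON-LD/microdata/OpenGraph, or null.
// judge: hypothetical async callback wrapping the AI judge; returns a candidate or null.
async function resolveConflict(field, candidates, structuredValue, judge) {
  // 1. Structured re-read (deterministic, no AI).
  if (structuredValue != null) {
    const matchesWitness = candidates.includes(structuredValue);
    if (matchesWitness || STRUCTURED_AUTHORITATIVE.has(field)) {
      return { value: structuredValue, label: "confirmed", via: "structured" };
    }
    // For price, an unmatched structured value decides nothing: fall through.
  }

  // 2. AI judge picks one of the candidates, or none.
  const pick = await judge(field, candidates);
  if (pick != null && candidates.includes(pick)) {
    return { value: pick, label: "arbitrated", via: "judge" };
  }

  // 3. Unresolved: keep the candidates for display, never enforce them.
  return { value: null, label: "ambiguous", via: "none", candidates };
}

// Only these outcomes become part of the validation target.
const REQUIRED_LABELS = new Set(["confirmed", "single", "arbitrated"]);
const isRequired = (outcome) => REQUIRED_LABELS.has(outcome.label);
```

Checking that the judge's pick is actually one of the candidates keeps the judge from introducing a value that no grounded witness reported.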

(image_url is handled separately and never comes from any of the three readings — vision can't read a URL off a screenshot, and the text passes tend to grab a thumbnail or a placeholder. The image URL comes only from the parser's DOM read, and is validated after the loop by actually downloading it and asking the vision model "is this a real product photo, or a logo / banner / placeholder?")


Generating a parser, then repairing it

Here's the move that makes the steady state cheap: the system doesn't keep an LLM in the extraction path. It uses the LLM, once, to write a program that does the extraction, and then runs that program forever after.

After arbitration produces the required values, the model is asked to write a small JavaScript DOM parser for this page, with a strong preference order: JSON-LD first, then embedded JSON state, then microdata, then — only as a last resort — CSS selectors. (Selectors are the brittle option; a parser that reads the site's own hydration state is far more stable across redesigns than one that depends on a class name surviving.) That parser is injected into the page and run, and its output is compared against the required ground-truth values.
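To make the preference order concrete, here is a hypothetical example of what the first tier of a generated parser could look like. It is not output from the real system: the function name, the handling of @graph, and the returned shape are assumptions. It takes the document as an argument so it can run in the page or against a stub.

```javascript
// Tier 1 of a generated parser: read the product from JSON-LD if present.
// Returns null when no Product block is found, so the caller can fall back
// to embedded JSON state, then microdata, then CSS selectors.
function parseFromJsonLd(doc) {
  const scripts = doc.querySelectorAll('script[type="application/ld+json"]');
  for (const node of scripts) {
    let data;
    try {
      data = JSON.parse(node.textContent);
    } catch {
      continue; // malformed block: skip it rather than fail the whole parse
    }
    if (!data || typeof data !== "object") continue;

    // A block may be a single object, an array, or an @graph container.
    const items = Array.isArray(data) ? data : data["@graph"] || [data];
    const product = items.find((it) => it && it["@type"] === "Product");
    if (!product) continue;

    // offers may be a single Offer or a list; this sketch takes the first.
    const offer = [].concat(product.offers ?? [])[0] ?? {};
    return {
      product_name: product.name ?? null,
      price: offer.price != null ? Number(offer.price) : null,
      currency: offer.priceCurrency ?? null,
      in_stock: offer.availability ? /InStock$/.test(offer.availability) : null,
    };
  }
  return null;
}
```

Taking the first offer is exactly the kind of shortcut the validation against ground truth is there to catch: if the first offer is the wrong variant, the comparison fails and the repair loop has something specific to fix.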

If anything mismatches, it goes into a repair loop:

// Generate, then repair against ground truth, up to MAX_ITERATIONS attempts.
const MAX_ITERATIONS = 5;
let parser = await model.writeParser(page, requiredFields);
let converged = false;

for (let attempt = 1; attempt <= MAX_ITERATIONS; attempt++) {
  const result = await runInPage(page, parser);                 // inject via page.evaluate
  const diff = compare(result, groundTruth, requiredFields);

  if (diff.allMatch) {                                          // parser reproduces every required field
    converged = true;
    break;
  }
  if (attempt === MAX_ITERATIONS) break;                        // never keep a repair that was not re-tested

  // Feed the specific mismatches back, not a vague "try again":
  //   "expected price 2990, got 1990: you grabbed a monthly financing price;
  //    the selling price sits next to the buy button."
  parser = await model.repairParser(parser, diff, locatorHints);
}
// converged === false here means the run is reported as reliable: false.

The repair prompt is specific. The model gets its own previous code, the exact per-field diff ("expected X, got Y"), and locator hints from the witnesses about where the right value actually lives on the page. It rewrites only the logic for the fields that are wrong and leaves the correct ones alone. If the loop stalls — a couple of iterations with no progress — the system escalates to a stronger model for the markup it's struggling with, then comes back down.
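The per-field diff is the piece that makes the repair prompt specific, so it is worth sketching. The compare signature and the allMatch property follow the loop above; the mismatches array, the message format, and the whitespace-insensitive string comparison are assumptions for illustration.

```javascript
// Strings are compared with whitespace collapsed; everything else strictly.
function sameValue(a, b) {
  if (typeof a === "string" && typeof b === "string") {
    const norm = (s) => s.trim().replace(/\s+/g, " ");
    return norm(a) === norm(b);
  }
  return a === b;
}

// Compare the parser's output to ground truth, but only on required fields.
function compare(result, groundTruth, requiredFields) {
  const mismatches = [];
  for (const field of requiredFields) {
    const expected = groundTruth[field];
    const actual = result ? result[field] : undefined;
    if (!sameValue(expected, actual)) {
      mismatches.push({
        field,
        expected,
        actual,
        // Human-readable line that can be pasted into the repair prompt.
        message: `expected ${field} ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}`,
      });
    }
  }
  return { allMatch: mismatches.length === 0, mismatches };
}
```

Because only requiredFields are checked, an ambiguous field never shows up in the diff and so can never be enforced on the parser.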

The output of all this is not data. It's a parser, plus the metadata about how much to trust it, cached to parsers/<domain>.json. The expensive AI work has been spent once and frozen into a cheap, deterministic program.


Self-healing in production

The cache is where the "self-healing" actually happens, and it's almost anticlimactically simple — which is the point.

A later request for the same domain loads the cached parser and runs it. If the site's strategy is plain HTTP (more on that in a second), this is fast HTTP plus a DOM query in a lightweight parser — roughly a second, no browser, no AI.

But — and this is the crucial part, the one the whole SLA rests on — a parser that runs without throwing is not trusted to be right. This is the trap every scraper falls into: a parser that pulls the wrong number doesn't raise an error, it succeeds and lies — returns a stale price, a variant price, a strikethrough original — and the absence of an exception reads, falsely, as "all good." So a clean run is treated as a claim, not an answer. Every extraction — cache hits included — has to clear a fast validation pass before the value is allowed out: the deterministic canonical cross-check against the page's own structured data, the grounding, the barcode check digit, the AI context verification, the identity and locale checks. If a value can't pass, it does not ship — the run is marked not reliable and the parser is re-learned, even though it never errored.

The reason this can run on every record rather than only when something already looks wrong is that it's cheap — on the order of a hundred milliseconds, mostly deterministic, with the AI context check skipped entirely for any value that already equals the site's own structured data. That's the difference between a reliability SLA you can actually promise and one that's aspirational: the guarantee isn't "the parser didn't crash," it's "every field that shipped was independently re-verified on this run."

Here's the actual log from a cached run against an Alza product page — the whole thing took about a second, and you can see every guard still firing (the validated in 143ms line) before the value is allowed out:

▸ FAST PARSE · cached JS parser over plain HTTP (no browser, no AI)
📦 cache hit (got strategy) — parsers/www.alza.cz.json
▸ VALIDATE · every cache extraction is re-checked, never blindly trusted
   🔁 canonical cross-check vs JSON-LD/microdata: price ✓ · currency ✓ · in_stock ✓
   🔁 grounding: every value present in the fetched HTML
   🔎 validated in 143ms → OK
   🪪 identity gate: the page IS the product the URL asked for (no redirect/decoy)
   🌍 locale: currency CZK matches CZK expected for www.alza.cz
   📊 per-field confidence: product_name 1.00 · price 1.00 · currency 1.00 · in_stock 1.00 · image_url 1.00
⚡ fast path OK via got → name="AERIUM Dron R96X 4K Dual Camera GPS 3 baterie" price=3591 CZK stock=true (confidence 1.00)

So when the site redesigns and the cached parser starts pulling the wrong value out of a moved element, that value fails the canonical cross-check — it no longer equals what the page's JSON-LD declares — and the guard rejects it. That rejection is the self-heal trigger: the system logs that the cached parser has gone stale, falls all the way back to full generation, learns a fresh parser against the new layout, and overwrites the recipe. No human noticed. No wrong value shipped. The next request after that is fast again.

Failure-as-trigger is the whole mechanism. There's no separate "monitor the site for changes" system to maintain, because the validation that runs on every extraction is the change detector.
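The control flow is small enough to sketch in full. Everything here is hypothetical scaffolding: the function name and the injected cache, fetchPage, runParser, validate, and learn dependencies stand in for the real components, which are not shown in this post.

```javascript
// Validation failure wired directly to regeneration.
async function extractWithSelfHeal(url, deps) {
  const { cache, fetchPage, runParser, validate, learn } = deps;
  const domain = new URL(url).hostname;

  const cached = await cache.get(domain);
  if (cached) {
    let fields = null;
    try {
      const page = await fetchPage(url, cached.strategy);
      fields = await runParser(cached.parser, page);
      // A clean run is a claim, not an answer: validate before shipping.
      const verdict = await validate(fields, page);
      if (verdict.ok) return { fields, path: "cache" };
    } catch {
      // A crash is treated the same as a failed validation.
    }
    // Falling through means the cached parser is stale.
  }

  // Full generation: three witnesses, arbitration, parser, repair loop.
  const fresh = await learn(url);
  await cache.set(domain, { parser: fresh.parser, strategy: fresh.strategy });
  return { fields: fresh.fields, path: cached ? "relearned" : "learned" };
}
```

There is no branch in this sketch that returns unvalidated fields from the cache path, which is the property the contract depends on.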

The fetch-strategy probe

One more thing gets cached alongside the parser: how to fetch the page at all. After a page parses reliably the first time, the system probes a plain HTTP request through the proxy and records what happens:

  • blocked or challenged → an anti-bot vendor is detected (DataDome, Cloudflare, …) and the strategy is set to the stealth browser — the only thing that gets through;
  • served, and the cached parser reproduces the same data from the raw HTML → the strategy is plain HTTP — no browser, no AI, the ~1-second path;
  • served, but the data only appears after JavaScript runs → the strategy is the browser, but without the anti-bot overhead.

That decision is stored and drives the fast path on every later run. The system learns not just how to read each site but how cheaply it can afford to fetch it.
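The three probe outcomes reduce to a small decision function. In this sketch the shape of the probe object and the strategy names "stealth-browser" and "browser" are assumptions; "got" is the plain-HTTP strategy name that appears in the log and envelope in this post.

```javascript
// Map the result of a plain-HTTP probe to a cached fetch strategy.
// probe.blocked / probe.challenged: the request was refused or challenged.
// probe.vendor: detected anti-bot vendor, if any.
// probe.parserReproducesData: the cached parser gave the same data from raw HTML.
function chooseFetchStrategy(probe) {
  if (probe.blocked || probe.challenged) {
    return { fetch: "stealth-browser", vendor: probe.vendor ?? null };
  }
  if (probe.parserReproducesData) {
    return { fetch: "got", vendor: null }; // plain HTTP, no browser, no AI
  }
  // Served, but the data only exists after JavaScript runs.
  return { fetch: "browser", vendor: null };
}
```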


The contract: no silent errors

Everything above exists to back one promise to whoever consumes the feed. The production output is a deliberately small, stable envelope — the clean fields and nothing else:

{
  "success": true,          // produced a product (false on error / blocked)
  "reliable": true,         // parser matched ground truth, context-verified, identity ok
  "scraped_at": "2026-06-08T15:45:19.536Z",
  "duration_ms": 2754,
  "security": { "level": "none", "vendor": null, "fetch": "got" },
  "entity": "product",      // product | realestate | realestate_index
  "product": {
    "url": "…", "domain": "…",
    "product_name": "…", "price": 18990, "currency": "CZK",
    "in_stock": true, "ean": "6932554405373", "image_url": "…"
  }
}

All the internal telemetry — the per-field confidences, the arbitration log, the verification trace — is suppressed by default and only emitted under a single debug key when explicitly asked for. A production consumer gets zero noise. A field that couldn't be verified comes back null, never as a guess. And the reliable flag is a hard summary: it's only true when the parser reproduced the agreed ground truth, and every required field was context-verified on this run, and the page passed the identity gate.

Underneath, every field carries a 0–1 confidence built from independent positive proofs. The philosophy is that confidence is earned by passing checks rather than assumed and then deducted; the one exception is an active context-verification failure, which halves the score. A value that exactly equals the site's own JSON-LD is deterministic, machine-readable ground truth the site declared about itself, so it scores a full 1.0 outright. Otherwise the score starts from how the value was settled and accrues bonuses for each check it survives:

base by arbitration outcome     confirmed 0.90 · arbitrated 0.80 · single 0.60
                                 unverified 0.30 · ambiguous 0.20

+ 0.25  valid GTIN check digit (a real barcode, not a SKU)
+ 0.20  vision confirmed a real product photo
+ 0.10  AI context-box confirmed the value in its own DOM block
+ 0.05  corroborated elsewhere (e.g. EAN digits also in the image URL)
× 0.50  if context verification actively failed

canonical JSON-LD match  →  1.00  (overrides the above)

The overall score is a weighted mean that lets the fields that matter dominate — name and price weigh 3, stock and currency 2, barcode and image 1 — so a weak ancillary field (a barcode with no canonical cross-check, say) can't drag an otherwise fully-verified result down into "doubt" territory, while the per-field numbers still expose that weak field honestly to anyone who looks.
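Here is how the scoring could be computed from the numbers above. The base values, bonuses, and weights come from this post; the flag names, applying the × 0.50 penalty after the bonuses, and capping the result at 1.0 are assumptions of the sketch.

```javascript
const BASE = { confirmed: 0.9, arbitrated: 0.8, single: 0.6, unverified: 0.3, ambiguous: 0.2 };
const WEIGHTS = { product_name: 3, price: 3, in_stock: 2, currency: 2, ean: 1, image_url: 1 };

// f.label is the arbitration outcome; the boolean flags record which checks passed.
function fieldConfidence(f) {
  if (f.matchesJsonLd) return 1.0; // canonical match overrides everything else
  let score = BASE[f.label] ?? 0;
  if (f.validGtin) score += 0.25;
  if (f.visionConfirmedPhoto) score += 0.2;
  if (f.contextConfirmed) score += 0.1;
  if (f.corroborated) score += 0.05;
  if (f.contextFailed) score *= 0.5; // assumption: penalty applied after bonuses
  return Math.min(1, score); // assumption: scores are capped at 1.0
}

// Weighted mean over whichever fields are present.
function overallConfidence(perField) {
  let sum = 0;
  let total = 0;
  for (const [field, conf] of Object.entries(perField)) {
    const w = WEIGHTS[field] ?? 1;
    sum += w * conf;
    total += w;
  }
  return total === 0 ? 0 : sum / total;
}
```

As a worked case: if every field scores 1.00 except a barcode at 0.60, the overall score is (3 + 3 + 2 + 2 + 0.6 + 1) / 12 = 11.6 / 12, or about 0.97. The weak barcode barely moves the total while staying visible in the per-field numbers.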

A small detail I'm fond of: the barcode is run through the actual GTIN check-digit algorithm. A 13-digit string that looks like an EAN but fails its checksum is almost certainly a shop's internal product code, not a barcode, and it's rejected to null. It's a deterministic test that costs nothing and catches a whole category of plausible garbage.
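The check-digit test itself is short. This sketch uses the standard GTIN rule: starting from the digit next to the check digit and moving left, weights alternate 3, 1, 3, 1, and the check digit brings the weighted sum up to a multiple of 10. The function name is an assumption.

```javascript
// Validates GTIN-8, GTIN-12 (UPC-A), GTIN-13 (EAN-13), and GTIN-14 codes.
function isValidGtin(code) {
  if (!/^(\d{8}|\d{12}|\d{13}|\d{14})$/.test(code)) return false;

  const digits = [...code].map(Number);
  const check = digits.pop(); // the last digit is the check digit

  // Walk the remaining digits right to left with weights 3, 1, 3, 1, ...
  let sum = 0;
  digits.reverse().forEach((d, i) => {
    sum += d * (i % 2 === 0 ? 3 : 1);
  });

  return (10 - (sum % 10)) % 10 === check;
}
```

The EAN in the sample envelope, 6932554405373, passes this check: the weighted sum of its first twelve digits is 117, so the expected check digit is 3. Changing the last digit to 4 makes it fail.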


Beyond e-commerce: the same system, no per-site code

The cleanest validation of the design was pointing it at a completely different vertical — real-estate listings — and changing essentially nothing architecturally.

Property portals are, if anything, easier in one respect: many are built on modern frameworks that ship the entire listing as a hydration-state blob in the page, so the generated parser is mostly deterministic JSON traversal rather than fragile selector-hunting. But the rule I held to is the one that keeps the whole thing maintainable: no portal-specific code, ever. The system is never told "on this site, the price is in this field." It works by meaning and by value — the same three-witness-then-arbitrate flow it uses for products — and discovers each portal's shape itself. A listing has more fields than a product (transaction type, property kind, disposition, usable area, land area, locality, advert id), but the machinery reconciling and verifying them is identical.

That's the real test of whether you built a system or just a clever scraper for one site. If adding a vertical means writing per-site parsing rules, you built the scraper. If it means pointing the same pipeline at a new kind of page and letting it learn, you built the system. I'm reasonably happy this one is the latter — though I'd add that "no per-site code" is a discipline you have to keep choosing, because the temptation to just hardcode the one annoying portal is always there.


What I'd do differently, and where this stops making sense

A few honest limits, in the spirit of the LTE post.

The first run is genuinely expensive. A full generation — browser, screenshot, three model passes, arbitration, a repair loop — is tens of seconds, sometimes more on awkward markup. The architecture is explicitly a bet that you'll read each domain many times, so that cost amortizes to near zero. If your workload is "scrape these 50 URLs once," this is the wrong tool by a wide margin; the steady-state economics that justify it never kick in.

It's only as honest as its grounding corpus. The anti-hallucination check compares against the page source the system actually captured. If a page renders a value into a canvas, an image, or some exotic shadow-DOM arrangement that doesn't land in the corpus, a real value can get dropped to null. I'd rather the system under-claim than over-claim — a null is a known unknown, a wrong number is a hidden one — but it does mean the failure mode is "missing field," and that's a real cost on some pages.

The repair loop can fail to converge. Five iterations and an escalation isn't unlimited. On genuinely hostile or pathological markup the loop can exhaust itself without producing a parser that reproduces every required field, and the result comes back reliable: false. That's the contract working as designed — better an honest "not reliable" than a fabricated success — but it's not nothing, and on a few sites it happens more than I'd like.

Where it stops making sense: the same place every "learn it with AI" system does. If a site is small, stable, and you control the parsing budget, hand-written selectors are cheaper, faster, and easier to reason about. This system's value is strictly a function of churn — how often layouts change and how many of them you track. Below some threshold of scale and volatility, the machinery costs more than the gardening it replaces. Above it, the math flips hard, and the self-healing loop is the difference between a feed you maintain and a feed that maintains itself.


See it run

The code isn't open source, but the system is live and you can drive it yourself. At data.hilgard.cz there's a console: paste a Czech e-commerce or real-estate URL, and it streams the pipeline phase by phase — classify, ground truth, arbitrate, generate, repair, verify — and renders the final JSON with a security badge and a confidence chip on every field. It's the same engine described here, running headless behind the production feeds; the demo just exposes the phased log that's normally suppressed.

[Screenshot: the live demo on data.hilgard.cz mid-run, showing a URL input with an Alza product link, example chips for various sites, and the streaming progress console in the verification and quality-assessment phases.]
The live console mid-run on an Alza product page. Each phase streams as it happens; here the verification stage is pinning every extracted value to its own DOM block, so a coincidental match elsewhere on the page can't pass.

If you paste a URL from a site it has seen before, you'll get the ~1-second cache path. Paste something new and you'll watch it learn the page from scratch. Either way, the thing to watch is the confidence on each field — that number is the whole point of the system, and everything in this post exists to make it mean something.


This is the second in a series on production infrastructure for AI products. The first was the LTE residential-proxy pool that gets the requests through; this one is the parser that turns the retrieved HTML into trustworthy structured data. The thread connecting them is the same: the unglamorous, load-bearing infrastructure underneath an AI product is where most of the real engineering actually lives.

Tags: ai-products, scraping, parsing, llm, self-healing