Tai Huynh
Engineering notes← back to demo

How this scraper actually works.

One Vercel deploy, no separate backend. Every scrape runs as a Node.js serverless function with a 60-second budget. Below: the eleven steps between a URL and a typed table, plus the non-obvious engineering decisions and the gaps that are honestly disclosed.

The pipeline, step by step.

  1. 01

    Rate limit

    Upstash sliding window — 10 requests per IP per day. Cheap, stateless, no auth.

  2. 02

    SSRF guard

    Scheme allowlist (http / https); literal block on localhost / metadata.google.internal; DNS resolve every host and reject any private / loopback / link-local IP — including 169.254.169.254 (cloud metadata). Re-checked on every redirect.

  3. 03

    Domain blocklist

    Banking, government, social-login, and adult domains rejected up front. Defensive default.

  4. 04

    Robots.txt

    Fetched (through the same SSRF guard) and cached 24h. Disallow → 403, no exceptions.

  5. 05

    Fetch HTML

    8s timeout, 2 MB body cap, manual redirect (max 3 hops, SSRF re-checked each hop), custom User-Agent.

  6. 06

    SPA detect

    Heuristic — empty body + presence of #root, #__next, or #app → graceful error pointing to the headless-browser fallback (Playwright on Cloud Run, in the full client build).

  7. 07

    Clean HTML

    Cheerio strips script, style, nav, footer, header, aside, inline event handlers, and inline styles. Listing grids stay intact — Readability is deliberately not used because it would strip them.

  8. 08

    NL → field list

    Fast path: comma-separated names parsed by regex, no LLM. Slow path: a short Gemini call. Hashed and cached in Upstash for 7 days — same input never burns tokens twice.

  9. 09

    Build Gemini responseSchema

    Hand-built OpenAPI subset: { type: 'array', items: { … nullable: true } }. Always an array, even for single-record pages, so the UI never branches on cardinality. Gemini rejects full JSON Schema (no $ref / anyOf, nullable must be nullable: true) — sanitized before sending.

  10. 10

    Extract via Gemini 2.5 Flash

    Structured-output JSON, temperature 0, 25s timeout. The system prompt forbids hallucination: missing field → null.

  11. 11

    Validate + return

    Type-coerce per field (numbers stripped of currency, booleans normalized), zod-validated as nullable, returned with token + size telemetry for the table footer.

Next step

Want this for your business?

This demo is a portfolio piece, but the architecture is the real client build — minus the headless tier and the pinned-IP agent, both wired but off by default. If you have a compliance-sane scraping problem and a target list, email me with what you’re extracting, from where, and how often. I’ll reply within 24 hours.