How this scraper actually works.
One Vercel deploy, no separate backend. Every scrape runs as a Node.js serverless function with a 60-second budget. Below: the eleven steps between a URL and a typed table, plus the non-obvious engineering decisions and the gaps that are honestly disclosed.
The pipeline, step by step.
- 01
Rate limit
Upstash sliding window — 10 requests per IP per day. Cheap, stateless, no auth.
- 02
SSRF guard
Scheme allowlist (
http/https); literal block onlocalhost/metadata.google.internal; DNS resolve every host and reject any private / loopback / link-local IP — including169.254.169.254(cloud metadata). Re-checked on every redirect. - 03
Domain blocklist
Banking, government, social-login, and adult domains rejected up front. Defensive default.
- 04
Robots.txt
Fetched (through the same SSRF guard) and cached 24h. Disallow → 403, no exceptions.
- 05
Fetch HTML
8s timeout, 2 MB body cap, manual redirect (max 3 hops, SSRF re-checked each hop), custom User-Agent.
- 06
SPA detect
Heuristic — empty body + presence of
#root,#__next, or#app→ graceful error pointing to the headless-browser fallback (Playwright on Cloud Run, in the full client build). - 07
Clean HTML
Cheerio strips script, style, nav, footer, header, aside, inline event handlers, and inline styles. Listing grids stay intact — Readability is deliberately not used because it would strip them.
- 08
NL → field list
Fast path: comma-separated names parsed by regex, no LLM. Slow path: a short Gemini call. Hashed and cached in Upstash for 7 days — same input never burns tokens twice.
- 09
Build Gemini responseSchema
Hand-built OpenAPI subset:
{ type: 'array', items: { … nullable: true } }. Always an array, even for single-record pages, so the UI never branches on cardinality. Gemini rejects full JSON Schema (no$ref/anyOf, nullable must benullable: true) — sanitized before sending. - 10
Extract via Gemini 2.5 Flash
Structured-output JSON, temperature 0, 25s timeout. The system prompt forbids hallucination: missing field →
null. - 11
Validate + return
Type-coerce per field (numbers stripped of currency, booleans normalized), zod-validated as nullable, returned with token + size telemetry for the table footer.
Want this for your business?
This demo is a portfolio piece, but the architecture is the real client build — minus the headless tier and the pinned-IP agent, both wired but off by default. If you have a compliance-sane scraping problem and a target list, email me with what you’re extracting, from where, and how often. I’ll reply within 24 hours.