
Core · Service · Active scaffold

platform-scraper-service

Core service-to-service crawler runtime that accepts crawl requests, renders pages with Playwright, stores crawl state in Redis, and delivers results through callbacks.

  • TypeScript
  • NestJS 11
  • BullMQ
  • Redis
  • Playwright
  • @platform/contracts-scraper

Spec sheet

  • Boundary: Core / Scraping
  • Runtime: NestJS 11 HTTP service plus BullMQ processor
  • Default port: 3800 in the local stack, 3700 standalone
  • Proxy host: http://scraper.cs.lvh.me:8080
  • Queue/state: BullMQ and Redis
  • Security: internal crawl API guarded by x-platform-internal-token

Responsibilities

  • Accept internal crawl creation requests with callback configuration.
  • Render pages through a browser runtime and extract title, text and links.
  • Follow only same-origin URLs under the root path prefix.
  • Persist crawl lifecycle state from QUEUED/RUNNING to terminal statuses.
  • Deliver completion or failure payloads to service-to-service callbacks.
  • Protect the platform from unsafe local, private and reserved network targets.
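
The lifecycle in the list above can be sketched as a small state machine. This is an illustrative model only: the document names QUEUED and RUNNING, so the terminal statuses COMPLETED and FAILED are assumptions, not the service's actual contract.

```typescript
// Hypothetical crawl lifecycle model. Only QUEUED and RUNNING appear in this
// document; COMPLETED and FAILED are assumed terminal statuses.
type CrawlStatus = "QUEUED" | "RUNNING" | "COMPLETED" | "FAILED";

// Allowed transitions out of each status; empty list means terminal.
const TRANSITIONS: Record<CrawlStatus, CrawlStatus[]> = {
  QUEUED: ["RUNNING", "FAILED"],
  RUNNING: ["COMPLETED", "FAILED"],
  COMPLETED: [],
  FAILED: [],
};

function canTransition(from: CrawlStatus, to: CrawlStatus): boolean {
  return TRANSITIONS[from].includes(to);
}

function isTerminal(status: CrawlStatus): boolean {
  return TRANSITIONS[status].length === 0;
}
```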

Interfaces and contract surface

  • GET /health
  • POST /internal/scraper/crawls
  • GET /internal/scraper/crawls/:crawlId

Consumers

Dependencies and external touchpoints

Notes

  • Frontend applications must not call this service directly.
  • The crawler blocks localhost, private networks and reserved ranges unless a local test explicitly opts in.
  • The callback receiver is responsible for owning product-specific extraction and persistence.
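
Since the callback receiver owns product-specific extraction and persistence, it might be structured like the sketch below. The x-platform-internal-token header name comes from this document; the payload shape and helper names are assumptions for illustration.

```typescript
// Sketch of a callback receiver. The payload fields below are assumed,
// not taken from the service contract.
interface CrawlCallbackPayload {
  crawlId: string;
  status: string;
  pages?: Array<{ url: string; title?: string; text?: string }>;
}

// Pure check so the auth decision is easy to test in isolation.
function isAuthorized(
  headers: Record<string, string | undefined>,
  expectedToken: string,
): boolean {
  return headers["x-platform-internal-token"] === expectedToken;
}

// Product-specific extraction and persistence would live in the happy path.
function handleCallback(
  headers: Record<string, string | undefined>,
  payload: CrawlCallbackPayload,
  expectedToken: string,
): { statusCode: number } {
  if (!isAuthorized(headers, expectedToken)) return { statusCode: 401 };
  // persist payload.pages, update product state, etc.
  return { statusCode: 204 };
}
```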

Source references

  • platform-scraper-service/README.md
  • platform-scraper-service/src/scraper
  • platform-scraper-service/package.json

Quick integration

platform-scraper-service is a service-to-service surface. Frontends must not call it directly.

Useful variables

```bash
export SCRAPER_URL=http://127.0.0.1:3800
export INTERNAL_TOKEN=platform-local-stack-internal-token
```

Health

```bash
curl -sS "$SCRAPER_URL/health"
```

Available providers

```bash
curl -sS "$SCRAPER_URL/internal/scraper/crawls/providers" \
  -H "x-platform-internal-token: $INTERNAL_TOKEN"
```

The default provider is playwright. To use Firecrawl, configure FIRECRAWL_API_KEY and pass "providerKey": "firecrawl" in the crawl request. FIRECRAWL_BASE_URL, FIRECRAWL_POLL_ATTEMPTS, and FIRECRAWL_POLL_INTERVAL_MS let you point at a self-hosted endpoint or tune the polling behaviour.
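
Provider selection is per request. A small builder for the request body can make that explicit; this is a sketch whose field names mirror the curl example in this document, with illustrative defaults and a hypothetical env-var guard.

```typescript
// Sketch of a crawl request builder. Field names mirror the curl example
// in this document; the FIRECRAWL_API_KEY guard is an illustrative check.
type ProviderKey = "playwright" | "firecrawl";

interface CrawlRequest {
  startUrl: string;
  providerKey: ProviderKey;
  callback: { url: string };
  maxDepth: number;
  maxPages: number;
  includeHtml: boolean;
  textMaxCharsPerPage: number;
}

function buildCrawlRequest(
  startUrl: string,
  callbackUrl: string,
  providerKey: ProviderKey = "playwright",
): CrawlRequest {
  // firecrawl is only usable when FIRECRAWL_API_KEY is configured.
  if (providerKey === "firecrawl" && !process.env.FIRECRAWL_API_KEY) {
    throw new Error("firecrawl requires FIRECRAWL_API_KEY to be configured");
  }
  return {
    startUrl,
    providerKey,
    callback: { url: callbackUrl },
    maxDepth: 0,
    maxPages: 1,
    includeHtml: false,
    textMaxCharsPerPage: 2000,
  };
}
```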

Create a crawl

```bash
curl -sS -X POST "$SCRAPER_URL/internal/scraper/crawls" \
  -H "Content-Type: application/json" \
  -H "x-platform-internal-token: $INTERNAL_TOKEN" \
  -d '{
    "startUrl": "https://example.com/",
    "providerKey": "playwright",
    "callback": {
      "url": "https://example.com/platform-scraper-callback"
    },
    "maxDepth": 0,
    "maxPages": 1,
    "includeHtml": false,
    "textMaxCharsPerPage": 2000
  }'
```

The response contains crawlId, providerKey, and status: "QUEUED".

Read crawl status

```bash
curl -sS "$SCRAPER_URL/internal/scraper/crawls/<CRAWL_ID>" \
  -H "x-platform-internal-token: $INTERNAL_TOKEN"
```
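
A consumer can poll this status endpoint until the crawl settles, in addition to (or instead of) waiting for the callback. Below is a sketch with an injectable status function so the loop runs without a live service; the endpoint and header come from this document, while the terminal status names are assumptions.

```typescript
// Polls GET /internal/scraper/crawls/:crawlId until a terminal status.
// Terminal status names are assumed; `getStatus` is injected so the loop
// can be exercised without a live service.
const TERMINAL = new Set(["COMPLETED", "FAILED"]);

async function waitForCrawl(
  crawlId: string,
  getStatus: (crawlId: string) => Promise<{ status: string }>,
  intervalMs = 1000,
  maxAttempts = 30,
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { status } = await getStatus(crawlId);
    if (TERMINAL.has(status)) return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`crawl ${crawlId} did not finish within ${maxAttempts} attempts`);
}
```

A real getStatus would wrap a fetch of `$SCRAPER_URL/internal/scraper/crawls/<CRAWL_ID>` with the x-platform-internal-token header, as in the curl example above.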

Operational notes

  • The crawler accepts only http and https URLs.
  • The provider registry always exposes playwright; firecrawl is available only when FIRECRAWL_API_KEY is configured.
  • Localhost, private networks, and reserved ranges are blocked by the security defaults.
  • The Playwright runtime uses a Chromium browser profile configurable via SCRAPER_BROWSER_USER_AGENT, SCRAPER_BROWSER_LOCALE, SCRAPER_BROWSER_TIMEZONE_ID, SCRAPER_BROWSER_ACCEPT_LANGUAGE, SCRAPER_BROWSER_VIEWPORT_WIDTH, SCRAPER_BROWSER_VIEWPORT_HEIGHT, and SCRAPER_BROWSER_HEADLESS.
  • A 403 or 429 on a crawled page indicates the target site is blocking the crawl; the service must report it in the result without introducing bypasses or target-domain-specific logic.
  • The result is delivered to the configured callback with x-platform-internal-token.
  • Locally the service uses Redis and Playwright/Chromium.
