Core · Service · Active scaffold

platform-scraper-service

Core service-to-service crawler runtime that accepts crawl requests, renders pages with Playwright, stores crawl state in Redis and delivers results through callbacks.

  • TypeScript
  • NestJS 11
  • BullMQ
  • Redis
  • Playwright
  • @platform/contracts-scraper

Spec sheet

Boundary

Core / Scraping

Runtime

NestJS 11 HTTP service plus BullMQ processor

Default port

3800 in local stack, 3700 standalone

Proxy host

http://scraper.cs.lvh.me:8080

Queue/state

BullMQ and Redis

Security

Internal crawl API guarded by x-platform-internal-token

Responsibilities

  • Accept internal crawl creation requests with callback configuration.
  • Render pages through a browser runtime and extract title, text and links.
  • Follow only same-origin URLs under the root path prefix.
  • Persist crawl lifecycle state from QUEUED/RUNNING to terminal statuses.
  • Deliver completion or failure payloads to service-to-service callbacks.
  • Protect the platform from unsafe local, private and reserved network targets.
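The same-origin and path-prefix rule in the list above can be sketched as a small predicate. This is only an illustration, not the service's actual implementation: `CrawlRoot` mirrors the `root` object (`origin`, `pathPrefix`) that appears in crawl payloads, while `isInScope` and the prefix boundary handling are assumptions.

```typescript
// Hypothetical shape: mirrors the `root` object ({ origin, pathPrefix }) seen in crawl payloads.
interface CrawlRoot {
  origin: string;
  pathPrefix: string;
}

// Returns true when `candidate` is an absolute http(s) URL on the root origin
// whose path sits under the root path prefix.
function isInScope(candidate: string, root: CrawlRoot): boolean {
  let url: URL;
  try {
    url = new URL(candidate);
  } catch {
    return false; // not an absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  if (url.origin !== root.origin) return false;
  // Assumption: the prefix matches on a path boundary, so "/docs" does not match "/docs-old".
  const prefix = root.pathPrefix.endsWith("/") ? root.pathPrefix : root.pathPrefix + "/";
  return url.pathname === root.pathPrefix || url.pathname.startsWith(prefix);
}
```

With a root of `{ origin: "https://example.com", pathPrefix: "/" }`, every http(s) URL on that origin is in scope, which matches the `maxDepth`/`maxPages` style of limiting used in the crawl request.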

Interfaces and contract surface

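  • GET /health
  • GET /internal/scraper/providers
  • GET /internal/scraper/providers/:providerKey
  • PUT /internal/scraper/providers/:providerKey
  • GET /internal/scraper/crawls/providers (legacy alias of GET /internal/scraper/providers)
  • POST /internal/scraper/crawls
  • GET /internal/scraper/crawls/:crawlId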

Consumers

Dependencies and external touchpoints

Notes

  • Frontend applications must not call this service directly.
  • The crawler blocks localhost, private networks and reserved ranges unless a local test explicitly opts in.
  • The callback receiver is responsible for owning product-specific extraction and persistence.
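To make the blocking of localhost, private networks and reserved ranges more concrete, here is a minimal sketch of a hostname check. It is an assumption about how such a guard could look, not the service's code: `isUnsafeHost` is a hypothetical name, and the sketch only covers `localhost` and IPv4 literals. A production guard would also need DNS resolution and IPv6 handling.

```typescript
// Hypothetical sketch: classify a hostname as an unsafe crawl target.
// Covers "localhost" and IPv4 literals only; DNS resolution and IPv6 are not handled here.
function isUnsafeHost(hostname: string): boolean {
  const host = hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return true;

  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!match) return false; // not an IPv4 literal

  const octets = match.slice(1).map(Number);
  if (octets.some((o) => o > 255)) return false; // not a valid IPv4 address
  const a = octets[0];
  const b = octets[1];

  return (
    a === 0 || // "this network" 0.0.0.0/8
    a === 10 || // private 10.0.0.0/8
    a === 127 || // loopback 127.0.0.0/8
    (a === 100 && b >= 64 && b <= 127) || // shared address space 100.64.0.0/10
    (a === 169 && b === 254) || // link-local 169.254.0.0/16
    (a === 172 && b >= 16 && b <= 31) || // private 172.16.0.0/12
    (a === 192 && b === 168) || // private 192.168.0.0/16
    a >= 224 // multicast and reserved 224.0.0.0/3
  );
}
```

A local test that opts in to private targets would simply skip this check; how the service exposes that opt-in is not described here.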

Source references

  • platform-scraper-service/README.md
  • platform-scraper-service/src/scraper
  • platform-scraper-service/package.json

Quick integration

platform-scraper-service is a service-to-service surface. Frontends must not call it directly.

Useful variables

bash
export SCRAPER_URL=http://127.0.0.1:3800
export INTERNAL_TOKEN=platform-local-stack-internal-token

Health

bash
curl -sS "$SCRAPER_URL/health"

Available providers

bash
curl -sS "$SCRAPER_URL/internal/scraper/providers" \
  -H "x-platform-internal-token: $INTERNAL_TOKEN"

The default provider is playwright. The available providers are playwright, crawlee and firecrawl. Runtime configuration is persisted in Postgres through SCRAPER_DATABASE_URL. The crawlee provider uses Crawlee with a Playwright runtime, in-memory storage scoped to each crawl, and the same safety, scope and robots checks as the internal provider. To use Firecrawl, set FIRECRAWL_API_KEY at bootstrap or save it from the status console, then pass "providerKey": "firecrawl" in the crawl request. FIRECRAWL_BASE_URL, FIRECRAWL_POLL_ATTEMPTS and FIRECRAWL_POLL_INTERVAL_MS let you point at a self-hosted endpoint or tune polling.
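A caller may want to check the provider list before requesting a specific provider. The sketch below assumes the list payload shape shown in the endpoint reference (`defaultProviderKey` plus `items` with `key`, `enabled`, `configured`, `status` and `default`). The helper name `resolveProviderKey` and the fallback rule are assumptions for illustration.

```typescript
// Shapes taken from the representative provider list response in this document.
interface ProviderItem {
  key: string;
  label: string;
  enabled: boolean;
  configured: boolean;
  status: string;
  default: boolean;
}

interface ProviderList {
  defaultProviderKey: string;
  items: ProviderItem[];
}

// Hypothetical helper: use the preferred provider when it is enabled and available,
// otherwise fall back to the default reported by the service.
function resolveProviderKey(list: ProviderList, preferred?: string): string {
  const usable = (key: string): boolean =>
    list.items.some((p) => p.key === key && p.enabled && p.status === "available");
  if (preferred !== undefined && usable(preferred)) return preferred;
  return list.defaultProviderKey;
}
```

With the representative response, asking for `firecrawl` while it is still `unconfigured` falls back to `playwright`, which avoids sending a crawl request the service would reject.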

Create a crawl

bash
curl -sS -X POST "$SCRAPER_URL/internal/scraper/crawls" \
  -H "Content-Type: application/json" \
  -H "x-platform-internal-token: $INTERNAL_TOKEN" \
  -d '{
    "startUrl": "https://example.com/",
    "providerKey": "playwright",
    "callback": {
      "url": "https://example.com/platform-scraper-callback"
    },
    "maxDepth": 0,
    "maxPages": 1,
    "includeHtml": false,
    "textMaxCharsPerPage": 2000
  }'

The response contains crawlId, providerKey and status: "QUEUED".
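For a backend caller written in TypeScript, the same request can be assembled as plain data before it is handed to an HTTP client. This is a sketch under assumptions: the body fields follow the request fields listed in this document, while `CreateCrawlBody`, `HttpRequestSpec` and `buildCreateCrawlRequest` are hypothetical names.

```typescript
// Request body fields as listed for POST /internal/scraper/crawls in this document.
interface CreateCrawlBody {
  startUrl: string;
  callback: { url: string };
  providerKey?: string;
  maxDepth?: number;
  maxPages?: number;
  concurrency?: number;
  includeHtml?: boolean;
  textMaxCharsPerPage?: number;
}

// Hypothetical, client-agnostic description of the HTTP call.
interface HttpRequestSpec {
  url: string;
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Builds the crawl creation request without sending it.
function buildCreateCrawlRequest(baseUrl: string, token: string, body: CreateCrawlBody): HttpRequestSpec {
  return {
    url: baseUrl.replace(/\/+$/, "") + "/internal/scraper/crawls",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-platform-internal-token": token,
    },
    body: JSON.stringify(body),
  };
}
```

The returned object can be passed to any HTTP client, for example `fetch(spec.url, { method: spec.method, headers: spec.headers, body: spec.body })`. Keeping the builder pure makes it easy to unit test without a running scraper.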

Read crawl status

bash
curl -sS "$SCRAPER_URL/internal/scraper/crawls/<CRAWL_ID>" \
  -H "x-platform-internal-token: $INTERNAL_TOKEN"
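Results normally arrive through the callback, but a caller can also poll the status route as a fallback. The sketch below assumes that QUEUED and RUNNING are the only non-terminal statuses, as the lifecycle description suggests. `waitForCrawl` and its options are hypothetical, and the status lookup is injected so the loop does not depend on a particular HTTP client.

```typescript
// Assumption: QUEUED and RUNNING are the only non-terminal statuses.
function isTerminalStatus(status: string): boolean {
  return status !== "QUEUED" && status !== "RUNNING";
}

// Polls `getStatus` until the crawl reaches a terminal status or attempts run out.
// `getStatus` would typically call GET /internal/scraper/crawls/:crawlId and return `status`.
async function waitForCrawl(
  getStatus: () => Promise<string>,
  options: { attempts: number; intervalMs: number },
): Promise<string> {
  for (let attempt = 0; attempt < options.attempts; attempt++) {
    const status = await getStatus();
    if (isTerminalStatus(status)) return status;
    if (attempt < options.attempts - 1) {
      // Wait before the next poll; skipped after the final attempt.
      await new Promise<void>((resolve) => setTimeout(resolve, options.intervalMs));
    }
  }
  throw new Error("crawl did not reach a terminal status in time");
}
```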

Operational notes

  • The crawler accepts only http and https URLs.
  • The provider registry exposes playwright, crawlee and firecrawl; the enabled state, the default and the Firecrawl credentials are persisted in Postgres.
  • Localhost, private networks and reserved ranges are blocked by the security defaults.
  • The Playwright runtime uses a Chromium browser profile configurable through SCRAPER_BROWSER_USER_AGENT, SCRAPER_BROWSER_LOCALE, SCRAPER_BROWSER_TIMEZONE_ID, SCRAPER_BROWSER_ACCEPT_LANGUAGE, SCRAPER_BROWSER_VIEWPORT_WIDTH, SCRAPER_BROWSER_VIEWPORT_HEIGHT and SCRAPER_BROWSER_HEADLESS.
  • A 403 or 429 on a crawled page means the target site blocked the request; the service reports it on the page with statusCode and error, without introducing bypasses or logic specific to the target domain.
  • When the target's response headers identify a known anti-bot protection, the payload includes blockType; x-datadome is normalized to blockType: "datadome" on both the page and the crawl payload.
  • If every visited page fails, the crawl result is FAILED; partial failures remain COMPLETED with metrics.failedCount set.
  • The result is delivered to the configured callback with the x-platform-internal-token header.
  • Locally, the service uses Redis and Playwright/Chromium.
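Two of the rules above, the blockType normalization and the FAILED versus COMPLETED decision, can be sketched as pure functions. This is an illustration under assumptions, not the service's code: the function names are hypothetical, a page counts as failed when its `error` is set, and only the `x-datadome` header is mapped because it is the only one documented here.

```typescript
// Minimal page view: only the fields this sketch needs.
interface PageOutcome {
  url: string;
  statusCode: number | null;
  error?: string | null;
}

// Maps known anti-bot response headers to a blockType.
// Only x-datadome is documented; any other mapping would be an assumption.
function detectBlockType(headers: Record<string, string>): string | null {
  const names = Object.keys(headers).map((name) => name.toLowerCase());
  return names.indexOf("x-datadome") !== -1 ? "datadome" : null;
}

// FAILED only when every visited page failed; partial failures stay COMPLETED.
// Assumption: a crawl with zero visited pages is reported as COMPLETED here.
function deriveCrawlOutcome(pages: PageOutcome[]): { status: string; failedCount: number } {
  const failedCount = pages.filter((page) => page.error != null).length;
  const allFailed = pages.length > 0 && failedCount === pages.length;
  return { status: allFailed ? "FAILED" : "COMPLETED", failedCount };
}
```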

Endpoint reference

Common headers for internal routes:

  • x-platform-internal-token: $INTERNAL_TOKEN
  • Content-Type: application/json on POST and PUT requests

GET /health

  • Purpose: Liveness check for the scraper service.
  • Request: No body.
  • Example: curl -sS "$SCRAPER_URL/health"
  • Representative response: {"status":"ok","service":"platform-scraper-service"}
  • Errors: 5xx if the runtime is unavailable.
  • Consumers: status dashboard and local smoke tests.

GET /internal/scraper/providers

  • Purpose: Lists the available crawl providers and the default.
  • Request: Internal header.
  • Example: curl -sS "$SCRAPER_URL/internal/scraper/providers" -H "x-platform-internal-token: $INTERNAL_TOKEN"
  • Representative response: {"defaultProviderKey":"playwright","items":[{"key":"playwright","label":"Playwright","enabled":true,"configured":true,"status":"available","default":true},{"key":"crawlee","label":"Crawlee Playwright crawler","enabled":true,"configured":true,"status":"available","default":false},{"key":"firecrawl","label":"Firecrawl","enabled":true,"configured":false,"status":"unconfigured","default":false}]}
  • Errors: 401/403 for token errors.
  • Consumers: product BFF, operations and MCP scraper diagnostics.

GET /internal/scraper/providers/:providerKey

  • Purpose: Admin detail for a provider, without secrets in clear text.
  • Request: Path parameter providerKey.
  • Example: curl -sS "$SCRAPER_URL/internal/scraper/providers/firecrawl" -H "x-platform-internal-token: $INTERNAL_TOKEN"
  • Representative response: {"provider":{"key":"firecrawl","apiKeyConfigured":true,"baseUrl":"https://api.firecrawl.dev"},"defaultProviderKey":"playwright"}
  • Errors: 404 for an unknown provider, 401/403 for token errors.
  • Consumers: status console.

PUT /internal/scraper/providers/:providerKey

  • Purpose: Updates the persisted provider configuration.
  • Request: Optional enabled, makeDefault, apiKey, baseUrl.
  • Example: curl -sS -X PUT "$SCRAPER_URL/internal/scraper/providers/firecrawl" -H "Content-Type: application/json" -H "x-platform-internal-token: $INTERNAL_TOKEN" -d '{"enabled":true,"makeDefault":true,"apiKey":"fc_x"}'
  • Representative response: {"provider":{"key":"firecrawl","status":"available"},"defaultProviderKey":"firecrawl"}
  • Errors: 400 for an invalid payload, 404 for an unknown provider, 401/403 for token errors.
  • Consumers: status console.

GET /internal/scraper/crawls/providers

  • Purpose: Legacy alias of the crawl provider list.
  • Request: Internal header.
  • Example: curl -sS "$SCRAPER_URL/internal/scraper/crawls/providers" -H "x-platform-internal-token: $INTERNAL_TOKEN"
  • Representative response: Same payload as GET /internal/scraper/providers.
  • Errors: 401/403 for token errors.
  • Consumers: legacy consumers.

POST /internal/scraper/crawls

  • Purpose: Queues a crawl and registers the completion callback.
  • Request: startUrl, callback.url; optional providerKey, maxDepth, maxPages, concurrency, includeHtml, textMaxCharsPerPage.
  • Example: curl -sS -X POST "$SCRAPER_URL/internal/scraper/crawls" -H "Content-Type: application/json" -H "x-platform-internal-token: $INTERNAL_TOKEN" -d '{"startUrl":"https://example.com/","providerKey":"playwright","callback":{"url":"http://127.0.0.1:3004/api/internal/scraper/events"},"maxDepth":0,"maxPages":1,"includeHtml":false,"textMaxCharsPerPage":2000}'
  • Representative response: {"crawlId":"crawl_123","status":"QUEUED","providerKey":"playwright"}
  • Errors: 400 for an invalid URL or invalid limits, 403 for a private or blocked target, 503 when the provider is not configured.
  • Consumers: product BFF/worker and the MCP tool platform_scraper_create_crawl.

GET /internal/scraper/crawls/:crawlId

  • Purpose: Reads crawl detail, pages, skipped URLs and metrics.
  • Request: Path parameter crawlId.
  • Example: curl -sS "$SCRAPER_URL/internal/scraper/crawls/crawl_123" -H "x-platform-internal-token: $INTERNAL_TOKEN"
  • Representative response: {"crawlId":"crawl_123","status":"COMPLETED","providerKey":"playwright","startUrl":"https://example.com/","callbackUrl":"http://127.0.0.1:3004/api/internal/scraper/events","root":{"origin":"https://example.com","pathPrefix":"/"},"pages":[{"url":"https://example.com/","finalUrl":"https://example.com/","depth":0,"statusCode":200,"title":"Example","text":"Example Domain","links":[]}],"skipped":[],"metrics":{"visitedCount":1,"failedCount":0,"skippedCount":0,"startedAt":"2026-05-03T10:00:00.000Z","completedAt":"2026-05-03T10:00:01.000Z"},"createdAt":"2026-05-03T10:00:00.000Z","updatedAt":"2026-05-03T10:00:01.000Z"}
  • Errors: 404 for a nonexistent crawl, 401/403 for token errors.
  • Consumers: product BFF, operations and callback retry.

Callback contract

The product receiver must accept a service-to-service POST with x-platform-internal-token and the following payload:

json
{
  "crawlId": "crawl_123",
  "status": "COMPLETED",
  "providerKey": "playwright",
  "startUrl": "https://example.com/",
  "root": {
    "origin": "https://example.com",
    "pathPrefix": "/"
  },
  "pages": [
    {
      "url": "https://example.com/",
      "finalUrl": "https://example.com/",
      "depth": 0,
      "statusCode": 200,
      "title": "Example",
      "text": "Example Domain",
      "links": []
    }
  ],
  "skipped": [],
  "metrics": {
    "visitedCount": 1,
    "failedCount": 0,
    "skippedCount": 0,
    "startedAt": "2026-05-03T10:00:00.000Z",
    "completedAt": "2026-05-03T10:00:01.000Z"
  },
  "error": null
}
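On the receiving side, the payload above can be typed and guarded with a token check before any product-specific persistence runs. The following is a minimal sketch, not a prescribed receiver: the field types are inferred from the example payload, `handleCrawlCallback` is a hypothetical name, and the 401/200 response codes are assumptions. A real receiver would likely compare tokens in constant time.

```typescript
// Types inferred from the example callback payload; element types of
// `links` and `skipped` are not shown in the example, so they stay unknown.
interface CrawlPage {
  url: string;
  finalUrl: string;
  depth: number;
  statusCode: number | null;
  title: string | null;
  text: string | null;
  links: unknown[];
  error?: string | null;
  blockType?: string | null;
}

interface CrawlCallbackPayload {
  crawlId: string;
  status: string;
  providerKey: string;
  startUrl: string;
  root: { origin: string; pathPrefix: string };
  pages: CrawlPage[];
  skipped: unknown[];
  metrics: {
    visitedCount: number;
    failedCount: number;
    skippedCount: number;
    startedAt: string;
    completedAt: string;
  };
  error: string | null;
  blockType?: string | null;
}

// Hypothetical handler: rejects calls without the shared token, otherwise
// returns a short summary the product code could log or persist.
function handleCrawlCallback(
  headers: Record<string, string | undefined>,
  payload: CrawlCallbackPayload,
  expectedToken: string,
): { httpStatus: number; summary: string | null } {
  if (headers["x-platform-internal-token"] !== expectedToken) {
    return { httpStatus: 401, summary: null }; // assumed status code
  }
  const summary =
    `${payload.crawlId}: ${payload.status}, ` +
    `${payload.pages.length} page(s), ${payload.metrics.failedCount} failed`;
  return { httpStatus: 200, summary };
}
```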

For a known target block, the failed page keeps the HTTP error and adds the structured diagnosis:

json
{
  "status": "FAILED",
  "blockType": "datadome",
  "pages": [
    {
      "url": "https://www.immobiliare.it/vendita-case/voghera/",
      "finalUrl": "https://www.immobiliare.it/vendita-case/voghera/",
      "depth": 0,
      "statusCode": 403,
      "title": null,
      "text": null,
      "links": [],
      "error": "Target blocked the request with HTTP 403.",
      "blockType": "datadome"
    }
  ],
  "error": "Target blocked the request with HTTP 403."
}
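A receiver may want to separate blocked pages from other failures, for instance to avoid retrying a target that is actively refusing requests. The sketch below uses only the fields shown in the example above. `listBlockedPages` and `BlockedPageView` are hypothetical names, and what to do with the result is left to the product code.

```typescript
// Hypothetical view of a page that carries a blockType diagnosis.
interface BlockedPageView {
  url: string;
  statusCode: number | null;
  blockType: string;
}

// Collects the pages that the scraper marked with a blockType.
function listBlockedPages(payload: {
  pages: Array<{ url: string; statusCode: number | null; blockType?: string | null }>;
}): BlockedPageView[] {
  const blocked: BlockedPageView[] = [];
  for (const page of payload.pages) {
    if (page.blockType) {
      blocked.push({ url: page.url, statusCode: page.statusCode, blockType: page.blockType });
    }
  }
  return blocked;
}
```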

Workspace reference: /Users/jeanpaul/projects/cs-repository