
Core · Service · Active scaffold

platform-scraper-service

Core service-to-service crawler runtime that accepts crawl requests, renders pages with Playwright, stores crawl state in Redis, and delivers results through callbacks.

  • TypeScript
  • NestJS 11
  • BullMQ
  • Redis
  • Playwright
  • @platform/contracts-scraper

Spec sheet

  • Boundary: Core / Scraping
  • Runtime: NestJS 11 HTTP service plus BullMQ processor
  • Default port: 3800 in the local stack, 3700 standalone
  • Proxy host: http://scraper.cs.lvh.me:8080
  • Queue/state: BullMQ and Redis
  • Security: internal crawl API guarded by x-platform-internal-token

Responsibilities

  • Accept internal crawl creation requests with callback configuration.
  • Render pages through a browser runtime and extract title, text and links.
  • Follow only same-origin URLs under the root path prefix.
  • Persist crawl lifecycle state from QUEUED/RUNNING to terminal statuses.
  • Deliver completion or failure payloads to service-to-service callbacks.
  • Protect the platform from unsafe local, private and reserved network targets.
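
The lifecycle in the list above can be sketched as a small state machine. This is an illustrative model only: the document names QUEUED and RUNNING, so the terminal statuses COMPLETED and FAILED are assumptions, not the service's actual contract.

```typescript
// Hypothetical crawl lifecycle model. Only QUEUED and RUNNING appear in this
// document; COMPLETED and FAILED are assumed terminal statuses.
type CrawlStatus = "QUEUED" | "RUNNING" | "COMPLETED" | "FAILED";

// Allowed transitions out of each status; empty list means terminal.
const TRANSITIONS: Record<CrawlStatus, CrawlStatus[]> = {
  QUEUED: ["RUNNING", "FAILED"],
  RUNNING: ["COMPLETED", "FAILED"],
  COMPLETED: [],
  FAILED: [],
};

function canTransition(from: CrawlStatus, to: CrawlStatus): boolean {
  return TRANSITIONS[from].includes(to);
}

function isTerminal(status: CrawlStatus): boolean {
  return TRANSITIONS[status].length === 0;
}
```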

Interfaces and contract surface

  • GET /health
  • POST /internal/scraper/crawls
  • GET /internal/scraper/crawls/:crawlId

Consumers

Dependencies and external touchpoints

Notes

  • Frontend applications must not call this service directly.
  • The crawler blocks localhost, private networks and reserved ranges unless a local test explicitly opts in.
  • The callback receiver is responsible for owning product-specific extraction and persistence.
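
Since the callback receiver owns product-specific extraction and persistence, it might be structured like the sketch below. The x-platform-internal-token header name comes from this document; the payload shape and helper names are assumptions for illustration.

```typescript
// Sketch of a callback receiver. The payload fields below are assumed,
// not taken from the service contract.
interface CrawlCallbackPayload {
  crawlId: string;
  status: string;
  pages?: Array<{ url: string; title?: string; text?: string }>;
}

// Pure check so the auth decision is easy to test in isolation.
function isAuthorized(
  headers: Record<string, string | undefined>,
  expectedToken: string,
): boolean {
  return headers["x-platform-internal-token"] === expectedToken;
}

// Product-specific extraction and persistence would live in the happy path.
function handleCallback(
  headers: Record<string, string | undefined>,
  payload: CrawlCallbackPayload,
  expectedToken: string,
): { statusCode: number } {
  if (!isAuthorized(headers, expectedToken)) return { statusCode: 401 };
  // persist payload.pages, update product state, etc.
  return { statusCode: 204 };
}
```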

Source references

  • platform-scraper-service/README.md
  • platform-scraper-service/src/scraper
  • platform-scraper-service/package.json

Quick integration

platform-scraper-service is a service-to-service surface. Frontends must not call it directly.

Useful variables

```bash
export SCRAPER_URL=http://127.0.0.1:3800
export INTERNAL_TOKEN=platform-local-stack-internal-token
```

Health

```bash
curl -sS "$SCRAPER_URL/health"
```

Available providers

```bash
curl -sS "$SCRAPER_URL/internal/scraper/crawls/providers" \
  -H "x-platform-internal-token: $INTERNAL_TOKEN"
```

The default provider is playwright. To use Firecrawl, configure FIRECRAWL_API_KEY and pass "providerKey": "firecrawl" in the crawl request. FIRECRAWL_BASE_URL, FIRECRAWL_POLL_ATTEMPTS, and FIRECRAWL_POLL_INTERVAL_MS let you point at a self-hosted endpoint or tune the polling behaviour.
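
Provider selection is per request. A small builder for the request body can make that explicit; this is a sketch whose field names mirror the curl example in this document, with illustrative defaults and a hypothetical env-var guard.

```typescript
// Sketch of a crawl request builder. Field names mirror the curl example
// in this document; the FIRECRAWL_API_KEY guard is an illustrative check.
type ProviderKey = "playwright" | "firecrawl";

interface CrawlRequest {
  startUrl: string;
  providerKey: ProviderKey;
  callback: { url: string };
  maxDepth: number;
  maxPages: number;
  includeHtml: boolean;
  textMaxCharsPerPage: number;
}

function buildCrawlRequest(
  startUrl: string,
  callbackUrl: string,
  providerKey: ProviderKey = "playwright",
): CrawlRequest {
  // firecrawl is only usable when FIRECRAWL_API_KEY is configured.
  if (providerKey === "firecrawl" && !process.env.FIRECRAWL_API_KEY) {
    throw new Error("firecrawl requires FIRECRAWL_API_KEY to be configured");
  }
  return {
    startUrl,
    providerKey,
    callback: { url: callbackUrl },
    maxDepth: 0,
    maxPages: 1,
    includeHtml: false,
    textMaxCharsPerPage: 2000,
  };
}
```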

Create a crawl

```bash
curl -sS -X POST "$SCRAPER_URL/internal/scraper/crawls" \
  -H "Content-Type: application/json" \
  -H "x-platform-internal-token: $INTERNAL_TOKEN" \
  -d '{
    "startUrl": "https://example.com/",
    "providerKey": "playwright",
    "callback": {
      "url": "https://example.com/platform-scraper-callback"
    },
    "maxDepth": 0,
    "maxPages": 1,
    "includeHtml": false,
    "textMaxCharsPerPage": 2000
  }'
```

The response contains crawlId, providerKey, and status: "QUEUED".

Read crawl status

```bash
curl -sS "$SCRAPER_URL/internal/scraper/crawls/<CRAWL_ID>" \
  -H "x-platform-internal-token: $INTERNAL_TOKEN"
```
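
A consumer can poll this status endpoint until the crawl settles, in addition to (or instead of) waiting for the callback. Below is a sketch with an injectable status function so the loop runs without a live service; the endpoint and header come from this document, while the terminal status names are assumptions.

```typescript
// Polls GET /internal/scraper/crawls/:crawlId until a terminal status.
// Terminal status names are assumed; `getStatus` is injected so the loop
// can be exercised without a live service.
const TERMINAL = new Set(["COMPLETED", "FAILED"]);

async function waitForCrawl(
  crawlId: string,
  getStatus: (crawlId: string) => Promise<{ status: string }>,
  intervalMs = 1000,
  maxAttempts = 30,
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { status } = await getStatus(crawlId);
    if (TERMINAL.has(status)) return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`crawl ${crawlId} did not finish within ${maxAttempts} attempts`);
}
```

A real getStatus would wrap a fetch of `$SCRAPER_URL/internal/scraper/crawls/<CRAWL_ID>` with the x-platform-internal-token header, as in the curl example above.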

Operational notes

  • The crawler accepts only http and https URLs.
  • The provider registry always exposes playwright; firecrawl is available only when FIRECRAWL_API_KEY is configured.
  • Localhost, private networks, and reserved ranges are blocked by the security defaults.
  • The Playwright runtime uses a Chromium browser profile configurable via SCRAPER_BROWSER_USER_AGENT, SCRAPER_BROWSER_LOCALE, SCRAPER_BROWSER_TIMEZONE_ID, SCRAPER_BROWSER_ACCEPT_LANGUAGE, SCRAPER_BROWSER_VIEWPORT_WIDTH, SCRAPER_BROWSER_VIEWPORT_HEIGHT, and SCRAPER_BROWSER_HEADLESS.
  • A 403 or 429 on a crawled page indicates the target site is blocking the crawl; the service must report it in the result without introducing bypasses or target-domain-specific logic.
  • The result is delivered to the configured callback with x-platform-internal-token.
  • Locally the service uses Redis and Playwright/Chromium.
