feat(consumer-prices): add basket price monitoring domain (#1901)

* feat(consumer-prices): add basket price monitoring domain

Adds end-to-end consumer price tracking to enable inflation monitoring
across key markets, starting with UAE essentials basket.

- consumer-prices-core/: companion scraping service with pluggable
  acquisition providers (Playwright, Exa, Firecrawl, Parallel P0),
  config-driven retailer YAML, Postgres schema, Redis snapshots
- proto/worldmonitor/consumer_prices/v1/: 6-RPC service definition
- api/consumer-prices/v1/[rpc].ts: Vercel edge route
- server/worldmonitor/consumer-prices/v1/: Redis-backed RPC handlers
- src/services/consumer-prices/: circuit breakers + bootstrap hydration
- src/components/ConsumerPricesPanel.ts: 5-tab panel (overview /
  categories / movers / spread / health)
- scripts/seed-consumer-prices.mjs: Railway cron seed script
- Wire into bootstrap, health, panels, gateway, cache-keys, locale

* fix(consumer-prices): resolve all code review findings

P0: populate topCategories — categoryResult was fetched but never used.
Added buildTopCategories() helper with grouped CTE query that extracts
current_index and week-over-week pct per category.
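The per-category reduction can be sketched in plain TypeScript (the actual helper does this in a grouped SQL CTE; the row shape and field names below are illustrative assumptions, not the real schema):

```typescript
// Illustrative sketch of the aggregation buildTopCategories() performs in SQL.
interface CategoryRow {
  category: string;
  currentIndex: number; // latest index value for the category
  weekAgoIndex: number; // index value seven days earlier
}

function topCategories(rows: CategoryRow[], limit = 5) {
  return rows
    .map((r) => ({
      category: r.category,
      currentIndex: r.currentIndex,
      // week-over-week percentage change
      wowPct: ((r.currentIndex - r.weekAgoIndex) / r.weekAgoIndex) * 100,
    }))
    // biggest absolute movers first
    .sort((a, b) => Math.abs(b.wowPct) - Math.abs(a.wowPct))
    .slice(0, limit);
}
```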

P1 (4 fixes):
- aggregate: replace N+1 getBaselinePrice loop with single batch query
  getBaselinePrices(ids[], date) via ANY($1) — eliminates 119 DB roundtrips
  per basket run
- aggregate/computeValueIndex: was dividing all category floors by the same
  arbitrary first baseline; now uses per-item floor price with per-item
  baseline (same methodology as fixed index but with cheapest price)
- basket-series endpoint now seeded: added buildBasketSeriesSnapshot() to
  worldmonitor.ts, /basket-series route in companion API, publish.ts writes
  7d/30d/90d series per basket, seed script fetches and writes all three ranges
- scrape: call teardownAll() after each retailer run to close Playwright
  browser; without this the Chromium process leaked on Railway
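The batched-baseline shape looks roughly like this (table and column names are placeholders for illustration; the point is the single `= ANY($1)` round trip instead of one query per basket item):

```typescript
// One round trip for all baseline prices instead of N+1 per-item queries.
function buildBaselinePricesQuery(ids: string[], date: string) {
  return {
    text:
      'SELECT basket_item_id, price FROM baseline_prices ' +
      'WHERE basket_item_id = ANY($1) AND baseline_date = $2',
    values: [ids, date],
  };
}
// With node-postgres this would run as: pool.query(q.text, q.values)
```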

P2 (3 fixes):
- db/client: remove rejectUnauthorized: false — was bypassing TLS cert
  validation on all non-localhost connections
- publish: seed-meta now writes { fetchedAt, recordCount } matching the format
  expected by _seed-utils.mjs writeExtraKeyWithMeta (was writing { fetchedAt, key })
- products: remove unused getMatchedProductsForBasket — exact duplicate of
  getBasketRows in aggregate.ts; never imported by anything

Snapshot type overhaul:
- Flatten WMOverviewSnapshot to match proto GetConsumerPriceOverviewResponse
  (was nested under overview:{}; handlers read flat)
- All asOf fields changed from number to string (int64 → string per proto JSON)
- freshnessMin/parseSuccessRate null -> 0 defaults
- lastRunAt changed from epoch number to ISO string
- Mover items now include currentPrice and currencyCode
- emptyOverview/Movers/Spread/Freshness in seed script use String(Date.now())
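The number-to-string change follows proto3's JSON mapping, which serializes int64 fields as decimal strings. A minimal sketch of the conversion:

```typescript
// proto3 JSON mapping: int64 fields are emitted as decimal strings,
// hence String(Date.now()) rather than a raw epoch number.
function toProtoInt64(ms: number): string {
  return String(Math.trunc(ms));
}
```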

* feat(consumer-prices): wire Exa search engine as acquisition backend for UAE retailers

Ports the proven Exa+summary price extraction from PR #1904 (seed-grocery-basket.mjs)
into consumer-prices-core as ExaSearchAdapter, replacing unvalidated Playwright CSS
scraping for all three UAE retailers (Carrefour, Lulu, Noon).

- New ExaSearchAdapter: discovers targets from basket YAML config (one per item),
  calls Exa API with contents.summary to get AI-extracted prices, uses matchPrice()
  regex (ISO codes + symbol fallback + CURRENCY_MIN guards) to extract AED amounts
- New db/queries/matches.ts: upsertProductMatch() + getBasketItemId() for auto-linking
  scraped Exa results to basket items without a separate matching step
- scrape.ts: selects ExaSearchAdapter when config.adapter === 'exa-search'; after
  insertObservation(), auto-creates canonical product and product_match (status: 'auto')
  so aggregate.ts can compute indices immediately without manual review
- All three UAE retailer YAMLs switched to adapter: exa-search and enabled: true;
  CSS extraction blocks removed (not used by search adapter)
- config/types.ts: adds 'exa-search' to adapter enum
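A simplified stand-in for the matchPrice() flow described above (the real implementation is in PR #1904; the regexes and the CURRENCY_MIN guard value here are illustrative assumptions for AED only):

```typescript
// Simplified AED-only sketch: ISO-code match first, symbol fallback,
// then a minimum-value guard to reject implausibly low matches.
const CURRENCY_MIN_AED = 0.5; // assumed guard value, not the real constant

function matchPrice(summary: string): number | null {
  // ISO code: "AED 12.50" or "12.50 AED"
  const iso = summary.match(/AED\s*(\d+(?:\.\d{1,2})?)|(\d+(?:\.\d{1,2})?)\s*AED/i);
  // Symbol fallback: "د.إ 12.50"
  const sym = summary.match(/د\.إ\s*(\d+(?:\.\d{1,2})?)/);
  const raw = iso?.[1] ?? iso?.[2] ?? sym?.[1];
  if (!raw) return null;
  const value = Number(raw);
  return value >= CURRENCY_MIN_AED ? value : null;
}
```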

* fix(consumer-prices): use EXA_API_KEYS (with fallback to EXA_API_KEY) matching PR #1904 pattern

* fix(consumer-prices): wire ConsumerPricesPanel in layout + fix movers limit:0 bug

Addresses Codex P1 findings on PR #1901:
- panel-layout.ts: import and createPanel('consumer-prices') so the panel
  actually renders in finance/commodity variants where it is enabled in config
- consumer-prices/index.ts: limit was hardcoded 0 causing slice(0,0) to always
  return empty risers/fallers after bootstrap is consumed; fixed to 10
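The failure mode is easy to see: `slice(0, 0)` returns an empty array for any input, so a hardcoded limit of 0 silently empties the lists.

```typescript
// Why limit: 0 emptied the movers lists after bootstrap was consumed.
const movers = [3.2, -1.4, 0.9];
const broken = movers.slice(0, 0); // always [], regardless of input
const fixed = movers.slice(0, 10); // up to 10 items
```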

* fix(consumer-prices): add categories snapshot to close P2 gap

consumer-prices:categories:ae:* was in BOOTSTRAP_KEYS but had no producer,
so the Categories tab always showed upstreamUnavailable.

- buildCategoriesSnapshot() in worldmonitor.ts — wraps buildTopCategories()
  and returns WMCategoriesSnapshot matching ListConsumerPriceCategoriesResponse
- /categories route in consumer-prices-core API
- publish.ts writes consumer-prices:categories:{market}:{range} for 7d/30d/90d
- seed-consumer-prices.mjs fetches all three ranges from consumer-prices-core
  and writes them to Redis alongside the other snapshots

P1 issues (snapshot structure mismatch + limit:0 movers) were already fixed
in earlier commits on this branch.

* fix(types): add variants? to PANEL_CATEGORY_MAP type
Author: Elie Habib
Date: 2026-03-20 17:08:22 +04:00
Committed by: GitHub
Parent: a8f8c0aa61
Commit: 7711e9de03
72 changed files with 6760 additions and 4 deletions

@@ -0,0 +1,33 @@
# ─── Database ────────────────────────────────────────────────────────────────
DATABASE_URL=postgresql://user:password@localhost:5432/consumer_prices
# ─── Redis ────────────────────────────────────────────────────────────────────
REDIS_URL=redis://localhost:6379
# ─── Acquisition Providers ────────────────────────────────────────────────────
# Firecrawl — https://firecrawl.dev
FIRECRAWL_API_KEY=
# Exa — https://exa.ai
EXA_API_KEY=
# Parallel P0
P0_API_KEY=
P0_BASE_URL=https://api.parallelai.dev/v1
# ─── Object Storage (optional, for raw artifact retention) ───────────────────
ARTIFACTS_BUCKET_URL=
ARTIFACTS_BUCKET_KEY=
ARTIFACTS_BUCKET_SECRET=
# ─── API Server ───────────────────────────────────────────────────────────────
PORT=3400
HOST=0.0.0.0
LOG_LEVEL=info
# ─── Security ─────────────────────────────────────────────────────────────────
# Shared secret between WorldMonitor seed job and this service
WORLDMONITOR_SNAPSHOT_API_KEY=
# Allow CORS from specific origin (default: *)
CORS_ORIGIN=*

@@ -0,0 +1,41 @@
FROM node:20-slim AS base
WORKDIR /app
# Install Playwright dependencies
RUN apt-get update && apt-get install -y \
chromium \
fonts-liberation \
libatk-bridge2.0-0 \
libatk1.0-0 \
libcairo2 \
libcups2 \
libdbus-1-3 \
libgdk-pixbuf2.0-0 \
libnspr4 \
libnss3 \
libpango-1.0-0 \
libx11-6 \
libxcomposite1 \
libxdamage1 \
libxext6 \
libxfixes3 \
libxrandr2 \
libxrender1 \
--no-install-recommends && rm -rf /var/lib/apt/lists/*
ENV PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH=/usr/bin/chromium
# Install dependencies (dev deps included — tsc is needed for the build step)
COPY package.json package-lock.json ./
RUN npm ci
# Build
COPY tsconfig.json ./
COPY src ./src
COPY configs ./configs
RUN npm run build
# Drop dev dependencies from the runtime image
RUN npm prune --omit=dev
# Runtime
ENV NODE_ENV=production
EXPOSE 3400
CMD ["node", "dist/api/server.js"]

@@ -0,0 +1,119 @@
basket:
  slug: essentials-ae
  name: Essentials Basket UAE
  marketCode: ae
  methodology: fixed
  baseDate: "2025-01-01"
  description: >
    Core household essentials tracked weekly across UAE retailers.
    Weighted to reflect a typical household of 4 in the UAE.
    Does not represent official CPI. Tracks consumer price pressure only.
  items:
    - id: eggs_12
      category: eggs
      canonicalName: Eggs Fresh 12 Pack
      weight: 0.12
      baseUnit: ct
      substitutionGroup: eggs
      minBaseQty: 10
      maxBaseQty: 15
    - id: milk_1l
      category: dairy
      canonicalName: Full Fat Fresh Milk 1L
      weight: 0.10
      baseUnit: ml
      substitutionGroup: milk_full_fat
      minBaseQty: 900
      maxBaseQty: 1100
    - id: bread_white
      category: bread
      canonicalName: White Sliced Bread 600g
      weight: 0.08
      baseUnit: g
      substitutionGroup: bread_white
      minBaseQty: 500
      maxBaseQty: 700
    - id: rice_basmati_1kg
      category: rice
      canonicalName: Basmati Rice 1kg
      weight: 0.10
      baseUnit: g
      substitutionGroup: rice_basmati
      minBaseQty: 900
      maxBaseQty: 1100
    - id: cooking_oil_sunflower_1l
      category: cooking_oil
      canonicalName: Sunflower Oil 1L
      weight: 0.08
      baseUnit: ml
      substitutionGroup: cooking_oil_sunflower
      minBaseQty: 900
      maxBaseQty: 1100
    - id: chicken_whole_1kg
      category: chicken
      canonicalName: Whole Chicken Fresh 1kg
      weight: 0.12
      baseUnit: g
      substitutionGroup: chicken_whole
      minBaseQty: 800
      maxBaseQty: 1200
    - id: tomatoes_1kg
      category: tomatoes
      canonicalName: Tomatoes Fresh 1kg
      weight: 0.08
      baseUnit: g
      substitutionGroup: tomatoes
      minBaseQty: 800
      maxBaseQty: 1200
    - id: onions_1kg
      category: onions
      canonicalName: Onions 1kg
      weight: 0.06
      baseUnit: g
      substitutionGroup: onions
      minBaseQty: 800
      maxBaseQty: 1200
    - id: water_1_5l
      category: water
      canonicalName: Drinking Water 1.5L
      weight: 0.08
      baseUnit: ml
      substitutionGroup: water_still
      minBaseQty: 1400
      maxBaseQty: 1600
    - id: sugar_1kg
      category: sugar
      canonicalName: White Sugar 1kg
      weight: 0.06
      baseUnit: g
      substitutionGroup: sugar_white
      minBaseQty: 900
      maxBaseQty: 1100
    - id: cheese_processed_200g
      category: dairy
      canonicalName: Processed Cheese Slices 200g
      weight: 0.06
      baseUnit: g
      substitutionGroup: cheese_processed
      minBaseQty: 150
      maxBaseQty: 250
    - id: yogurt_500g
      category: dairy
      canonicalName: Plain Yogurt 500g
      weight: 0.06
      baseUnit: g
      substitutionGroup: yogurt_plain
      minBaseQty: 450
      maxBaseQty: 550
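With `methodology: fixed`, the basket index implied by these weights is the standard fixed-weight (Laspeyres-style) sum of per-item price relatives against baseDate. The exact formula in aggregate.ts is not shown in this diff; this is a sketch under that assumption:

```typescript
// Fixed-basket index: Σ w_i * (p_i / p_i,base) * 100, weights summing to 1.0
// (as the weights in the YAML above do).
interface BasketItem {
  weight: number;    // item weight from the basket config
  price: number;     // current observed price
  basePrice: number; // price on baseDate
}

function fixedIndex(items: BasketItem[]): number {
  return items.reduce((sum, i) => sum + i.weight * (i.price / i.basePrice), 0) * 100;
}
```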

@@ -0,0 +1,28 @@
{
"aliases": {
"Almarai": ["almarai", "al marai", "المراعي"],
"Barakat": ["barakat", "بركات"],
"Nada": ["nada", "ندى"],
"Lactalis": ["lactel", "lactalis"],
"President": ["président", "president"],
"Lurpak": ["lurpak", "لورباك"],
"Baladna": ["baladna"],
"KDD": ["kdd", "Kuwait Dairy Company"],
"Saudia": ["saudia dairy", "saudia"],
"Al Ain": ["al ain", "العين", "al-ain"],
"Masafi": ["masafi", "مسافي"],
"Evian": ["evian"],
"Volvic": ["volvic"],
"Nestle": ["nestle", "nestlé", "نستلة"],
"Kelloggs": ["kellogg's", "kelloggs", "kellog"],
"Uncle Ben's": ["uncle ben's", "uncle bens"],
"Tilda": ["tilda"],
"Daawat": ["daawat", "dawat"],
"India Gate": ["india gate"],
"Carrefour": ["carrefour", "كارفور"],
"Lulu": ["lulu", "lulu hypermarket"],
"Nadec": ["nadec"],
"Nabil": ["nabil"],
"Americana": ["americana"]
}
}

@@ -0,0 +1,21 @@
retailer:
  slug: carrefour_ae
  name: Carrefour UAE
  marketCode: ae
  currencyCode: AED
  adapter: exa-search
  baseUrl: https://www.carrefouruae.com
  enabled: true
  acquisition:
    provider: exa
    rateLimit:
      requestsPerMinute: 20
      maxConcurrency: 2
      delayBetweenRequestsMs: 3000
  discovery:
    mode: search
    maxPages: 20
    seeds: []

@@ -0,0 +1,21 @@
retailer:
  slug: lulu_ae
  name: Lulu Hypermarket UAE
  marketCode: ae
  currencyCode: AED
  adapter: exa-search
  baseUrl: https://www.luluhypermarket.com
  enabled: true
  acquisition:
    provider: exa
    rateLimit:
      requestsPerMinute: 15
      maxConcurrency: 1
      delayBetweenRequestsMs: 4000
  discovery:
    mode: search
    maxPages: 20
    seeds: []

@@ -0,0 +1,21 @@
retailer:
  slug: noon_grocery_ae
  name: Noon Grocery UAE
  marketCode: ae
  currencyCode: AED
  adapter: exa-search
  baseUrl: https://www.noon.com
  enabled: true
  acquisition:
    provider: exa
    rateLimit:
      requestsPerMinute: 10
      maxConcurrency: 1
      delayBetweenRequestsMs: 6000
  discovery:
    mode: search
    maxPages: 20
    seeds: []

@@ -0,0 +1,195 @@
-- Consumer Prices Core: Initial Schema
-- Run: psql $DATABASE_URL < migrations/001_initial.sql
CREATE EXTENSION IF NOT EXISTS "pgcrypto";
-- ─── Retailers ────────────────────────────────────────────────────────────────
CREATE TABLE retailers (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
slug VARCHAR(64) NOT NULL UNIQUE,
name VARCHAR(128) NOT NULL,
market_code CHAR(2) NOT NULL,
country_code CHAR(2) NOT NULL,
currency_code CHAR(3) NOT NULL,
adapter_key VARCHAR(32) NOT NULL DEFAULT 'generic',
base_url TEXT NOT NULL,
active BOOLEAN NOT NULL DEFAULT TRUE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE retailer_targets (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
retailer_id UUID NOT NULL REFERENCES retailers(id) ON DELETE CASCADE,
target_type VARCHAR(32) NOT NULL CHECK (target_type IN ('category_url','product_url','search_query')),
target_ref TEXT NOT NULL,
category_slug VARCHAR(64) NOT NULL,
enabled BOOLEAN NOT NULL DEFAULT TRUE,
last_scraped_at TIMESTAMPTZ
);
-- ─── Products ─────────────────────────────────────────────────────────────────
CREATE TABLE canonical_products (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
canonical_name VARCHAR(256) NOT NULL,
brand_norm VARCHAR(128),
category VARCHAR(64) NOT NULL,
variant_norm VARCHAR(128),
size_value NUMERIC(12,4),
size_unit VARCHAR(16),
base_quantity NUMERIC(12,4),
base_unit VARCHAR(16),
active BOOLEAN NOT NULL DEFAULT TRUE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE (canonical_name, brand_norm, category, variant_norm, size_value, size_unit)
);
CREATE TABLE retailer_products (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
retailer_id UUID NOT NULL REFERENCES retailers(id) ON DELETE CASCADE,
retailer_sku VARCHAR(128),
canonical_product_id UUID REFERENCES canonical_products(id),
source_url TEXT NOT NULL,
raw_title TEXT NOT NULL,
raw_brand TEXT,
raw_size_text TEXT,
image_url TEXT,
category_text TEXT,
first_seen_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
last_seen_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
active BOOLEAN NOT NULL DEFAULT TRUE,
UNIQUE (retailer_id, source_url)
);
CREATE INDEX idx_retailer_products_retailer ON retailer_products(retailer_id);
CREATE INDEX idx_retailer_products_canonical ON retailer_products(canonical_product_id) WHERE canonical_product_id IS NOT NULL;
-- ─── Observations ─────────────────────────────────────────────────────────────
CREATE TABLE scrape_runs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
retailer_id UUID NOT NULL REFERENCES retailers(id),
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
finished_at TIMESTAMPTZ,
status VARCHAR(16) NOT NULL DEFAULT 'running'
CHECK (status IN ('running','completed','failed','partial')),
trigger_type VARCHAR(16) NOT NULL DEFAULT 'scheduled'
CHECK (trigger_type IN ('scheduled','manual')),
pages_attempted INT NOT NULL DEFAULT 0,
pages_succeeded INT NOT NULL DEFAULT 0,
errors_count INT NOT NULL DEFAULT 0,
config_version VARCHAR(32) NOT NULL DEFAULT '1'
);
CREATE TABLE price_observations (
id BIGSERIAL PRIMARY KEY,
retailer_product_id UUID NOT NULL REFERENCES retailer_products(id),
scrape_run_id UUID NOT NULL REFERENCES scrape_runs(id),
observed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
price NUMERIC(12,2) NOT NULL,
list_price NUMERIC(12,2),
promo_price NUMERIC(12,2),
currency_code CHAR(3) NOT NULL,
unit_price NUMERIC(12,4),
unit_basis_qty NUMERIC(12,4),
unit_basis_unit VARCHAR(16),
in_stock BOOLEAN NOT NULL DEFAULT TRUE,
promo_text TEXT,
raw_payload_json JSONB NOT NULL DEFAULT '{}',
raw_hash VARCHAR(64) NOT NULL
);
CREATE INDEX idx_price_obs_product_time ON price_observations(retailer_product_id, observed_at DESC);
CREATE INDEX idx_price_obs_run ON price_observations(scrape_run_id);
-- ─── Matching ─────────────────────────────────────────────────────────────────
CREATE TABLE product_matches (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
retailer_product_id UUID NOT NULL REFERENCES retailer_products(id),
canonical_product_id UUID NOT NULL REFERENCES canonical_products(id),
basket_item_id UUID,
match_score NUMERIC(5,2) NOT NULL,
match_status VARCHAR(16) NOT NULL DEFAULT 'review'
CHECK (match_status IN ('auto','review','approved','rejected')),
evidence_json JSONB NOT NULL DEFAULT '{}',
reviewed_by VARCHAR(64),
reviewed_at TIMESTAMPTZ,
UNIQUE (retailer_product_id, canonical_product_id)
);
-- ─── Baskets ──────────────────────────────────────────────────────────────────
CREATE TABLE baskets (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
slug VARCHAR(64) NOT NULL UNIQUE,
name VARCHAR(128) NOT NULL,
market_code CHAR(2) NOT NULL,
methodology VARCHAR(16) NOT NULL CHECK (methodology IN ('fixed','value')),
base_date DATE NOT NULL,
description TEXT,
active BOOLEAN NOT NULL DEFAULT TRUE
);
CREATE TABLE basket_items (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
basket_id UUID NOT NULL REFERENCES baskets(id) ON DELETE CASCADE,
category VARCHAR(64) NOT NULL,
canonical_product_id UUID REFERENCES canonical_products(id),
substitution_group VARCHAR(64),
weight NUMERIC(5,4) NOT NULL,
qualification_rules_json JSONB,
active BOOLEAN NOT NULL DEFAULT TRUE
);
ALTER TABLE product_matches ADD CONSTRAINT fk_pm_basket_item
FOREIGN KEY (basket_item_id) REFERENCES basket_items(id);
-- ─── Analytics ────────────────────────────────────────────────────────────────
CREATE TABLE computed_indices (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
basket_id UUID NOT NULL REFERENCES baskets(id),
retailer_id UUID REFERENCES retailers(id),
category VARCHAR(64),
metric_date DATE NOT NULL,
metric_key VARCHAR(64) NOT NULL,
metric_value NUMERIC(14,4) NOT NULL,
methodology_version VARCHAR(16) NOT NULL DEFAULT '1',
UNIQUE (basket_id, retailer_id, category, metric_date, metric_key)
);
CREATE INDEX idx_computed_indices_basket_date ON computed_indices(basket_id, metric_date DESC);
-- ─── Operational ──────────────────────────────────────────────────────────────
CREATE TABLE source_artifacts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
scrape_run_id UUID NOT NULL REFERENCES scrape_runs(id),
retailer_product_id UUID REFERENCES retailer_products(id),
artifact_type VARCHAR(16) NOT NULL CHECK (artifact_type IN ('html','screenshot','parsed_json')),
storage_key TEXT NOT NULL,
content_type VARCHAR(64) NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE data_source_health (
retailer_id UUID PRIMARY KEY REFERENCES retailers(id),
last_successful_run_at TIMESTAMPTZ,
last_run_status VARCHAR(16),
parse_success_rate NUMERIC(5,2),
avg_freshness_minutes NUMERIC(8,2),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- ─── Updated-at trigger ───────────────────────────────────────────────────────
CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN NEW.updated_at = NOW(); RETURN NEW; END;
$$;
CREATE TRIGGER retailers_updated_at BEFORE UPDATE ON retailers
FOR EACH ROW EXECUTE FUNCTION set_updated_at();

@@ -0,0 +1,40 @@
{
"name": "consumer-prices-core",
"version": "1.0.0",
"private": true,
"type": "module",
"scripts": {
"build": "tsc",
"start": "node dist/api/server.js",
"dev": "tsx watch src/api/server.ts",
"jobs:scrape": "tsx src/jobs/scrape.ts",
"jobs:aggregate": "tsx src/jobs/aggregate.ts",
"jobs:publish": "tsx src/jobs/publish.ts",
"migrate": "tsx src/db/migrate.ts",
"validate": "tsx src/cli/validate.ts",
"test": "vitest run",
"test:watch": "vitest"
},
"dependencies": {
"@fastify/cors": "^9.0.1",
"dotenv": "^16.4.5",
"exa-js": "^1.7.0",
"fastify": "^4.28.1",
"js-yaml": "^4.1.0",
"jsdom": "^25.0.1",
"pg": "^8.13.1",
"pino": "^9.4.0",
"playwright": "^1.47.2",
"redis": "^4.7.0",
"zod": "^3.23.8"
},
"devDependencies": {
"@types/js-yaml": "^4.0.9",
"@types/jsdom": "^21.1.7",
"@types/node": "^22.7.5",
"@types/pg": "^8.11.10",
"tsx": "^4.19.1",
"typescript": "^5.6.3",
"vitest": "^2.1.2"
}
}

@@ -0,0 +1,85 @@
import Exa from 'exa-js';
import type { AcquisitionProvider, ExtractResult, ExtractSchema, FetchOptions, FetchResult, SearchOptions, SearchResult } from './types.js';
export class ExaProvider implements AcquisitionProvider {
readonly name = 'exa' as const;
private client: Exa;
constructor(apiKey: string) {
this.client = new Exa(apiKey);
}
async fetch(url: string, _opts: FetchOptions = {}): Promise<FetchResult> {
const result = await this.client.getContents([url], {
text: { maxCharacters: 100_000 },
highlights: { numSentences: 5, highlightsPerUrl: 3 },
});
const item = result.results[0];
if (!item) throw new Error(`Exa returned no content for ${url}`);
return {
url,
html: item.text ?? '',
markdown: item.text ?? '',
statusCode: 200,
provider: this.name,
fetchedAt: new Date(),
metadata: { highlights: item.highlights },
};
}
async search(query: string, opts: SearchOptions = {}): Promise<SearchResult[]> {
const result = await this.client.search(query, {
numResults: opts.numResults ?? 10,
type: opts.type ?? 'neural',
includeDomains: opts.includeDomains,
startPublishedDate: opts.startPublishedDate,
useAutoprompt: true,
});
return result.results.map((r) => ({
url: r.url,
title: r.title ?? '',
text: r.text,
highlights: r.highlights,
score: r.score,
publishedDate: r.publishedDate,
}));
}
async extract<T = Record<string, unknown>>(
url: string,
schema: ExtractSchema,
_opts: FetchOptions = {},
): Promise<ExtractResult<T>> {
const prompt = `Extract the following fields from this product page: ${Object.entries(schema.fields)
.map(([k, v]) => `${k} (${v.type}): ${v.description}`)
.join(', ')}`;
const result = await this.client.getContents([url], {
text: { maxCharacters: 50_000 },
summary: { query: prompt },
});
const item = result.results[0];
if (!item) throw new Error(`Exa returned no content for ${url}`);
return {
url,
data: (item as unknown as { summary?: T }).summary ?? ({} as T),
provider: this.name,
fetchedAt: new Date(),
};
}
async validate(): Promise<boolean> {
try {
await this.client.search('test', { numResults: 1 });
return true;
} catch {
return false;
}
}
}

@@ -0,0 +1,142 @@
import type { AcquisitionProvider, ExtractResult, ExtractSchema, FetchOptions, FetchResult, SearchOptions, SearchResult } from './types.js';
interface FirecrawlScrapeResponse {
success: boolean;
data?: {
html?: string;
markdown?: string;
metadata?: Record<string, unknown>;
};
error?: string;
}
interface FirecrawlSearchResponse {
success: boolean;
data?: Array<{ url: string; title: string; description?: string; markdown?: string }>;
}
interface FirecrawlExtractResponse {
success: boolean;
data?: Record<string, unknown>;
}
export class FirecrawlProvider implements AcquisitionProvider {
readonly name = 'firecrawl' as const;
private readonly baseUrl = 'https://api.firecrawl.dev/v1';
constructor(private readonly apiKey: string) {}
private headers() {
return {
Authorization: `Bearer ${this.apiKey}`,
'Content-Type': 'application/json',
};
}
async fetch(url: string, opts: FetchOptions = {}): Promise<FetchResult> {
const resp = await fetch(`${this.baseUrl}/scrape`, {
method: 'POST',
headers: this.headers(),
body: JSON.stringify({
url,
formats: ['html', 'markdown'],
waitFor: opts.waitForSelector ? 2000 : 0,
timeout: opts.timeout ?? 30_000,
headers: opts.headers,
}),
signal: AbortSignal.timeout((opts.timeout ?? 30_000) + 5_000),
});
if (!resp.ok) throw new Error(`Firecrawl scrape failed: HTTP ${resp.status}`);
const data = (await resp.json()) as FirecrawlScrapeResponse;
if (!data.success || !data.data) {
throw new Error(`Firecrawl error: ${data.error ?? 'unknown'}`);
}
return {
url,
html: data.data.html ?? '',
markdown: data.data.markdown ?? '',
statusCode: 200,
provider: this.name,
fetchedAt: new Date(),
metadata: data.data.metadata,
};
}
async search(query: string, opts: SearchOptions = {}): Promise<SearchResult[]> {
const resp = await fetch(`${this.baseUrl}/search`, {
method: 'POST',
headers: this.headers(),
body: JSON.stringify({
query,
limit: opts.numResults ?? 10,
includeDomains: opts.includeDomains,
scrapeOptions: { formats: ['markdown'] },
}),
});
if (!resp.ok) throw new Error(`Firecrawl search failed: HTTP ${resp.status}`);
const data = (await resp.json()) as FirecrawlSearchResponse;
return (data.data ?? []).map((r) => ({
url: r.url,
title: r.title,
text: r.description ?? r.markdown,
}));
}
async extract<T = Record<string, unknown>>(
url: string,
schema: ExtractSchema,
opts: FetchOptions = {},
): Promise<ExtractResult<T>> {
const jsonSchema: Record<string, unknown> = {
type: 'object',
properties: Object.fromEntries(
Object.entries(schema.fields).map(([k, v]) => [
k,
{ type: v.type, description: v.description },
]),
),
};
const resp = await fetch(`${this.baseUrl}/scrape`, {
method: 'POST',
headers: this.headers(),
body: JSON.stringify({
url,
formats: ['extract'],
extract: { schema: jsonSchema },
timeout: opts.timeout ?? 30_000,
}),
});
if (!resp.ok) throw new Error(`Firecrawl extract failed: HTTP ${resp.status}`);
const data = (await resp.json()) as FirecrawlExtractResponse;
return {
url,
data: (data.data ?? {}) as T,
provider: this.name,
fetchedAt: new Date(),
};
}
async validate(): Promise<boolean> {
try {
const resp = await fetch(`${this.baseUrl}/scrape`, {
method: 'POST',
headers: this.headers(),
body: JSON.stringify({ url: 'https://example.com', formats: ['markdown'] }),
signal: AbortSignal.timeout(10_000),
});
return resp.ok;
} catch {
return false;
}
}
}

@@ -0,0 +1,103 @@
/**
 * Parallel P0 acquisition provider.
 * P0 is a high-throughput scraping API that handles JS rendering,
 * anti-bot measures, and proxy rotation; this provider talks to its REST API.
 */
import type { AcquisitionProvider, FetchOptions, FetchResult, SearchOptions, SearchResult } from './types.js';
interface P0ScrapeResponse {
success: boolean;
html?: string;
markdown?: string;
statusCode?: number;
error?: string;
}
interface P0SearchResponse {
results?: Array<{ url: string; title: string; snippet?: string }>;
}
export class P0Provider implements AcquisitionProvider {
readonly name = 'p0' as const;
private readonly baseUrl: string;
constructor(
private readonly apiKey: string,
baseUrl = 'https://api.parallelai.dev/v1',
) {
this.baseUrl = baseUrl;
}
private headers() {
return {
'x-api-key': this.apiKey,
'Content-Type': 'application/json',
};
}
async fetch(url: string, opts: FetchOptions = {}): Promise<FetchResult> {
const resp = await fetch(`${this.baseUrl}/scrape`, {
method: 'POST',
headers: this.headers(),
body: JSON.stringify({
url,
render_js: true,
wait_for: opts.waitForSelector,
timeout: Math.floor((opts.timeout ?? 30_000) / 1_000),
output_format: 'html',
premium_proxy: true,
}),
signal: AbortSignal.timeout((opts.timeout ?? 30_000) + 10_000),
});
if (!resp.ok) throw new Error(`P0 scrape failed: HTTP ${resp.status}`);
const data = (await resp.json()) as P0ScrapeResponse;
if (!data.success && !data.html) {
throw new Error(`P0 error: ${data.error ?? 'no content'}`);
}
return {
url,
html: data.html ?? '',
markdown: data.markdown,
statusCode: data.statusCode ?? 200,
provider: this.name,
fetchedAt: new Date(),
};
}
async search(query: string, opts: SearchOptions = {}): Promise<SearchResult[]> {
const resp = await fetch(`${this.baseUrl}/search`, {
method: 'POST',
headers: this.headers(),
body: JSON.stringify({
query,
num_results: opts.numResults ?? 10,
include_domains: opts.includeDomains,
}),
});
if (!resp.ok) throw new Error(`P0 search failed: HTTP ${resp.status}`);
const data = (await resp.json()) as P0SearchResponse;
return (data.results ?? []).map((r) => ({
url: r.url,
title: r.title,
text: r.snippet,
}));
}
async validate(): Promise<boolean> {
try {
const resp = await fetch(`${this.baseUrl}/health`, {
headers: this.headers(),
signal: AbortSignal.timeout(5_000),
});
return resp.ok;
} catch {
return false;
}
}
}

@@ -0,0 +1,73 @@
import { chromium, type Browser, type BrowserContext } from 'playwright';
import type { AcquisitionProvider, FetchOptions, FetchResult, SearchOptions, SearchResult } from './types.js';
const DEFAULT_UA =
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36';
export class PlaywrightProvider implements AcquisitionProvider {
readonly name = 'playwright' as const;
private browser: Browser | null = null;
private context: BrowserContext | null = null;
private async getContext(): Promise<BrowserContext> {
if (!this.browser) {
this.browser = await chromium.launch({ headless: true });
}
if (!this.context) {
this.context = await this.browser.newContext({
userAgent: DEFAULT_UA,
locale: 'en-US',
viewport: { width: 1280, height: 900 },
extraHTTPHeaders: { 'Accept-Language': 'en-US,en;q=0.9' },
});
}
return this.context;
}
async fetch(url: string, opts: FetchOptions = {}): Promise<FetchResult> {
const ctx = await this.getContext();
const page = await ctx.newPage();
const timeout = opts.timeout ?? 30_000;
try {
if (opts.headers) {
await page.setExtraHTTPHeaders(opts.headers);
}
const response = await page.goto(url, { waitUntil: 'domcontentloaded', timeout });
if (opts.waitForSelector) {
await page.waitForSelector(opts.waitForSelector, { timeout: 10_000 }).catch(() => {});
}
const html = await page.content();
const statusCode = response?.status() ?? 200;
return { url, html, statusCode, provider: this.name, fetchedAt: new Date() };
} finally {
await page.close();
}
}
async search(_query: string, _opts?: SearchOptions): Promise<SearchResult[]> {
throw new Error('PlaywrightProvider does not support search mode. Use Exa instead.');
}
async validate(): Promise<boolean> {
try {
await this.getContext();
return true;
} catch {
return false;
}
}
async teardown(): Promise<void> {
await this.context?.close();
await this.browser?.close();
this.context = null;
this.browser = null;
}
}

@@ -0,0 +1,59 @@
import { ExaProvider } from './exa.js';
import { FirecrawlProvider } from './firecrawl.js';
import { P0Provider } from './p0.js';
import { PlaywrightProvider } from './playwright.js';
import type { AcquisitionConfig, AcquisitionProvider, AcquisitionProviderName, FetchOptions, FetchResult } from './types.js';
const _providers = new Map<AcquisitionProviderName, AcquisitionProvider>();
export function initProviders(env: Record<string, string | undefined>) {
_providers.set('playwright', new PlaywrightProvider());
if (env.EXA_API_KEY) {
_providers.set('exa', new ExaProvider(env.EXA_API_KEY));
}
if (env.FIRECRAWL_API_KEY) {
_providers.set('firecrawl', new FirecrawlProvider(env.FIRECRAWL_API_KEY));
}
if (env.P0_API_KEY) {
_providers.set('p0', new P0Provider(env.P0_API_KEY, env.P0_BASE_URL));
}
}
export function getProvider(name: AcquisitionProviderName): AcquisitionProvider {
const p = _providers.get(name);
if (!p) throw new Error(`Acquisition provider '${name}' is not configured. Set the required API key env var.`);
return p;
}
export async function teardownAll(): Promise<void> {
for (const p of _providers.values()) {
await p.teardown?.();
}
_providers.clear();
}
/**
* Fetch a URL using the provider chain defined in config.
* Tries primary provider first; on failure, tries fallback.
*/
export async function fetchWithFallback(
url: string,
config: AcquisitionConfig,
opts?: FetchOptions,
): Promise<FetchResult> {
const primary = getProvider(config.provider);
const mergedOpts = { ...config.options, ...opts };
try {
return await primary.fetch(url, mergedOpts);
} catch (err) {
if (!config.fallback) throw err;
const msg = err instanceof Error ? err.message : String(err);
console.warn(`[acquisition] ${config.provider} failed for ${url}: ${msg}. Falling back to ${config.fallback}.`);
const fallback = getProvider(config.fallback);
return fallback.fetch(url, mergedOpts);
}
}

@@ -0,0 +1,78 @@
export type AcquisitionProviderName = 'playwright' | 'exa' | 'firecrawl' | 'p0';
export interface FetchOptions {
waitForSelector?: string;
timeout?: number;
headers?: Record<string, string>;
retries?: number;
userAgent?: string;
}
export interface SearchOptions {
numResults?: number;
includeDomains?: string[];
startPublishedDate?: string;
type?: 'keyword' | 'neural';
}
export interface ExtractSchema {
fields: Record<string, { description: string; type: 'string' | 'number' | 'boolean' | 'array' }>;
}
export interface FetchResult {
url: string;
html: string;
markdown?: string;
statusCode: number;
provider: AcquisitionProviderName;
fetchedAt: Date;
metadata?: Record<string, unknown>;
}
export interface SearchResult {
url: string;
title: string;
text?: string;
highlights?: string[];
score?: number;
publishedDate?: string;
}
export interface ExtractResult<T = Record<string, unknown>> {
url: string;
data: T;
provider: AcquisitionProviderName;
fetchedAt: Date;
}
export interface AcquisitionProvider {
readonly name: AcquisitionProviderName;
/** Fetch a URL, returning HTML content. */
fetch(url: string, opts?: FetchOptions): Promise<FetchResult>;
/** Search for pages matching a query (Exa primary, others may not support). */
search?(query: string, opts?: SearchOptions): Promise<SearchResult[]>;
/** Extract structured data from a URL using a schema hint. */
extract?<T = Record<string, unknown>>(url: string, schema: ExtractSchema, opts?: FetchOptions): Promise<ExtractResult<T>>;
/** Validate provider is configured and reachable. */
validate(): Promise<boolean>;
/** Clean up resources (close browser, etc.) */
teardown?(): Promise<void>;
}
export interface AcquisitionConfig {
/** Primary acquisition provider. */
provider: AcquisitionProviderName;
/** Fallback provider if primary fails. */
fallback?: AcquisitionProviderName;
/** Provider-specific options. */
options?: FetchOptions;
/** Use search mode instead of direct URL fetch (Exa only). */
searchMode?: boolean;
/** Query template for search mode: use {category}, {product}, {market} tokens. */
searchQueryTemplate?: string;
}


@@ -0,0 +1,218 @@
/**
* ExaSearchAdapter — acquires prices via Exa AI neural search + summary extraction.
* Ported from scripts/seed-grocery-basket.mjs (PR #1904).
*
* Instead of fetching category pages and parsing CSS selectors, this adapter:
* 1. Discovers targets from the basket YAML config (one target per basket item)
* 2. Calls Exa with contents.summary to get AI-extracted price text from retailer pages
* 3. Uses regex to extract the price from the summary
*
* Basket → product match is written automatically (match_status: 'auto')
* because the search is item-specific — no ambiguity in what was searched.
*/
import { loadAllBasketConfigs } from '../config/loader.js';
import type { AdapterContext, FetchResult, ParsedProduct, RetailerAdapter, Target } from './types.js';
import type { RetailerConfig } from '../config/types.js';
const CHROME_UA =
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
const CCY =
'USD|GBP|EUR|JPY|CNY|INR|AUD|CAD|BRL|MXN|ZAR|TRY|NGN|KRW|SGD|PKR|AED|SAR|QAR|KWD|BHD|OMR|EGP|JOD|LBP|KES|ARS|IDR|PHP';
const SYMBOL_MAP: Record<string, string> = {
'£': 'GBP',
'€': 'EUR',
'¥': 'JPY',
'₩': 'KRW',
'₹': 'INR',
'₦': 'NGN',
'R$': 'BRL',
};
const CURRENCY_MIN: Record<string, number> = {
NGN: 50,
IDR: 500,
ARS: 50,
KRW: 1000,
ZAR: 2,
PKR: 20,
LBP: 1000,
};
const PRICE_PATTERNS = [
new RegExp(`(\\d+(?:\\.\\d{1,3})?)\\s*(${CCY})`, 'i'),
new RegExp(`(${CCY})\\s*(\\d+(?:\\.\\d{1,3})?)`, 'i'),
];
function matchPrice(text: string, expectedCurrency: string): number | null {
for (const re of PRICE_PATTERNS) {
const match = text.match(re);
if (match) {
const [price, currency] = /^\d/.test(match[1])
? [parseFloat(match[1]), match[2].toUpperCase()]
: [parseFloat(match[2]), match[1].toUpperCase()];
if (currency !== expectedCurrency) continue;
const minPrice = CURRENCY_MIN[currency] ?? 0;
if (price > minPrice && price < 100_000) return price;
}
}
for (const [sym, iso] of Object.entries(SYMBOL_MAP)) {
if (iso !== expectedCurrency) continue;
const re = new RegExp(`${sym.replace('$', '\\$')}\\s*(\\d+(?:[.,]\\d{1,3})?)`, 'i');
const m = text.match(re);
if (m) {
const price = parseFloat(m[1].replace(',', '.'));
const minPrice = CURRENCY_MIN[iso] ?? 0;
if (price > minPrice && price < 100_000) return price;
}
}
return null;
}
interface ExaResult {
url?: string;
title?: string;
summary?: string;
}
interface SearchPayload {
exaResults: ExaResult[];
basketSlug: string;
itemCategory: string;
canonicalName: string;
}
export class ExaSearchAdapter implements RetailerAdapter {
readonly key = 'exa-search';
constructor(private readonly apiKey: string) {}
async discoverTargets(ctx: AdapterContext): Promise<Target[]> {
const baskets = loadAllBasketConfigs().filter((b) => b.marketCode === ctx.config.marketCode);
const domain = new URL(ctx.config.baseUrl).hostname;
const targets: Target[] = [];
for (const basket of baskets) {
for (const item of basket.items) {
targets.push({
id: item.id,
url: ctx.config.baseUrl,
category: item.category,
metadata: {
canonicalName: item.canonicalName,
domain,
basketSlug: basket.slug,
currency: ctx.config.currencyCode,
},
});
}
}
return targets;
}
async fetchTarget(ctx: AdapterContext, target: Target): Promise<FetchResult> {
if (!this.apiKey) throw new Error('EXA_API_KEY is required for exa-search adapter');
const { canonicalName, domain, currency, basketSlug } = target.metadata as {
canonicalName: string;
domain: string;
currency: string;
basketSlug: string;
};
const body = {
query: `${canonicalName} ${currency} retail price`,
numResults: 5,
type: 'auto',
includeDomains: [domain],
contents: {
summary: {
query: `What is the retail price of this product? State amount and ISO currency code (e.g. ${currency} 12.50).`,
},
},
};
const resp = await fetch('https://api.exa.ai/search', {
method: 'POST',
headers: {
'x-api-key': this.apiKey,
'Content-Type': 'application/json',
'User-Agent': CHROME_UA,
},
body: JSON.stringify(body),
signal: AbortSignal.timeout(15_000),
});
if (!resp.ok) {
const text = await resp.text().catch(() => '');
throw new Error(`Exa search failed HTTP ${resp.status}: ${text.slice(0, 120)}`);
}
const data = (await resp.json()) as { results?: ExaResult[] };
const payload: SearchPayload = {
exaResults: data.results ?? [],
basketSlug,
itemCategory: target.category,
canonicalName,
};
return {
url: target.url,
html: JSON.stringify(payload),
statusCode: 200,
fetchedAt: new Date(),
};
}
async parseListing(ctx: AdapterContext, result: FetchResult): Promise<ParsedProduct[]> {
const payload = JSON.parse(result.html) as SearchPayload;
const currency = ctx.config.currencyCode;
for (const r of payload.exaResults) {
const price =
matchPrice(r.summary ?? '', currency) ??
matchPrice(r.title ?? '', currency);
if (price !== null) {
return [
{
sourceUrl: r.url ?? ctx.config.baseUrl,
rawTitle: r.title ?? payload.canonicalName,
rawBrand: null,
rawSizeText: null,
imageUrl: null,
categoryText: payload.itemCategory,
retailerSku: null,
price,
listPrice: null,
promoPrice: null,
promoText: null,
inStock: true,
rawPayload: {
exaUrl: r.url,
summary: r.summary,
basketSlug: payload.basketSlug,
itemCategory: payload.itemCategory,
canonicalName: payload.canonicalName,
},
},
];
}
}
return [];
}
async parseProduct(_ctx: AdapterContext, _result: FetchResult): Promise<ParsedProduct> {
throw new Error('ExaSearchAdapter does not support single-product parsing');
}
async validateConfig(config: RetailerConfig): Promise<string[]> {
const errors: string[] = [];
if (!this.apiKey) errors.push('EXA_API_KEY env var is required for adapter: exa-search');
if (!config.baseUrl) errors.push('baseUrl is required');
return errors;
}
}
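The two-stage extraction in `matchPrice` (ISO-code pattern first, then symbol map) is easiest to see in isolation. A trimmed re-statement with a three-currency alternation; the full version above also enforces per-currency minimum prices:

```typescript
// Trimmed matchPrice: one combined ISO pattern, then a symbol fallback.
const ISO = /(\d+(?:\.\d{1,3})?)\s*(AED|GBP|EUR)|(AED|GBP|EUR)\s*(\d+(?:\.\d{1,3})?)/i;
const SYMBOLS: Record<string, string> = { '£': 'GBP', '€': 'EUR' };

function matchPriceSketch(text: string, expected: string): number | null {
  const m = text.match(ISO);
  if (m) {
    // Which capture groups are populated depends on whether the amount or the
    // currency code came first in the text.
    const price = parseFloat(m[1] ?? m[4]);
    const ccy = (m[2] ?? m[3]).toUpperCase();
    if (ccy === expected && price > 0 && price < 100_000) return price;
  }
  for (const [sym, iso] of Object.entries(SYMBOLS)) {
    if (iso !== expected) continue;
    const sm = text.match(new RegExp(`${sym}\\s*(\\d+(?:[.,]\\d{1,3})?)`));
    if (sm) return parseFloat(sm[1].replace(',', '.'));
  }
  return null;
}
```

Requiring the matched currency to equal the expected one is what keeps a summary quoting a converted USD figure from polluting an AED basket.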


@@ -0,0 +1,165 @@
/**
* Generic config-driven adapter.
* Uses CSS selectors from the retailer YAML to extract products.
* Works with any acquisition provider (Playwright, Firecrawl, Exa, P0).
*/
// eslint-disable-next-line @typescript-eslint/ban-ts-comment
// @ts-ignore — jsdom types provided via @types/jsdom
import { JSDOM } from 'jsdom';
import { fetchWithFallback } from '../acquisition/registry.js';
import type { AdapterContext, FetchResult, ParsedProduct, RetailerAdapter, Target } from './types.js';
import type { RetailerConfig } from '../config/types.js';
function parsePrice(text: string | null | undefined, config: RetailerConfig): number | null {
if (!text) return null;
const fmt = config.extraction?.priceFormat;
let clean = text;
if (fmt?.currencySymbols) {
for (const sym of fmt.currencySymbols) {
clean = clean.replace(sym, '');
}
}
const dec = fmt?.decimalSeparator ?? '.';
const thou = fmt?.thousandsSeparator ?? ',';
clean = clean.replace(new RegExp(`\\${thou}`, 'g'), '').replace(dec, '.').replace(/[^\d.]/g, '').trim();
const val = parseFloat(clean);
return isNaN(val) ? null : val;
}
function selectText(root: ParentNode, selector: string): string | null {
  if (!selector) return null;
  const attrIdx = selector.indexOf('::attr(');
  if (attrIdx !== -1) {
    // Split on '::attr(' and strip only the trailing ')', so selectors that
    // themselves contain ')' (e.g. 'li:nth-child(2) a::attr(href)') survive.
    const sel = selector.slice(0, attrIdx).trim();
    const attr = selector.slice(attrIdx + '::attr('.length).replace(/\)\s*$/, '').trim();
    return root.querySelector(sel)?.getAttribute(attr) ?? null;
  }
  return root.querySelector(selector)?.textContent?.trim() ?? null;
}
export class GenericPlaywrightAdapter implements RetailerAdapter {
readonly key = 'generic';
async discoverTargets(ctx: AdapterContext): Promise<Target[]> {
return ctx.config.discovery.seeds.map((s) => ({
id: s.id,
url: s.url.startsWith('http') ? s.url : `${ctx.config.baseUrl}${s.url}`,
category: s.category ?? s.id,
}));
}
async fetchTarget(ctx: AdapterContext, target: Target): Promise<FetchResult> {
    // Cap every page fetch at 30s; provider-specific options from the retailer
    // YAML are merged in by fetchWithFallback.
    const result = await fetchWithFallback(target.url, ctx.config.acquisition, { timeout: 30_000 });
return {
url: result.url,
html: result.html,
markdown: result.markdown,
statusCode: result.statusCode,
fetchedAt: result.fetchedAt,
};
}
async parseListing(ctx: AdapterContext, result: FetchResult): Promise<ParsedProduct[]> {
const selectors = ctx.config.extraction?.productCard;
if (!selectors) return [];
const dom = new JSDOM(result.html);
const doc = dom.window.document;
const cards = doc.querySelectorAll(selectors.container);
const products: ParsedProduct[] = [];
for (const card of cards) {
try {
const rawTitle = selectText(card as unknown as Document, selectors.title) ?? '';
if (!rawTitle) continue;
const priceText = selectText(card as unknown as Document, selectors.price);
const price = parsePrice(priceText, ctx.config);
if (!price) continue;
const listPriceText = selectors.listPrice
? selectText(card as unknown as Document, selectors.listPrice)
: null;
const listPrice = parsePrice(listPriceText, ctx.config);
const relUrl = selectText(card as unknown as Document, selectors.url) ?? '';
const sourceUrl = relUrl.startsWith('http') ? relUrl : `${ctx.config.baseUrl}${relUrl}`;
products.push({
sourceUrl,
rawTitle,
rawBrand: selectors.brand ? selectText(card as unknown as Document, selectors.brand) : null,
rawSizeText: selectors.sizeText
? selectText(card as unknown as Document, selectors.sizeText)
: null,
imageUrl: selectors.imageUrl
? selectText(card as unknown as Document, selectors.imageUrl)
: null,
categoryText: null,
retailerSku: selectors.sku ? selectText(card as unknown as Document, selectors.sku) : null,
price,
listPrice,
promoPrice: price < (listPrice ?? price) ? price : null,
promoText: null,
inStock: true,
rawPayload: { title: rawTitle, price: priceText, url: relUrl },
});
} catch (err) {
ctx.logger.warn(`[generic] parse error on card: ${err}`);
}
}
return products;
}
async parseProduct(ctx: AdapterContext, result: FetchResult): Promise<ParsedProduct> {
const selectors = ctx.config.extraction?.productPage;
const dom = new JSDOM(result.html);
const doc = dom.window.document;
const rawTitle = selectors?.title ? (selectText(doc, selectors.title) ?? '') : '';
const priceText = selectors?.price ? selectText(doc, selectors.price) : null;
const jsonld = selectors?.jsonld ? doc.querySelector(selectors.jsonld)?.textContent : null;
let jsonldData: Record<string, unknown> = {};
if (jsonld) {
try { jsonldData = JSON.parse(jsonld) as Record<string, unknown>; } catch { /* ignore malformed JSON-LD */ }
}
// Prefer the selector price; fall back to the JSON-LD offer price before 0.
const offerPrice = parseFloat(String((jsonldData.offers as { price?: unknown } | undefined)?.price ?? ''));
const price = parsePrice(priceText, ctx.config) ?? (Number.isNaN(offerPrice) ? 0 : offerPrice);
return {
sourceUrl: result.url,
rawTitle: rawTitle || (jsonldData.name as string) || '',
rawBrand: (jsonldData.brand as { name?: string })?.name ?? null,
rawSizeText: null,
imageUrl: (jsonldData.image as string) ?? null,
categoryText: selectors?.categoryPath ? selectText(doc, selectors.categoryPath) : null,
retailerSku: selectors?.sku ? selectText(doc, selectors.sku) : null,
price,
listPrice: null,
promoPrice: null,
promoText: null,
inStock: true,
rawPayload: { title: rawTitle, price: priceText, jsonld: jsonldData },
};
}
async validateConfig(config: RetailerConfig): Promise<string[]> {
const errors: string[] = [];
if (!config.baseUrl) errors.push('baseUrl is required');
if (!config.discovery.seeds?.length) errors.push('at least one discovery seed is required');
if (!config.extraction?.productCard?.container) errors.push('extraction.productCard.container is required');
return errors;
}
}
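The `priceFormat` handling in `parsePrice` is clearest with a European-style string. A standalone re-statement of the same normalization steps (the real function reads the separators from the retailer YAML; this sketch uses split/join for the thousands separator to avoid regex-escaping it):

```typescript
interface PriceFormat {
  decimalSeparator: string;
  thousandsSeparator: string;
  currencySymbols: string[];
}

// Same steps as parsePrice: strip currency symbols, drop thousands
// separators, convert the decimal separator to '.', strip leftovers, parse.
function parsePriceSketch(text: string, fmt: PriceFormat): number | null {
  let clean = text;
  for (const sym of fmt.currencySymbols) clean = clean.replace(sym, '');
  if (fmt.thousandsSeparator) clean = clean.split(fmt.thousandsSeparator).join('');
  clean = clean.replace(fmt.decimalSeparator, '.').replace(/[^\d.]/g, '').trim();
  const val = parseFloat(clean);
  return Number.isNaN(val) ? null : val;
}
```

The order matters: the thousands separator must be removed before the decimal separator is rewritten, otherwise a format like `1.234,56` would gain two dots.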


@@ -0,0 +1,48 @@
import type { RetailerConfig } from '../config/types.js';
export interface ParsedProduct {
sourceUrl: string;
rawTitle: string;
rawBrand: string | null;
rawSizeText: string | null;
imageUrl: string | null;
categoryText: string | null;
retailerSku: string | null;
price: number;
listPrice: number | null;
promoPrice: number | null;
promoText: string | null;
inStock: boolean;
rawPayload: Record<string, unknown>;
}
export interface AdapterContext {
config: RetailerConfig;
runId: string;
logger: { info: (msg: string, ...args: unknown[]) => void; warn: (msg: string, ...args: unknown[]) => void; error: (msg: string, ...args: unknown[]) => void };
}
export interface Target {
id: string;
url: string;
category: string;
metadata?: Record<string, unknown>;
}
export interface FetchResult {
url: string;
html: string;
markdown?: string;
statusCode: number;
fetchedAt: Date;
}
export interface RetailerAdapter {
readonly key: string;
discoverTargets(ctx: AdapterContext): Promise<Target[]>;
fetchTarget(ctx: AdapterContext, target: Target): Promise<FetchResult>;
parseListing(ctx: AdapterContext, result: FetchResult): Promise<ParsedProduct[]>;
parseProduct(ctx: AdapterContext, result: FetchResult): Promise<ParsedProduct>;
validateConfig(config: RetailerConfig): Promise<string[]>;
}


@@ -0,0 +1,22 @@
import type { FastifyInstance } from 'fastify';
import { getPool } from '../../db/client.js';
export async function healthRoutes(fastify: FastifyInstance) {
fastify.get('/', async (_request, reply) => {
const checks: Record<string, 'ok' | 'fail'> = {};
try {
await getPool().query('SELECT 1');
checks.postgres = 'ok';
} catch {
checks.postgres = 'fail';
}
const healthy = Object.values(checks).every((v) => v === 'ok');
return reply.status(healthy ? 200 : 503).send({
status: healthy ? 'ok' : 'degraded',
checks,
timestamp: new Date().toISOString(),
});
});
}


@@ -0,0 +1,84 @@
import type { FastifyInstance } from 'fastify';
import {
buildBasketSeriesSnapshot,
buildCategoriesSnapshot,
buildFreshnessSnapshot,
buildMoversSnapshot,
buildOverviewSnapshot,
buildRetailerSpreadSnapshot,
} from '../../snapshots/worldmonitor.js';
export async function worldmonitorRoutes(fastify: FastifyInstance) {
fastify.get('/overview', async (request, reply) => {
const { market = 'ae' } = request.query as { market?: string };
try {
const data = await buildOverviewSnapshot(market);
return reply.send(data);
} catch (err) {
fastify.log.error(err);
return reply.status(500).send({ error: 'failed to build overview snapshot' });
}
});
fastify.get('/movers', async (request, reply) => {
const { market = 'ae', days = '30' } = request.query as { market?: string; days?: string };
try {
const parsedDays = Number.parseInt(days, 10);
const data = await buildMoversSnapshot(market, Number.isNaN(parsedDays) ? 30 : parsedDays);
return reply.send(data);
} catch (err) {
fastify.log.error(err);
return reply.status(500).send({ error: 'failed to build movers snapshot' });
}
});
fastify.get('/retailer-spread', async (request, reply) => {
const { market = 'ae', basket = 'essentials-ae' } = request.query as {
market?: string;
basket?: string;
};
try {
const data = await buildRetailerSpreadSnapshot(market, basket);
return reply.send(data);
} catch (err) {
fastify.log.error(err);
return reply.status(500).send({ error: 'failed to build retailer spread snapshot' });
}
});
fastify.get('/freshness', async (request, reply) => {
const { market = 'ae' } = request.query as { market?: string };
try {
const data = await buildFreshnessSnapshot(market);
return reply.send(data);
} catch (err) {
fastify.log.error(err);
return reply.status(500).send({ error: 'failed to build freshness snapshot' });
}
});
fastify.get('/categories', async (request, reply) => {
const { market = 'ae', range = '30d' } = request.query as { market?: string; range?: string };
try {
const data = await buildCategoriesSnapshot(market, range);
return reply.send(data);
} catch (err) {
fastify.log.error(err);
return reply.status(500).send({ error: 'failed to build categories snapshot' });
}
});
fastify.get('/basket-series', async (request, reply) => {
const { market = 'ae', basket = 'essentials-ae', range = '30d' } = request.query as {
market?: string;
basket?: string;
range?: string;
};
try {
const data = await buildBasketSeriesSnapshot(market, basket, range);
return reply.send(data);
} catch (err) {
fastify.log.error(err);
return reply.status(500).send({ error: 'failed to build basket series snapshot' });
}
});
}


@@ -0,0 +1,39 @@
import 'dotenv/config';
import Fastify from 'fastify';
import cors from '@fastify/cors';
import { worldmonitorRoutes } from './routes/worldmonitor.js';
import { healthRoutes } from './routes/health.js';
const server = Fastify({ logger: { level: process.env.LOG_LEVEL ?? 'info' } });
await server.register(cors, {
origin: process.env.CORS_ORIGIN ?? '*',
methods: ['GET'],
});
const API_KEY = process.env.WORLDMONITOR_SNAPSHOT_API_KEY;
server.addHook('onRequest', async (request, reply) => {
  if (request.url.startsWith('/health')) return;
  if (API_KEY) {
    const provided = request.headers['x-api-key'];
    if (provided !== API_KEY) {
      // Return the reply so Fastify halts the lifecycle after the 401
      // instead of continuing into the route handler.
      return reply.status(401).send({ error: 'unauthorized' });
    }
  }
});
await server.register(worldmonitorRoutes, { prefix: '/wm/consumer-prices/v1' });
await server.register(healthRoutes, { prefix: '/health' });
const port = parseInt(process.env.PORT ?? '3400', 10);
const host = process.env.HOST ?? '0.0.0.0';
try {
await server.listen({ port, host });
console.log(`consumer-prices-core listening on ${host}:${port}`);
} catch (err) {
server.log.error(err);
process.exit(1);
}


@@ -0,0 +1,40 @@
import { readFileSync, readdirSync } from 'node:fs';
import { join, dirname } from 'node:path';
import { fileURLToPath } from 'node:url';
import yaml from 'js-yaml';
import { BasketConfigSchema, RetailerConfigSchema } from './types.js';
import type { BasketConfig, RetailerConfig } from './types.js';
const CONFIG_DIR = join(dirname(fileURLToPath(import.meta.url)), '../../configs');
export function loadRetailerConfig(slug: string): RetailerConfig {
const filePath = join(CONFIG_DIR, 'retailers', `${slug}.yaml`);
const raw = readFileSync(filePath, 'utf8');
const parsed = RetailerConfigSchema.parse(yaml.load(raw));
return parsed.retailer;
}
export function loadAllRetailerConfigs(): RetailerConfig[] {
const dir = join(CONFIG_DIR, 'retailers');
const files = readdirSync(dir).filter((f) => f.endsWith('.yaml'));
return files.map((f) => {
const raw = readFileSync(join(dir, f), 'utf8');
return RetailerConfigSchema.parse(yaml.load(raw)).retailer;
});
}
export function loadBasketConfig(slug: string): BasketConfig {
const filePath = join(CONFIG_DIR, 'baskets', `${slug}.yaml`);
const raw = readFileSync(filePath, 'utf8');
const parsed = BasketConfigSchema.parse(yaml.load(raw));
return parsed.basket;
}
export function loadAllBasketConfigs(): BasketConfig[] {
const dir = join(CONFIG_DIR, 'baskets');
const files = readdirSync(dir).filter((f) => f.endsWith('.yaml'));
return files.map((f) => {
const raw = readFileSync(join(dir, f), 'utf8');
return BasketConfigSchema.parse(yaml.load(raw)).basket;
});
}


@@ -0,0 +1,110 @@
import { z } from 'zod';
export const AcquisitionConfigSchema = z.object({
provider: z.enum(['playwright', 'exa', 'firecrawl', 'p0']),
fallback: z.enum(['playwright', 'exa', 'firecrawl', 'p0']).optional(),
options: z
.object({
waitForSelector: z.string().optional(),
timeout: z.number().optional(),
retries: z.number().optional(),
})
.optional(),
searchMode: z.boolean().optional(),
searchQueryTemplate: z.string().optional(),
});
export const RateLimitSchema = z.object({
requestsPerMinute: z.number().default(30),
maxConcurrency: z.number().default(2),
delayBetweenRequestsMs: z.number().default(2_000),
});
export const ProductCardSelectorsSchema = z.object({
container: z.string(),
title: z.string(),
price: z.string(),
listPrice: z.string().optional(),
url: z.string(),
imageUrl: z.string().optional(),
sizeText: z.string().optional(),
inStock: z.string().optional(),
sku: z.string().optional(),
brand: z.string().optional(),
});
export const ProductPageSelectorsSchema = z.object({
title: z.string(),
sku: z.string().optional(),
categoryPath: z.string().optional(),
jsonld: z.string().optional(),
price: z.string().optional(),
brand: z.string().optional(),
sizeText: z.string().optional(),
});
export const DiscoverySeedSchema = z.object({
id: z.string(),
url: z.string(),
category: z.string().optional(),
});
export const RetailerConfigSchema = z.object({
retailer: z.object({
slug: z.string(),
name: z.string(),
marketCode: z.string().length(2),
currencyCode: z.string().length(3),
adapter: z.enum(['generic', 'exa-search', 'custom']).default('generic'),
baseUrl: z.string().url(),
rateLimit: RateLimitSchema.optional(),
acquisition: AcquisitionConfigSchema,
discovery: z.object({
mode: z.enum(['category_urls', 'sitemap', 'search']).default('category_urls'),
seeds: z.array(DiscoverySeedSchema),
paginationSelector: z.string().optional(),
maxPages: z.number().default(20),
}),
extraction: z.object({
productCard: ProductCardSelectorsSchema.optional(),
productPage: ProductPageSelectorsSchema.optional(),
priceFormat: z
.object({
decimalSeparator: z.string().default('.'),
thousandsSeparator: z.string().default(','),
currencySymbols: z.array(z.string()).default([]),
})
.optional(),
}),
enabled: z.boolean().default(true),
}),
});
export type RetailerConfig = z.infer<typeof RetailerConfigSchema>['retailer'];
export const BasketItemSchema = z.object({
id: z.string(),
category: z.string(),
canonicalName: z.string(),
weight: z.number().min(0).max(1),
baseUnit: z.string(),
substitutionGroup: z.string().optional(),
minBaseQty: z.number().optional(),
maxBaseQty: z.number().optional(),
qualificationRules: z.record(z.unknown()).optional(),
});
export const BasketConfigSchema = z.object({
basket: z.object({
slug: z.string(),
name: z.string(),
marketCode: z.string().length(2),
methodology: z.enum(['fixed', 'value']),
baseDate: z.string(),
description: z.string().optional(),
items: z.array(BasketItemSchema),
}),
});
export type BasketConfig = z.infer<typeof BasketConfigSchema>['basket'];
export type BasketItem = z.infer<typeof BasketItemSchema>;
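For reference, a hypothetical retailer YAML that would pass `RetailerConfigSchema` (the slug, name, URLs, and selectors are invented for illustration; only the field shapes follow the schema above):

```yaml
retailer:
  slug: example-mart
  name: Example Mart
  marketCode: ae
  currencyCode: AED
  adapter: generic
  baseUrl: https://www.example-mart.test
  acquisition:
    provider: playwright
    fallback: firecrawl
    options:
      waitForSelector: .product-grid
      timeout: 30000
  discovery:
    mode: category_urls
    seeds:
      - id: dairy
        url: /shop/dairy
        category: dairy
  extraction:
    productCard:
      container: .product-card
      title: .product-title
      price: .price-now
      url: a.product-link::attr(href)
    priceFormat:
      decimalSeparator: "."
      thousandsSeparator: ","
      currencySymbols: ["AED"]
```

Fields with zod defaults (`enabled`, `discovery.maxPages`, rate limits) can be omitted and are filled in at parse time.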


@@ -0,0 +1,38 @@
import pg from 'pg';
const { Pool } = pg;
let _pool: pg.Pool | null = null;
export function getPool(): pg.Pool {
if (!_pool) {
const databaseUrl = process.env.DATABASE_URL;
if (!databaseUrl) throw new Error('DATABASE_URL is not set');
_pool = new Pool({
connectionString: databaseUrl,
max: 10,
idleTimeoutMillis: 30_000,
connectionTimeoutMillis: 5_000,
// Disable TLS only for local development connections.
ssl: !(databaseUrl.includes('localhost') || databaseUrl.includes('127.0.0.1')),
});
_pool.on('error', (err) => {
console.error('[db] pool error:', err.message);
});
}
return _pool;
}
export async function query<T = Record<string, unknown>>(
sql: string,
params?: unknown[],
): Promise<pg.QueryResult<T>> {
const pool = getPool();
return pool.query<T>(sql, params);
}
export async function closePool(): Promise<void> {
await _pool?.end();
_pool = null;
}


@@ -0,0 +1,60 @@
/**
* Simple forward-only migration runner.
* Run: tsx src/db/migrate.ts
*/
import { readFileSync, readdirSync } from 'node:fs';
import { join, dirname } from 'node:path';
import { fileURLToPath } from 'node:url';
import 'dotenv/config';
import { getPool } from './client.js';
const MIGRATIONS_DIR = join(dirname(fileURLToPath(import.meta.url)), '../../migrations');
async function run() {
const pool = getPool();
await pool.query(`
CREATE TABLE IF NOT EXISTS schema_migrations (
version VARCHAR(64) PRIMARY KEY,
applied_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
)
`);
const applied = await pool.query<{ version: string }>(`SELECT version FROM schema_migrations ORDER BY version`);
const appliedSet = new Set(applied.rows.map((r) => r.version));
const files = readdirSync(MIGRATIONS_DIR)
.filter((f) => f.endsWith('.sql'))
.sort();
for (const file of files) {
const version = file.replace('.sql', '');
if (appliedSet.has(version)) {
console.log(` [skip] ${file}`);
continue;
}
const sql = readFileSync(join(MIGRATIONS_DIR, file), 'utf8');
console.log(` [run] ${file}`);
const client = await pool.connect();
try {
await client.query('BEGIN');
await client.query(sql);
await client.query(`INSERT INTO schema_migrations (version) VALUES ($1)`, [version]);
await client.query('COMMIT');
console.log(` [done] ${file}`);
} catch (err) {
await client.query('ROLLBACK');
console.error(` [fail] ${file}:`, err);
process.exit(1);
} finally {
client.release();
}
}
await pool.end();
console.log('Migrations complete.');
}
run().catch((err) => {
  console.error(err);
  process.exit(1);
});
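The runner's skip/apply decision above is just a set difference on version names. Factored into a pure function (a sketch for illustration, not part of the runner):

```typescript
// A migration runs iff its version (filename minus .sql) is not already
// recorded in schema_migrations. Sorting is lexicographic, which is why
// migration files should carry a zero-padded numeric or date prefix.
function pendingMigrations(files: string[], applied: Set<string>): string[] {
  return files
    .filter((f) => f.endsWith('.sql'))
    .sort()
    .filter((f) => !applied.has(f.replace(/\.sql$/, '')));
}
```

Because each file runs inside its own BEGIN/COMMIT, a failure leaves `schema_migrations` consistent and the next run resumes at the failed file.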


@@ -0,0 +1,116 @@
export interface Retailer {
id: string;
slug: string;
name: string;
marketCode: string;
countryCode: string;
currencyCode: string;
adapterKey: string;
baseUrl: string;
active: boolean;
createdAt: Date;
updatedAt: Date;
}
export interface RetailerTarget {
id: string;
retailerId: string;
targetType: 'category_url' | 'product_url' | 'search_query';
targetRef: string;
categorySlug: string;
enabled: boolean;
lastScrapedAt: Date | null;
}
export interface CanonicalProduct {
id: string;
canonicalName: string;
brandNorm: string | null;
category: string;
variantNorm: string | null;
sizeValue: number | null;
sizeUnit: string | null;
baseQuantity: number | null;
baseUnit: string | null;
active: boolean;
createdAt: Date;
}
export interface RetailerProduct {
id: string;
retailerId: string;
retailerSku: string | null;
canonicalProductId: string | null;
sourceUrl: string;
rawTitle: string;
rawBrand: string | null;
rawSizeText: string | null;
imageUrl: string | null;
categoryText: string | null;
firstSeenAt: Date;
lastSeenAt: Date;
active: boolean;
}
export interface PriceObservation {
id: string;
retailerProductId: string;
scrapeRunId: string;
observedAt: Date;
price: number;
listPrice: number | null;
promoPrice: number | null;
currencyCode: string;
unitPrice: number | null;
unitBasisQty: number | null;
unitBasisUnit: string | null;
inStock: boolean;
promoText: string | null;
rawPayloadJson: Record<string, unknown>;
rawHash: string;
}
export interface ScrapeRun {
id: string;
retailerId: string;
startedAt: Date;
finishedAt: Date | null;
status: 'running' | 'completed' | 'failed' | 'partial';
triggerType: 'scheduled' | 'manual';
pagesAttempted: number;
pagesSucceeded: number;
errorsCount: number;
configVersion: string;
}
export interface ProductMatch {
id: string;
retailerProductId: string;
canonicalProductId: string;
basketItemId: string | null;
matchScore: number;
matchStatus: 'auto' | 'review' | 'approved' | 'rejected';
evidenceJson: Record<string, unknown>;
reviewedBy: string | null;
reviewedAt: Date | null;
}
export interface ComputedIndex {
id: string;
basketId: string;
retailerId: string | null;
category: string | null;
metricDate: Date;
metricKey: string;
metricValue: number;
methodologyVersion: string;
}
export interface DataSourceHealth {
retailerId: string;
lastSuccessfulRunAt: Date | null;
lastRunStatus: string | null;
parseSuccessRate: number | null;
avgFreshnessMinutes: number | null;
updatedAt: Date;
}


@@ -0,0 +1,38 @@
import { query } from '../client.js';
export async function upsertProductMatch(input: {
retailerProductId: string;
canonicalProductId: string;
basketItemId: string;
matchScore: number;
matchStatus: 'auto' | 'approved';
}): Promise<void> {
await query(
`INSERT INTO product_matches
(retailer_product_id, canonical_product_id, basket_item_id, match_score, match_status, evidence_json)
VALUES ($1,$2,$3,$4,$5,'{}')
ON CONFLICT (retailer_product_id, canonical_product_id)
DO UPDATE SET
basket_item_id = EXCLUDED.basket_item_id,
match_score = EXCLUDED.match_score,
match_status = EXCLUDED.match_status`,
[
input.retailerProductId,
input.canonicalProductId,
input.basketItemId,
input.matchScore,
input.matchStatus,
],
);
}
export async function getBasketItemId(basketSlug: string, category: string): Promise<string | null> {
const result = await query<{ id: string }>(
`SELECT bi.id FROM basket_items bi
JOIN baskets b ON b.id = bi.basket_id
WHERE b.slug = $1 AND bi.category = $2 AND bi.active = true
LIMIT 1`,
[basketSlug, category],
);
return result.rows[0]?.id ?? null;
}


@@ -0,0 +1,91 @@
import { createHash } from 'node:crypto';
import { query } from '../client.js';
import type { PriceObservation } from '../models.js';
export interface InsertObservationInput {
retailerProductId: string;
scrapeRunId: string;
price: number;
listPrice?: number | null;
promoPrice?: number | null;
currencyCode: string;
unitPrice?: number | null;
unitBasisQty?: number | null;
unitBasisUnit?: string | null;
inStock?: boolean;
promoText?: string | null;
rawPayloadJson: Record<string, unknown>;
}
export function hashPayload(payload: Record<string, unknown>): string {
  // A sha256 hex digest is already exactly 64 characters (matches raw_hash).
  return createHash('sha256').update(JSON.stringify(payload)).digest('hex');
}
export async function insertObservation(input: InsertObservationInput): Promise<string> {
const rawHash = hashPayload(input.rawPayloadJson);
  // Dedupe against the latest observation only: matching against any
  // historical row would silently drop a price that returns to an earlier
  // value after changing in between.
  const latest = await query<{ id: string; raw_hash: string }>(
    `SELECT id, raw_hash FROM price_observations WHERE retailer_product_id = $1 ORDER BY observed_at DESC LIMIT 1`,
    [input.retailerProductId],
  );
  if (latest.rows[0]?.raw_hash === rawHash) return latest.rows[0].id;
const result = await query<{ id: string }>(
`INSERT INTO price_observations
(retailer_product_id, scrape_run_id, observed_at, price, list_price, promo_price,
currency_code, unit_price, unit_basis_qty, unit_basis_unit, in_stock, promo_text,
raw_payload_json, raw_hash)
VALUES ($1,$2,NOW(),$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13)
RETURNING id`,
[
input.retailerProductId,
input.scrapeRunId,
input.price,
input.listPrice ?? null,
input.promoPrice ?? null,
input.currencyCode,
input.unitPrice ?? null,
input.unitBasisQty ?? null,
input.unitBasisUnit ?? null,
input.inStock ?? true,
input.promoText ?? null,
JSON.stringify(input.rawPayloadJson),
rawHash,
],
);
return result.rows[0].id;
}
export async function getLatestObservations(
retailerProductIds: string[],
): Promise<PriceObservation[]> {
if (retailerProductIds.length === 0) return [];
const result = await query<PriceObservation>(
`SELECT DISTINCT ON (retailer_product_id) *
FROM price_observations
WHERE retailer_product_id = ANY($1) AND in_stock = true
ORDER BY retailer_product_id, observed_at DESC`,
[retailerProductIds],
);
return result.rows;
}
export async function getPriceHistory(
retailerProductId: string,
daysBack: number,
): Promise<Array<{ date: Date; price: number; unitPrice: number | null }>> {
  // node-pg returns NUMERIC columns as strings; cast the rounded averages to
  // float8 so the typed `number` fields really hold numbers.
  const result = await query<{ date: Date; price: number; unit_price: number | null }>(
    `SELECT date_trunc('day', observed_at) AS date,
            ROUND(AVG(price)::numeric, 2)::float8 AS price,
            ROUND(AVG(unit_price)::numeric, 4)::float8 AS unit_price
     FROM price_observations
     WHERE retailer_product_id = $1
       AND observed_at > NOW() - ($2 || ' days')::INTERVAL
       AND in_stock = true
     GROUP BY 1
     ORDER BY 1`,
    [retailerProductId, daysBack],
  );
return result.rows.map((r) => ({ date: r.date, price: r.price, unitPrice: r.unit_price }));
}
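The raw-hash dedupe above depends on `JSON.stringify` being deterministic for the payload it is given. A quick check of `hashPayload`'s behavior (re-stated with `node:crypto`):

```typescript
import { createHash } from 'node:crypto';

// Same as hashPayload above: sha256 over JSON.stringify. Note that key order
// matters: { a: 1, b: 2 } and { b: 2, a: 1 } serialize differently and hash
// differently, so callers must build rawPayloadJson with a stable key order.
function hashPayloadSketch(payload: Record<string, unknown>): string {
  return createHash('sha256').update(JSON.stringify(payload)).digest('hex');
}
```

In practice the adapters build `rawPayload` with object literals in a fixed field order, which is what makes the hash stable across runs.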


@@ -0,0 +1,88 @@
import { query } from '../client.js';
import type { CanonicalProduct, RetailerProduct } from '../models.js';
export async function upsertRetailerProduct(input: {
retailerId: string;
retailerSku: string | null;
sourceUrl: string;
rawTitle: string;
rawBrand?: string | null;
rawSizeText?: string | null;
imageUrl?: string | null;
categoryText?: string | null;
}): Promise<string> {
const result = await query<{ id: string }>(
`INSERT INTO retailer_products
(retailer_id, retailer_sku, source_url, raw_title, raw_brand, raw_size_text,
image_url, category_text, first_seen_at, last_seen_at)
VALUES ($1,$2,$3,$4,$5,$6,$7,$8,NOW(),NOW())
ON CONFLICT (retailer_id, source_url) DO UPDATE
SET raw_title = EXCLUDED.raw_title,
raw_brand = EXCLUDED.raw_brand,
raw_size_text = EXCLUDED.raw_size_text,
image_url = EXCLUDED.image_url,
category_text = EXCLUDED.category_text,
last_seen_at = NOW()
RETURNING id`,
[
input.retailerId,
input.retailerSku ?? null,
input.sourceUrl,
input.rawTitle,
input.rawBrand ?? null,
input.rawSizeText ?? null,
input.imageUrl ?? null,
input.categoryText ?? null,
],
);
return result.rows[0].id;
}
export async function getRetailerProductsByRetailer(retailerId: string): Promise<RetailerProduct[]> {
const result = await query<RetailerProduct>(
`SELECT * FROM retailer_products WHERE retailer_id = $1 AND active = true`,
[retailerId],
);
return result.rows;
}
export async function getCanonicalProducts(_marketCode?: string): Promise<CanonicalProduct[]> {
// Canonical products are currently global; the market parameter is accepted
// for signature stability but intentionally unused.
const result = await query<CanonicalProduct>(
`SELECT * FROM canonical_products WHERE active = true ORDER BY canonical_name`,
[],
);
return result.rows;
}
export async function upsertCanonicalProduct(input: {
canonicalName: string;
brandNorm?: string | null;
category: string;
variantNorm?: string | null;
sizeValue?: number | null;
sizeUnit?: string | null;
baseQuantity?: number | null;
baseUnit?: string | null;
}): Promise<string> {
const result = await query<{ id: string }>(
`INSERT INTO canonical_products
(canonical_name, brand_norm, category, variant_norm, size_value, size_unit,
base_quantity, base_unit)
VALUES ($1,$2,$3,$4,$5,$6,$7,$8)
ON CONFLICT (canonical_name, brand_norm, category, variant_norm, size_value, size_unit)
DO UPDATE SET base_quantity = EXCLUDED.base_quantity, base_unit = EXCLUDED.base_unit
RETURNING id`,
[
input.canonicalName,
input.brandNorm ?? null,
input.category,
input.variantNorm ?? null,
input.sizeValue ?? null,
input.sizeUnit ?? null,
input.baseQuantity ?? null,
input.baseUnit ?? null,
],
);
return result.rows[0].id;
}


@@ -0,0 +1,208 @@
/**
* Aggregate job: computes basket indices from latest price observations.
* Produces Fixed Basket Index and Value Basket Index per methodology.
*/
import { query } from '../db/client.js';
import { loadAllBasketConfigs } from '../config/loader.js';
const logger = {
info: (msg: string, ...args: unknown[]) => console.log(`[aggregate] ${msg}`, ...args),
warn: (msg: string, ...args: unknown[]) => console.warn(`[aggregate] ${msg}`, ...args),
};
interface BasketRow {
basketItemId: string;
category: string;
weight: number;
retailerProductId: string;
retailerSlug: string;
price: number;
unitPrice: number | null;
currencyCode: string;
observedAt: Date;
}
async function getBasketRows(basketSlug: string, marketCode: string): Promise<BasketRow[]> {
const result = await query<{
basket_item_id: string;
category: string;
weight: string;
retailer_product_id: string;
retailer_slug: string;
price: string;
unit_price: string | null;
currency_code: string;
observed_at: Date;
}>(
`SELECT bi.id AS basket_item_id,
bi.category,
bi.weight,
rp.id AS retailer_product_id,
r.slug AS retailer_slug,
po.price,
po.unit_price,
po.currency_code,
po.observed_at
FROM baskets b
JOIN basket_items bi ON bi.basket_id = b.id AND bi.active = true
JOIN product_matches pm ON pm.basket_item_id = bi.id AND pm.match_status IN ('auto','approved')
JOIN retailer_products rp ON rp.id = pm.retailer_product_id AND rp.active = true
JOIN retailers r ON r.id = rp.retailer_id AND r.market_code = $2 AND r.active = true
JOIN LATERAL (
SELECT price, unit_price, currency_code, observed_at
FROM price_observations
WHERE retailer_product_id = rp.id AND in_stock = true
ORDER BY observed_at DESC LIMIT 1
) po ON true
WHERE b.slug = $1`,
[basketSlug, marketCode],
);
return result.rows.map((r) => ({
basketItemId: r.basket_item_id,
category: r.category,
weight: parseFloat(r.weight),
retailerProductId: r.retailer_product_id,
retailerSlug: r.retailer_slug,
price: parseFloat(r.price),
unitPrice: r.unit_price ? parseFloat(r.unit_price) : null,
currencyCode: r.currency_code,
observedAt: r.observed_at,
}));
}
async function getBaselinePrices(basketItemIds: string[], baseDate: string): Promise<Map<string, number>> {
const result = await query<{ basket_item_id: string; price: string }>(
`SELECT pm.basket_item_id, AVG(po.price)::numeric(12,2) AS price
FROM price_observations po
JOIN product_matches pm ON pm.retailer_product_id = po.retailer_product_id
WHERE pm.basket_item_id = ANY($1)
AND po.in_stock = true
AND DATE_TRUNC('day', po.observed_at) = $2::date
GROUP BY pm.basket_item_id`,
[basketItemIds, baseDate],
);
const map = new Map<string, number>();
for (const row of result.rows) {
map.set(row.basket_item_id, parseFloat(row.price));
}
return map;
}
function computeFixedIndex(rows: BasketRow[], baselines: Map<string, number>): number {
let weightedSum = 0;
let totalWeight = 0;
const byItem = new Map<string, BasketRow[]>();
for (const r of rows) {
if (!byItem.has(r.basketItemId)) byItem.set(r.basketItemId, []);
byItem.get(r.basketItemId)!.push(r);
}
for (const [itemId, itemRows] of byItem) {
const base = baselines.get(itemId);
if (!base) continue;
const avgPrice = itemRows.reduce((s, r) => s + r.price, 0) / itemRows.length;
const weight = itemRows[0].weight;
weightedSum += weight * (avgPrice / base);
totalWeight += weight;
}
if (totalWeight === 0) return 100;
return 100 * (weightedSum / totalWeight);
}
function computeValueIndex(rows: BasketRow[], baselines: Map<string, number>): number {
// Value index: same as fixed index but using the cheapest available price
// per basket item (floor price across retailers), not the average.
const byItem = new Map<string, BasketRow[]>();
for (const r of rows) {
if (!byItem.has(r.basketItemId)) byItem.set(r.basketItemId, []);
byItem.get(r.basketItemId)!.push(r);
}
let weightedSum = 0;
let totalWeight = 0;
for (const [itemId, itemRows] of byItem) {
const base = baselines.get(itemId);
if (!base) continue;
const floorPrice = itemRows.reduce((min, r) => Math.min(min, r.price), Infinity);
const weight = itemRows[0].weight;
weightedSum += weight * (floorPrice / base);
totalWeight += weight;
}
if (totalWeight === 0) return 100;
return 100 * (weightedSum / totalWeight);
}
async function writeComputedIndex(
basketId: string,
retailerId: string | null,
category: string | null,
metricKey: string,
metricValue: number,
) {
// Note: with NULL retailer_id/category, this ON CONFLICT target only fires if
// the unique index treats NULLs as equal (PG 15+ NULLS NOT DISTINCT, or
// COALESCE-based expression index); otherwise basket-level rows insert
// duplicates instead of updating.
await query(
`INSERT INTO computed_indices (basket_id, retailer_id, category, metric_date, metric_key, metric_value, methodology_version)
VALUES ($1,$2,$3,NOW()::date,$4,$5,'1')
ON CONFLICT (basket_id, retailer_id, category, metric_date, metric_key)
DO UPDATE SET metric_value = EXCLUDED.metric_value, methodology_version = EXCLUDED.methodology_version`,
[basketId, retailerId, category, metricKey, metricValue],
);
}
export async function aggregateBasket(basketSlug: string, marketCode: string) {
const configs = loadAllBasketConfigs();
const basketConfig = configs.find((b) => b.slug === basketSlug && b.marketCode === marketCode);
if (!basketConfig) {
logger.warn(`Basket ${basketSlug}:${marketCode} not found in config`);
return;
}
const basketResult = await query<{ id: string }>(`SELECT id FROM baskets WHERE slug = $1`, [basketSlug]);
if (!basketResult.rows.length) {
logger.warn(`Basket ${basketSlug} not found in DB — run seed first`);
return;
}
const basketId = basketResult.rows[0].id;
const rows = await getBasketRows(basketSlug, marketCode);
if (rows.length === 0) {
logger.warn(`No matched products for ${basketSlug}:${marketCode}`);
return;
}
const uniqueItemIds = [...new Set(rows.map((r) => r.basketItemId))];
const baselines = await getBaselinePrices(uniqueItemIds, basketConfig.baseDate);
const essentialsIndex = computeFixedIndex(rows, baselines);
const valueIndex = computeValueIndex(rows, baselines);
const coverageCount = new Set(rows.map((r) => r.basketItemId)).size;
const totalItems = basketConfig.items.length;
const coveragePct = (coverageCount / totalItems) * 100;
await writeComputedIndex(basketId, null, null, 'essentials_index', essentialsIndex);
await writeComputedIndex(basketId, null, null, 'value_index', valueIndex);
await writeComputedIndex(basketId, null, null, 'coverage_pct', coveragePct);
logger.info(`${basketSlug}:${marketCode} essentials=${essentialsIndex.toFixed(2)} value=${valueIndex.toFixed(2)} coverage=${coveragePct.toFixed(1)}%`);
}
export async function aggregateAll() {
const configs = loadAllBasketConfigs();
for (const c of configs) {
await aggregateBasket(c.slug, c.marketCode);
}
}
if (import.meta.url === `file://${process.argv[1]}`) {
aggregateAll().catch(console.error);
}
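The fixed and value indices above share the same weighted-ratio core, sum of weight × (price / baseline) over total weight, scaled to 100, and differ only in which price they pick per item (average across retailers vs. the cheapest). A minimal self-contained sketch with hypothetical prices:

```typescript
// Hypothetical observations: milk seen at two retailers, rice at one.
type Obs = { itemId: string; weight: number; price: number };

// Shared weighted-ratio core; `pick` is AVG for the fixed index, MIN for value.
function basketIndex(
  rows: Obs[],
  baselines: Map<string, number>,
  pick: (prices: number[]) => number,
): number {
  const byItem = new Map<string, Obs[]>();
  for (const r of rows) {
    if (!byItem.has(r.itemId)) byItem.set(r.itemId, []);
    byItem.get(r.itemId)!.push(r);
  }
  let weightedSum = 0;
  let totalWeight = 0;
  for (const [itemId, itemRows] of byItem) {
    const base = baselines.get(itemId);
    if (!base) continue; // items without a baseline are skipped, as above
    weightedSum += itemRows[0].weight * (pick(itemRows.map((r) => r.price)) / base);
    totalWeight += itemRows[0].weight;
  }
  return totalWeight === 0 ? 100 : (100 * weightedSum) / totalWeight;
}

const avg = (ps: number[]) => ps.reduce((s, p) => s + p, 0) / ps.length;
const min = (ps: number[]) => Math.min(...ps);

const rows: Obs[] = [
  { itemId: 'milk', weight: 2, price: 4.4 },
  { itemId: 'milk', weight: 2, price: 4.0 },
  { itemId: 'rice', weight: 1, price: 11.0 },
];
const baselines = new Map([
  ['milk', 4.0],
  ['rice', 10.0],
]);

console.log(basketIndex(rows, baselines, avg).toFixed(1)); // 106.7 (fixed)
console.log(basketIndex(rows, baselines, min).toFixed(1)); // 103.3 (value)
```

With these numbers the fixed index is 106.7 while the value index is 103.3: a shopper willing to switch to the cheapest retailer sees roughly half the measured increase.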


@@ -0,0 +1,133 @@
/**
* Publish job: builds compact WorldMonitor snapshot payloads and writes to Redis.
* This is the handoff point between consumer-prices-core and WorldMonitor.
*/
import { createClient } from 'redis';
import {
buildBasketSeriesSnapshot,
buildCategoriesSnapshot,
buildFreshnessSnapshot,
buildMoversSnapshot,
buildOverviewSnapshot,
buildRetailerSpreadSnapshot,
} from '../snapshots/worldmonitor.js';
import { loadAllBasketConfigs, loadAllRetailerConfigs } from '../config/loader.js';
const logger = {
info: (msg: string, ...args: unknown[]) => console.log(`[publish] ${msg}`, ...args),
warn: (msg: string, ...args: unknown[]) => console.warn(`[publish] ${msg}`, ...args),
error: (msg: string, ...args: unknown[]) => console.error(`[publish] ${msg}`, ...args),
};
function makeKey(parts: string[]): string {
return parts.join(':');
}
function recordCount(data: unknown): number {
if (!data || typeof data !== 'object') return 1;
const d = data as Record<string, unknown>;
const arr = d.retailers ?? d.risers ?? d.essentialsSeries ?? d.categories;
return Array.isArray(arr) ? arr.length : 1;
}
async function writeSnapshot(
redis: ReturnType<typeof createClient>,
key: string,
data: unknown,
ttlSeconds: number,
) {
const json = JSON.stringify(data);
await redis.setEx(key, ttlSeconds, json);
await redis.setEx(
makeKey(['seed-meta', key]),
ttlSeconds * 2,
JSON.stringify({ fetchedAt: Date.now(), recordCount: recordCount(data) }),
);
logger.info(` wrote ${key} (${json.length} bytes, ttl=${ttlSeconds}s)`);
}
export async function publishAll() {
const redisUrl = process.env.REDIS_URL;
if (!redisUrl) throw new Error('REDIS_URL is not set');
const redis = createClient({ url: redisUrl });
await redis.connect();
try {
const retailers = loadAllRetailerConfigs().filter((r) => r.enabled);
const markets = [...new Set(retailers.map((r) => r.marketCode))];
const baskets = loadAllBasketConfigs();
for (const marketCode of markets) {
logger.info(`Publishing snapshots for market: ${marketCode}`);
try {
const overview = await buildOverviewSnapshot(marketCode);
await writeSnapshot(redis, makeKey(['consumer-prices', 'overview', marketCode]), overview, 1800);
} catch (err) {
logger.error(`overview:${marketCode} failed: ${err}`);
}
for (const days of [7, 30]) {
try {
const movers = await buildMoversSnapshot(marketCode, days);
await writeSnapshot(redis, makeKey(['consumer-prices', 'movers', marketCode, `${days}d`]), movers, 1800);
} catch (err) {
logger.error(`movers:${marketCode}:${days}d failed: ${err}`);
}
}
try {
const freshness = await buildFreshnessSnapshot(marketCode);
await writeSnapshot(redis, makeKey(['consumer-prices', 'freshness', marketCode]), freshness, 600);
} catch (err) {
logger.error(`freshness:${marketCode} failed: ${err}`);
}
for (const range of ['7d', '30d', '90d']) {
try {
const categories = await buildCategoriesSnapshot(marketCode, range);
await writeSnapshot(redis, makeKey(['consumer-prices', 'categories', marketCode, range]), categories, 1800);
} catch (err) {
logger.error(`categories:${marketCode}:${range} failed: ${err}`);
}
}
for (const basket of baskets.filter((b) => b.marketCode === marketCode)) {
try {
const spread = await buildRetailerSpreadSnapshot(marketCode, basket.slug);
await writeSnapshot(
redis,
makeKey(['consumer-prices', 'retailer-spread', marketCode, basket.slug]),
spread,
1800,
);
} catch (err) {
logger.error(`spread:${marketCode}:${basket.slug} failed: ${err}`);
}
for (const range of ['7d', '30d', '90d']) {
try {
const series = await buildBasketSeriesSnapshot(marketCode, basket.slug, range);
await writeSnapshot(
redis,
makeKey(['consumer-prices', 'basket-series', marketCode, basket.slug, range]),
series,
3600,
);
} catch (err) {
logger.error(`basket-series:${marketCode}:${basket.slug}:${range} failed: ${err}`);
}
}
}
}
logger.info('Publish complete');
} finally {
await redis.disconnect();
}
}
if (import.meta.url === `file://${process.argv[1]}`) {
publishAll().catch(console.error);
}
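The publish loop writes every snapshot under a colon-joined key and pairs it with a `seed-meta:`-prefixed sibling holding fetch metadata at twice the TTL. A sketch of the resulting key layout (the `AE` market code and `uae-essentials` basket slug here are hypothetical examples):

```typescript
// Key scheme used by writeSnapshot above: colon-joined parts, plus a
// seed-meta sibling that carries { fetchedAt, recordCount } at 2x TTL.
const makeKey = (parts: string[]) => parts.join(':');

const key = makeKey(['consumer-prices', 'basket-series', 'AE', 'uae-essentials', '30d']);
console.log(key); // consumer-prices:basket-series:AE:uae-essentials:30d
console.log(makeKey(['seed-meta', key])); // seed-meta:consumer-prices:basket-series:AE:uae-essentials:30d
```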


@@ -0,0 +1,187 @@
/**
* Scrape job: discovers targets and writes price observations to Postgres.
* Respects per-retailer rate limits and acquisition provider config.
*/
import { query } from '../db/client.js';
import { insertObservation } from '../db/queries/observations.js';
import { upsertRetailerProduct } from '../db/queries/products.js';
import { parseSize, unitPrice as calcUnitPrice } from '../normalizers/size.js';
import { loadAllRetailerConfigs, loadRetailerConfig } from '../config/loader.js';
import { initProviders, teardownAll } from '../acquisition/registry.js';
import { GenericPlaywrightAdapter } from '../adapters/generic.js';
import { ExaSearchAdapter } from '../adapters/exa-search.js';
import type { AdapterContext } from '../adapters/types.js';
import { upsertCanonicalProduct } from '../db/queries/products.js';
import { getBasketItemId, upsertProductMatch } from '../db/queries/matches.js';
const logger = {
info: (msg: string, ...args: unknown[]) => console.log(`[scrape] ${msg}`, ...args),
warn: (msg: string, ...args: unknown[]) => console.warn(`[scrape] ${msg}`, ...args),
error: (msg: string, ...args: unknown[]) => console.error(`[scrape] ${msg}`, ...args),
};
async function sleep(ms: number) {
return new Promise((r) => setTimeout(r, ms));
}
async function getOrCreateRetailer(slug: string, config: ReturnType<typeof loadRetailerConfig>) {
const existing = await query<{ id: string }>(`SELECT id FROM retailers WHERE slug = $1`, [slug]);
if (existing.rows.length > 0) return existing.rows[0].id;
const result = await query<{ id: string }>(
`INSERT INTO retailers (slug, name, market_code, country_code, currency_code, adapter_key, base_url)
VALUES ($1,$2,$3,$3,$4,$5,$6) RETURNING id`, // $3 twice: country_code mirrors market_code for single-country markets
[slug, config.name, config.marketCode, config.currencyCode, config.adapter, config.baseUrl],
);
return result.rows[0].id;
}
async function createScrapeRun(retailerId: string): Promise<string> {
const result = await query<{ id: string }>(
`INSERT INTO scrape_runs (retailer_id, started_at, status, trigger_type, pages_attempted, pages_succeeded, errors_count, config_version)
VALUES ($1, NOW(), 'running', 'scheduled', 0, 0, 0, '1') RETURNING id`,
[retailerId],
);
return result.rows[0].id;
}
async function updateScrapeRun(
runId: string,
status: string,
pagesAttempted: number,
pagesSucceeded: number,
errorsCount: number,
) {
await query(
`UPDATE scrape_runs SET status=$2, finished_at=NOW(), pages_attempted=$3, pages_succeeded=$4, errors_count=$5 WHERE id=$1`,
[runId, status, pagesAttempted, pagesSucceeded, errorsCount],
);
}
export async function scrapeRetailer(slug: string) {
initProviders(process.env as Record<string, string>);
const config = loadRetailerConfig(slug);
if (!config.enabled) {
logger.info(`${slug} is disabled, skipping`);
return;
}
const retailerId = await getOrCreateRetailer(slug, config);
const runId = await createScrapeRun(retailerId);
logger.info(`Run ${runId} started for ${slug}`);
const adapter =
config.adapter === 'exa-search'
? new ExaSearchAdapter((process.env.EXA_API_KEYS || process.env.EXA_API_KEY || '').split(/[\n,]+/)[0].trim())
: new GenericPlaywrightAdapter();
const ctx: AdapterContext = { config, runId, logger };
let pagesAttempted = 0;
let pagesSucceeded = 0;
let errorsCount = 0;
try {
const targets = await adapter.discoverTargets(ctx);
logger.info(`Discovered ${targets.length} targets`);
const delay = config.rateLimit?.delayBetweenRequestsMs ?? 2_000;
for (const target of targets) {
pagesAttempted++;
try {
const fetchResult = await adapter.fetchTarget(ctx, target);
const products = await adapter.parseListing(ctx, fetchResult);
logger.info(` [${target.id}] parsed ${products.length} products`);
for (const product of products) {
const productId = await upsertRetailerProduct({
retailerId,
retailerSku: product.retailerSku,
sourceUrl: product.sourceUrl,
rawTitle: product.rawTitle,
rawBrand: product.rawBrand,
rawSizeText: product.rawSizeText,
imageUrl: product.imageUrl,
categoryText: product.categoryText ?? target.category,
});
const parsed = parseSize(product.rawSizeText);
const up = parsed ? calcUnitPrice(product.price, parsed) : null;
await insertObservation({
retailerProductId: productId,
scrapeRunId: runId,
price: product.price,
listPrice: product.listPrice,
promoPrice: product.promoPrice,
currencyCode: config.currencyCode,
unitPrice: up,
unitBasisQty: parsed?.baseQuantity ?? null,
unitBasisUnit: parsed?.baseUnit ?? null,
inStock: product.inStock,
promoText: product.promoText,
rawPayloadJson: product.rawPayload,
});
// For exa-search adapter: auto-create product → basket match since we
// searched for a specific basket item (no ambiguity in what was scraped).
if (
config.adapter === 'exa-search' &&
product.rawPayload.basketSlug &&
product.rawPayload.itemCategory
) {
try {
const canonicalId = await upsertCanonicalProduct({
canonicalName: (product.rawPayload.canonicalName as string) || product.rawTitle,
category: product.categoryText ?? target.category,
});
const basketItemId = await getBasketItemId(
product.rawPayload.basketSlug as string,
product.rawPayload.itemCategory as string,
);
if (basketItemId) {
await upsertProductMatch({
retailerProductId: productId,
canonicalProductId: canonicalId,
basketItemId,
matchScore: 1.0,
matchStatus: 'auto',
});
}
} catch (matchErr) {
logger.warn(` [${target.id}] product match failed: ${matchErr}`);
}
}
}
pagesSucceeded++;
} catch (err) {
errorsCount++;
logger.error(` [${target.id}] failed: ${err}`);
}
if (pagesAttempted < targets.length) await sleep(delay);
}
const status = errorsCount === 0 ? 'completed' : pagesSucceeded > 0 ? 'partial' : 'failed';
await updateScrapeRun(runId, status, pagesAttempted, pagesSucceeded, errorsCount);
logger.info(`Run ${runId} finished: ${status} (${pagesSucceeded}/${pagesAttempted} pages)`);
} catch (err) {
// Record the failure so the scrape_runs row does not stay 'running' forever
// if discovery or another unexpected error aborts the run.
await updateScrapeRun(runId, 'failed', pagesAttempted, pagesSucceeded, errorsCount + 1);
throw err;
} finally {
// Always release provider resources (Playwright browser); without this an
// uncaught error leaks the Chromium process.
await teardownAll();
}
}
export async function scrapeAll() {
initProviders(process.env as Record<string, string>);
const configs = loadAllRetailerConfigs().filter((c) => c.enabled);
logger.info(`Scraping ${configs.length} retailers`);
for (const c of configs) {
await scrapeRetailer(c.slug);
}
}
// Only run from the CLI entrypoint — mirror aggregate/publish so that merely
// importing this module does not kick off a full scrape.
if (import.meta.url === `file://${process.argv[1]}`) {
if (process.argv[2]) {
scrapeRetailer(process.argv[2]).catch(console.error);
} else {
scrapeAll().catch(console.error);
}
}


@@ -0,0 +1,88 @@
import { normalizeBrand } from '../normalizers/brand.js';
import { parseSize } from '../normalizers/size.js';
import { tokenOverlap } from '../normalizers/title.js';
import type { CanonicalProduct } from '../db/models.js';
export interface RawProduct {
rawTitle: string;
rawBrand?: string | null;
rawSizeText?: string | null;
categoryText?: string | null;
}
export interface MatchResult {
canonicalProductId: string;
score: number;
status: 'auto' | 'review' | 'reject';
evidence: {
brandExact: boolean;
categoryExact: boolean;
titleOverlap: number;
sizeExact: boolean;
sizeClose: boolean;
packCountMatch: boolean;
unitPriceRatioOk: boolean;
};
}
export function scoreMatch(raw: RawProduct, canonical: CanonicalProduct): MatchResult {
let score = 0;
const rawBrandNorm = normalizeBrand(raw.rawBrand)?.toLowerCase();
const canonBrandNorm = canonical.brandNorm?.toLowerCase();
const brandExact = !!(rawBrandNorm && canonBrandNorm && rawBrandNorm === canonBrandNorm);
if (brandExact) score += 30;
const rawCategory = (raw.categoryText ?? '').toLowerCase();
const canonCategory = canonical.category.toLowerCase();
const categoryExact = rawCategory.includes(canonCategory) || canonCategory.includes(rawCategory);
if (categoryExact) score += 20;
const overlap = tokenOverlap(raw.rawTitle, canonical.canonicalName);
score += Math.round(overlap * 15);
const rawParsed = parseSize(raw.rawSizeText);
const canonHasSize = canonical.sizeValue !== null;
let sizeExact = false;
let sizeClose = false;
let packCountMatch = false;
// A canonical base quantity is required: falling back to the raw product's own
// quantity would force ratio = 1 and award the full size bonus for free.
if (rawParsed && canonHasSize && canonical.baseQuantity != null && canonical.baseUnit === rawParsed.baseUnit) {
const ratio = rawParsed.baseQuantity / canonical.baseQuantity;
sizeExact = Math.abs(ratio - 1) < 0.01;
sizeClose = Math.abs(ratio - 1) < 0.05;
packCountMatch = rawParsed.packCount === 1;
if (sizeExact) score += 20;
else if (sizeClose) score += 10;
if (packCountMatch) score += 10;
}
const status: MatchResult['status'] = score >= 85 ? 'auto' : score >= 70 ? 'review' : 'reject';
return {
canonicalProductId: canonical.id,
score,
status,
evidence: {
brandExact,
categoryExact,
titleOverlap: overlap,
sizeExact,
sizeClose,
packCountMatch,
unitPriceRatioOk: false, // not evaluated at match time
},
};
}
export function bestMatch(raw: RawProduct, candidates: CanonicalProduct[]): MatchResult | null {
if (candidates.length === 0) return null;
const scored = candidates.map((c) => scoreMatch(raw, c));
scored.sort((a, b) => b.score - a.score);
const best = scored[0];
return best.status === 'reject' ? null : best;
}
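The score bands above sum to at most 95 (30 brand + 20 category + 15 title + 20 size + 10 pack), so a candidate can never reach the 85-point `auto` threshold without an exact brand match. A worked tally for a hypothetical candidate:

```typescript
// Score components from scoreMatch, tallied for a hypothetical candidate.
const parts = {
  brandExact: 30, // exact normalized brand
  categoryExact: 20, // category substring match
  titleOverlap: Math.round(0.8 * 15), // 80% token overlap → 12
  sizeExact: 20, // base quantities within 1%
  packCountMatch: 10, // single-pack raw product
};
const score = Object.values(parts).reduce((s, v) => s + v, 0);
const status = score >= 85 ? 'auto' : score >= 70 ? 'review' : 'reject';
console.log(score, status); // 92 auto
```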


@@ -0,0 +1,46 @@
import { readFileSync, existsSync } from 'node:fs';
import { join, dirname } from 'node:path';
import { fileURLToPath } from 'node:url';
interface BrandAliases {
aliases: Record<string, string[]>;
}
let _aliases: Map<string, string> | null = null;
function loadAliases(): Map<string, string> {
if (_aliases) return _aliases;
const filePath = join(dirname(fileURLToPath(import.meta.url)), '../../../configs/brands/aliases.json');
const map = new Map<string, string>();
if (existsSync(filePath)) {
const data = JSON.parse(readFileSync(filePath, 'utf8')) as BrandAliases;
for (const [canonical, variants] of Object.entries(data.aliases)) {
for (const v of variants) {
map.set(v.toLowerCase(), canonical);
}
}
}
_aliases = map;
return map;
}
export function normalizeBrand(raw: string | null | undefined): string | null {
if (!raw) return null;
const cleaned = raw
.trim()
.toLowerCase()
.replace(/[^\w\s]/g, ' ')
.replace(/\s+/g, ' ')
.trim();
const aliases = loadAliases();
return aliases.get(cleaned) ?? titleCase(cleaned);
}
function titleCase(s: string): string {
return s.replace(/\b\w/g, (c) => c.toUpperCase());
}
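`normalizeBrand` lowercases, maps punctuation to spaces, and collapses whitespace before the alias lookup, so spellings like `AL-MARAI` and `Al Marai` resolve to the same key. A self-contained sketch with a hypothetical alias table (the real one lives in `configs/brands/aliases.json`):

```typescript
// Hypothetical alias table keyed by cleaned variant → canonical brand.
const aliases = new Map([
  ['al marai', 'Almarai'],
  ['almarai', 'Almarai'],
]);
const titleCase = (s: string) => s.replace(/\b\w/g, (c) => c.toUpperCase());

// Same cleaning as normalizeBrand: lowercase, punctuation → space, collapse runs.
function normalize(raw: string): string {
  const cleaned = raw.trim().toLowerCase().replace(/[^\w\s]/g, ' ').replace(/\s+/g, ' ').trim();
  return aliases.get(cleaned) ?? titleCase(cleaned);
}

console.log(normalize('AL-MARAI')); // Almarai (alias hit after cleaning to "al marai")
console.log(normalize('Lurpak®')); // Lurpak (no alias; title-cased fallback)
```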


@@ -0,0 +1,84 @@
/**
* Parses and normalizes product size strings into base units.
* Handles patterns like: 2x200g, 6x1L, 500ml, 24 rolls, 3 ct, 1kg, 12 pods
*/
export interface ParsedSize {
packCount: number;
sizeValue: number;
sizeUnit: string;
baseQuantity: number;
baseUnit: string;
rawText: string;
}
const UNIT_MAP: Record<string, { base: string; factor: number }> = {
kg: { base: 'g', factor: 1000 },
g: { base: 'g', factor: 1 },
mg: { base: 'g', factor: 0.001 },
l: { base: 'ml', factor: 1000 },
lt: { base: 'ml', factor: 1000 },
ltr: { base: 'ml', factor: 1000 },
litre: { base: 'ml', factor: 1000 },
liter: { base: 'ml', factor: 1000 },
ml: { base: 'ml', factor: 1 },
cl: { base: 'ml', factor: 10 },
oz: { base: 'g', factor: 28.3495 },
lb: { base: 'g', factor: 453.592 },
ct: { base: 'ct', factor: 1 },
pc: { base: 'ct', factor: 1 },
pcs: { base: 'ct', factor: 1 },
piece: { base: 'ct', factor: 1 },
pieces: { base: 'ct', factor: 1 },
roll: { base: 'ct', factor: 1 },
rolls: { base: 'ct', factor: 1 },
pod: { base: 'ct', factor: 1 },
pods: { base: 'ct', factor: 1 },
sheet: { base: 'ct', factor: 1 },
sheets: { base: 'ct', factor: 1 },
sachet: { base: 'ct', factor: 1 },
sachets: { base: 'ct', factor: 1 },
};
const PACK_PATTERN = /^(\d+)\s*[x×]\s*(.+)$/i;
const SIZE_PATTERN = /(\d+(?:\.\d+)?)\s*([a-z]+)/i;
export function parseSize(raw: string | null | undefined): ParsedSize | null {
if (!raw) return null;
const text = raw.trim().toLowerCase();
let packCount = 1;
let sizeStr = text;
const packMatch = PACK_PATTERN.exec(text);
if (packMatch) {
packCount = parseInt(packMatch[1], 10);
sizeStr = packMatch[2].trim();
}
const sizeMatch = SIZE_PATTERN.exec(sizeStr);
if (!sizeMatch) return null;
const sizeValue = parseFloat(sizeMatch[1]);
const rawUnit = sizeMatch[2].toLowerCase().replace(/\.$/, '');
const unitDef = UNIT_MAP[rawUnit];
if (!unitDef) return null;
const baseQuantity = packCount * sizeValue * unitDef.factor;
return {
packCount,
sizeValue,
sizeUnit: rawUnit,
baseQuantity,
baseUnit: unitDef.base,
rawText: raw,
};
}
export function unitPrice(price: number, size: ParsedSize): number {
if (size.baseQuantity === 0) return price;
return price / size.baseQuantity;
}
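A condensed, self-contained version of the same two-regex pipeline (pack multiplier first, then value + unit, then base-unit conversion), showing what a few of the header's example strings normalize to (the unit table here is a small subset of `UNIT_MAP`):

```typescript
// Subset of the unit table: factors convert to base units g, ml, or ct.
const UNITS: Record<string, { base: string; factor: number }> = {
  kg: { base: 'g', factor: 1000 },
  g: { base: 'g', factor: 1 },
  l: { base: 'ml', factor: 1000 },
  ml: { base: 'ml', factor: 1 },
  rolls: { base: 'ct', factor: 1 },
};

function baseQuantity(raw: string): { qty: number; unit: string } | null {
  let pack = 1;
  let text = raw.trim().toLowerCase();
  const packMatch = /^(\d+)\s*[x×]\s*(.+)$/.exec(text); // "2x200g" → pack 2, "200g"
  if (packMatch) {
    pack = parseInt(packMatch[1], 10);
    text = packMatch[2].trim();
  }
  const sizeMatch = /(\d+(?:\.\d+)?)\s*([a-z]+)/.exec(text); // value + unit token
  if (!sizeMatch) return null;
  const def = UNITS[sizeMatch[2]];
  if (!def) return null;
  return { qty: pack * parseFloat(sizeMatch[1]) * def.factor, unit: def.base };
}

console.log(baseQuantity('2x200g')); // { qty: 400, unit: 'g' }
console.log(baseQuantity('6x1L')); // { qty: 6000, unit: 'ml' }
console.log(baseQuantity('24 rolls')); // { qty: 24, unit: 'ct' }
```

A unit price then falls out directly: 11.50 for `2x200g` is 11.50 / 400 ≈ 0.0288 per gram, which is what `unitPrice` above computes.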


@@ -0,0 +1,37 @@
const PROMO_TOKENS = new Set([
'fresh', 'save', 'sale', 'deal', 'offer', 'limited', 'new', 'best', 'value',
'buy', 'get', 'free', 'bonus', 'extra', 'special', 'exclusive', 'online only',
'website exclusive', 'price drop', 'clearance', 'now', 'only',
]);
const STOP_WORDS = new Set(['a', 'an', 'the', 'with', 'and', 'or', 'in', 'of', 'for', 'to', 'by']);
export function cleanTitle(raw: string): string {
return raw
.trim()
.toLowerCase()
.replace(/\s+/g, ' ')
.replace(/[^\w\s\-&]/g, ' ')
.split(' ')
.filter((t) => t.length > 1 && !PROMO_TOKENS.has(t))
.join(' ')
.trim();
}
export function titleTokens(title: string): string[] {
return cleanTitle(title)
.split(' ')
.filter((t) => t.length > 2 && !STOP_WORDS.has(t));
}
export function tokenOverlap(a: string, b: string): number {
const ta = new Set(titleTokens(a));
const tb = new Set(titleTokens(b));
if (ta.size === 0 || tb.size === 0) return 0;
let shared = 0;
for (const t of ta) {
if (tb.has(t)) shared++;
}
return shared / Math.min(ta.size, tb.size);
}
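`tokenOverlap` divides by the smaller set — the overlap coefficient rather than Jaccard — so a short canonical name fully contained in a long retailer title still scores 1.0. A sketch with hypothetical, post-cleaning token lists:

```typescript
// Overlap coefficient: shared tokens / size of the smaller token set.
function overlap(a: string[], b: string[]): number {
  const ta = new Set(a);
  const tb = new Set(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  return shared / Math.min(ta.size, tb.size);
}

// Long retailer title vs short canonical name: full containment scores 1.
console.log(overlap(['almarai', 'full', 'cream', 'milk', 'bottle'], ['almarai', 'milk'])); // 1
// Different brand, shared product word: half the smaller set overlaps.
console.log(overlap(['nadec', 'milk'], ['almarai', 'milk'])); // 0.5
```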


@@ -0,0 +1,555 @@
/**
* Builds compact WorldMonitor-ready snapshot payloads from computed indices.
* All types are shaped to match the proto-generated TypeScript interfaces so
* snapshots can be written to Redis and read directly by WorldMonitor handlers.
*/
import { query } from '../db/client.js';
// ---------------------------------------------------------------------------
// Snapshot interfaces — mirror proto-generated response types exactly
// (asOf is int64 → string per protobuf JSON mapping)
// ---------------------------------------------------------------------------
export interface WMCategorySnapshot {
slug: string;
name: string;
wowPct: number;
momPct: number;
currentIndex: number;
sparkline: number[];
coveragePct: number;
itemCount: number;
}
export interface WMOverviewSnapshot {
marketCode: string;
asOf: string;
currencyCode: string;
essentialsIndex: number;
valueBasketIndex: number;
wowPct: number;
momPct: number;
retailerSpreadPct: number;
coveragePct: number;
freshnessLagMin: number;
topCategories: WMCategorySnapshot[];
upstreamUnavailable: false;
}
export interface WMPriceMover {
productId: string;
title: string;
category: string;
retailerSlug: string;
changePct: number;
currentPrice: number;
currencyCode: string;
}
export interface WMMoversSnapshot {
marketCode: string;
asOf: string;
range: string;
risers: WMPriceMover[];
fallers: WMPriceMover[];
upstreamUnavailable: false;
}
export interface WMRetailerSpread {
slug: string;
name: string;
basketTotal: number;
deltaVsCheapest: number;
deltaVsCheapestPct: number;
itemCount: number;
freshnessMin: number;
currencyCode: string;
}
export interface WMRetailerSpreadSnapshot {
marketCode: string;
asOf: string;
basketSlug: string;
currencyCode: string;
retailers: WMRetailerSpread[];
spreadPct: number;
upstreamUnavailable: false;
}
export interface WMRetailerFreshness {
slug: string;
name: string;
lastRunAt: string;
status: string;
parseSuccessRate: number;
freshnessMin: number;
}
export interface WMFreshnessSnapshot {
marketCode: string;
asOf: string;
retailers: WMRetailerFreshness[];
overallFreshnessMin: number;
stalledCount: number;
upstreamUnavailable: false;
}
export interface WMBasketPoint {
date: string;
index: number;
}
export interface WMBasketSeriesSnapshot {
marketCode: string;
basketSlug: string;
asOf: string;
currencyCode: string;
range: string;
essentialsSeries: WMBasketPoint[];
valueSeries: WMBasketPoint[];
upstreamUnavailable: false;
}
// ---------------------------------------------------------------------------
// Private helpers
// ---------------------------------------------------------------------------
async function buildTopCategories(basketId: string): Promise<WMCategorySnapshot[]> {
const result = await query<{
category: string;
current_index: number | null;
prev_week_index: number | null;
coverage_pct: number | null;
}>(
`WITH today AS (
SELECT category, metric_key, metric_value::float AS metric_value
FROM computed_indices
WHERE basket_id = $1 AND category IS NOT NULL AND retailer_id IS NULL AND metric_date = CURRENT_DATE
),
last_week AS (
SELECT category, metric_key, metric_value::float AS metric_value
FROM computed_indices
WHERE basket_id = $1 AND category IS NOT NULL AND retailer_id IS NULL
AND metric_date = (
SELECT MAX(metric_date) FROM computed_indices
WHERE basket_id = $1 AND category IS NOT NULL AND retailer_id IS NULL
AND metric_date < CURRENT_DATE - INTERVAL '6 days'
)
)
SELECT
t.category,
MAX(CASE WHEN t.metric_key = 'essentials_index' THEN t.metric_value END) AS current_index,
MAX(CASE WHEN lw.metric_key = 'essentials_index' THEN lw.metric_value END) AS prev_week_index,
MAX(CASE WHEN t.metric_key = 'coverage_pct' THEN t.metric_value END) AS coverage_pct
FROM (SELECT DISTINCT category FROM today) cats
JOIN today t ON t.category = cats.category
LEFT JOIN last_week lw ON lw.category = cats.category AND lw.metric_key = t.metric_key
GROUP BY cats.category
HAVING MAX(CASE WHEN t.metric_key = 'essentials_index' THEN 1 ELSE 0 END) = 1
ORDER BY ABS(COALESCE(MAX(CASE WHEN t.metric_key = 'essentials_index' THEN t.metric_value END), 100) - 100) DESC
LIMIT 8`,
[basketId],
);
return result.rows.map((r) => {
const cur = r.current_index ?? 100;
const prev = r.prev_week_index;
const wowPct = prev && prev > 0 ? Math.round(((cur - prev) / prev) * 100 * 10) / 10 : 0;
const slug = r.category
.toLowerCase()
.replace(/[^a-z0-9]+/g, '-')
.replace(/^-|-$/g, '');
return {
slug,
name: r.category.charAt(0).toUpperCase() + r.category.slice(1),
wowPct,
momPct: 0, // TODO: requires 30-day baseline per category
currentIndex: Math.round(cur * 10) / 10,
sparkline: [], // TODO: requires per-category date series query
coveragePct: Math.round((r.coverage_pct ?? 0) * 10) / 10,
itemCount: 0, // TODO: requires basket_items count query per category
};
});
}
// ---------------------------------------------------------------------------
// Public builders
// ---------------------------------------------------------------------------
export async function buildOverviewSnapshot(marketCode: string): Promise<WMOverviewSnapshot> {
const now = Date.now();
// Resolve basket id for category queries (assumes a single basket per market)
const basketIdResult = await query<{ id: string }>(
`SELECT b.id FROM baskets b WHERE b.market_code = $1 LIMIT 1`,
[marketCode],
);
const basketId = basketIdResult.rows[0]?.id ?? null;
const [indexResult, prevWeekResult, prevMonthResult, spreadResult, currencyResult, freshnessResult] =
await Promise.all([
query<{ metric_key: string; metric_value: string }>(
`SELECT ci.metric_key, ci.metric_value
FROM computed_indices ci
JOIN baskets b ON b.id = ci.basket_id
WHERE b.market_code = $1
AND ci.retailer_id IS NULL AND ci.category IS NULL
AND ci.metric_date = (
SELECT MAX(metric_date) FROM computed_indices ci2 JOIN baskets b2 ON b2.id = ci2.basket_id
WHERE b2.market_code = $1 AND ci2.retailer_id IS NULL
)`,
[marketCode],
),
query<{ metric_key: string; metric_value: string }>(
`SELECT ci.metric_key, ci.metric_value
FROM computed_indices ci
JOIN baskets b ON b.id = ci.basket_id
WHERE b.market_code = $1
AND ci.retailer_id IS NULL AND ci.category IS NULL
AND ci.metric_date = (
SELECT MAX(metric_date) FROM computed_indices ci2 JOIN baskets b2 ON b2.id = ci2.basket_id
WHERE b2.market_code = $1 AND ci2.retailer_id IS NULL
AND ci2.metric_date < CURRENT_DATE - INTERVAL '6 days'
)`,
[marketCode],
),
query<{ metric_key: string; metric_value: string }>(
`SELECT ci.metric_key, ci.metric_value
FROM computed_indices ci
JOIN baskets b ON b.id = ci.basket_id
WHERE b.market_code = $1
AND ci.retailer_id IS NULL AND ci.category IS NULL
AND ci.metric_date = (
SELECT MAX(metric_date) FROM computed_indices ci2 JOIN baskets b2 ON b2.id = ci2.basket_id
WHERE b2.market_code = $1 AND ci2.retailer_id IS NULL
AND ci2.metric_date < CURRENT_DATE - INTERVAL '29 days'
)`,
[marketCode],
),
query<{ spread_pct: string }>(
`SELECT metric_value AS spread_pct FROM computed_indices ci
JOIN baskets b ON b.id = ci.basket_id
WHERE b.market_code = $1 AND ci.metric_key = 'retailer_spread_pct'
ORDER BY ci.metric_date DESC LIMIT 1`,
[marketCode],
),
query<{ currency_code: string }>(
`SELECT currency_code FROM retailers WHERE market_code = $1 AND active = true LIMIT 1`,
[marketCode],
),
query<{ avg_lag_min: string }>(
`SELECT AVG(EXTRACT(EPOCH FROM (NOW() - last_successful_run_at)) / 60)::int AS avg_lag_min
FROM data_source_health dsh
JOIN retailers r ON r.id = dsh.retailer_id
WHERE r.market_code = $1`,
[marketCode],
),
]);
const metrics: Record<string, number> = {};
for (const row of indexResult.rows) metrics[row.metric_key] = parseFloat(row.metric_value);
const prevWeek: Record<string, number> = {};
for (const row of prevWeekResult.rows) prevWeek[row.metric_key] = parseFloat(row.metric_value);
const prevMonth: Record<string, number> = {};
for (const row of prevMonthResult.rows) prevMonth[row.metric_key] = parseFloat(row.metric_value);
const ess = metrics.essentials_index ?? 100;
const val = metrics.value_index ?? 100;
const prevEss = prevWeek.essentials_index;
const prevMonthEss = prevMonth.essentials_index;
const wowPct = prevEss ? Math.round(((ess - prevEss) / prevEss) * 100 * 10) / 10 : 0;
const momPct = prevMonthEss ? Math.round(((ess - prevMonthEss) / prevMonthEss) * 100 * 10) / 10 : 0;
const topCategories = basketId ? await buildTopCategories(basketId) : [];
return {
marketCode,
asOf: String(now),
currencyCode: currencyResult.rows[0]?.currency_code ?? 'USD',
essentialsIndex: Math.round(ess * 10) / 10,
valueBasketIndex: Math.round(val * 10) / 10,
wowPct,
momPct,
retailerSpreadPct: spreadResult.rows[0]?.spread_pct
? Math.round(parseFloat(spreadResult.rows[0].spread_pct) * 10) / 10
: 0,
coveragePct: Math.round((metrics.coverage_pct ?? 0) * 10) / 10,
freshnessLagMin: freshnessResult.rows[0]?.avg_lag_min
? parseInt(freshnessResult.rows[0].avg_lag_min, 10)
: 0,
topCategories,
upstreamUnavailable: false,
};
}
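The wowPct/momPct lines above repeat the same round-to-one-decimal pattern; as a hedged illustration (the `pctChange` helper is hypothetical, not part of this module), the math reduces to:

```typescript
// Hypothetical helper mirroring the wowPct/momPct math in
// buildOverviewSnapshot: percent change rounded to one decimal place,
// falling back to 0 when the baseline window returned no row.
function pctChange(current: number, previous: number | undefined): number {
  if (!previous) return 0; // no prior reading (or a zero baseline)
  return Math.round(((current - previous) / previous) * 100 * 10) / 10;
}

console.log(pctChange(104.2, 100)); // 4.2
console.log(pctChange(98.5, undefined)); // 0
```

Note the falsy check also treats a zero baseline as "no data", matching the `prevEss ? … : 0` guards above.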

export async function buildMoversSnapshot(
  marketCode: string,
  rangeDays: number,
): Promise<WMMoversSnapshot> {
  const now = Date.now();
  const range = `${rangeDays}d`;
  const result = await query<{
    product_id: string;
    raw_title: string;
    category_text: string;
    retailer_slug: string;
    current_price: string;
    currency_code: string;
    change_pct: string;
  }>(
    `WITH latest AS (
       SELECT DISTINCT ON (rp.id) rp.id, rp.raw_title, rp.category_text, r.slug AS retailer_slug,
              po.price, r.currency_code
       FROM retailer_products rp
       JOIN retailers r ON r.id = rp.retailer_id AND r.market_code = $1 AND r.active = true
       JOIN price_observations po ON po.retailer_product_id = rp.id AND po.in_stock = true
       ORDER BY rp.id, po.observed_at DESC
     ),
     past AS (
       SELECT DISTINCT ON (rp.id) rp.id, po.price AS past_price
       FROM retailer_products rp
       JOIN retailers r ON r.id = rp.retailer_id AND r.market_code = $1
       JOIN price_observations po ON po.retailer_product_id = rp.id
         AND po.observed_at BETWEEN NOW() - ($2 || ' days')::INTERVAL - INTERVAL '1 day'
                                AND NOW() - ($2 || ' days')::INTERVAL
       ORDER BY rp.id, po.observed_at DESC
     )
     SELECT l.id AS product_id, l.raw_title, l.category_text, l.retailer_slug,
            l.price AS current_price, l.currency_code,
            ROUND(((l.price - p.past_price) / p.past_price * 100)::numeric, 2) AS change_pct
     FROM latest l
     JOIN past p ON p.id = l.id
     WHERE p.past_price > 0
     ORDER BY ABS((l.price - p.past_price) / p.past_price) DESC
     LIMIT 30`,
    [marketCode, rangeDays],
  );
  const all = result.rows.map((r) => ({
    productId: r.product_id,
    title: r.raw_title,
    category: r.category_text ?? 'other',
    retailerSlug: r.retailer_slug,
    currentPrice: parseFloat(r.current_price),
    currencyCode: r.currency_code,
    changePct: parseFloat(r.change_pct),
  }));
  return {
    marketCode,
    asOf: String(now),
    range,
    risers: all.filter((r) => r.changePct > 0).slice(0, 10),
    fallers: all.filter((r) => r.changePct < 0).slice(0, 10),
    upstreamUnavailable: false,
  };
}
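The riser/faller split at the end can be sketched as a pure function (assumed `Mover` shape and `splitMovers` name, for illustration only — the query already orders by absolute change, so slicing preserves the biggest movers):

```typescript
// Illustrative split of movers (rows assumed pre-sorted by |changePct| DESC,
// as the SQL above guarantees) into top risers and top fallers.
interface Mover {
  title: string;
  changePct: number;
}

function splitMovers(all: Mover[], limit = 10): { risers: Mover[]; fallers: Mover[] } {
  return {
    risers: all.filter((m) => m.changePct > 0).slice(0, limit),
    fallers: all.filter((m) => m.changePct < 0).slice(0, limit),
  };
}
```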

export async function buildRetailerSpreadSnapshot(
  marketCode: string,
  basketSlug: string,
): Promise<WMRetailerSpreadSnapshot> {
  const now = Date.now();
  const result = await query<{
    retailer_slug: string;
    retailer_name: string;
    basket_total: string;
    item_count: string;
    currency_code: string;
    freshness_min: string | null;
  }>(
    `SELECT r.slug AS retailer_slug, r.name AS retailer_name, r.currency_code,
            SUM(po.price) AS basket_total, COUNT(*) AS item_count,
            EXTRACT(EPOCH FROM (NOW() - MAX(po.observed_at))) / 60 AS freshness_min
     FROM baskets b
     JOIN basket_items bi ON bi.basket_id = b.id AND bi.active = true
     JOIN product_matches pm ON pm.basket_item_id = bi.id AND pm.match_status IN ('auto','approved')
     JOIN retailer_products rp ON rp.id = pm.retailer_product_id AND rp.active = true
     JOIN retailers r ON r.id = rp.retailer_id AND r.market_code = $2 AND r.active = true
     JOIN LATERAL (
       SELECT price, observed_at
       FROM price_observations
       WHERE retailer_product_id = rp.id AND in_stock = true
       ORDER BY observed_at DESC LIMIT 1
     ) po ON true
     WHERE b.slug = $1
     GROUP BY r.slug, r.name, r.currency_code
     ORDER BY basket_total ASC`,
    [basketSlug, marketCode],
  );
  const retailers: WMRetailerSpread[] = result.rows.map((r) => ({
    slug: r.retailer_slug,
    name: r.retailer_name,
    basketTotal: parseFloat(r.basket_total),
    deltaVsCheapest: 0,
    deltaVsCheapestPct: 0,
    itemCount: parseInt(r.item_count, 10),
    freshnessMin: r.freshness_min ? parseInt(r.freshness_min, 10) : 0,
  }));
  if (retailers.length > 0) {
    const cheapest = retailers[0].basketTotal;
    for (const r of retailers) {
      r.deltaVsCheapest = Math.round((r.basketTotal - cheapest) * 100) / 100;
      r.deltaVsCheapestPct =
        cheapest > 0 ? Math.round(((r.basketTotal - cheapest) / cheapest) * 100 * 10) / 10 : 0;
    }
  }
  const spreadPct =
    retailers.length >= 2
      ? Math.round(
          ((retailers[retailers.length - 1].basketTotal - retailers[0].basketTotal) /
            retailers[0].basketTotal) *
            100 *
            10,
        ) / 10
      : 0;
  return {
    marketCode,
    asOf: String(now),
    basketSlug,
    currencyCode: result.rows[0]?.currency_code ?? 'USD',
    retailers,
    spreadPct,
    upstreamUnavailable: false,
  };
}
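The delta-vs-cheapest and spread math above, isolated as a sketch (hypothetical `SpreadRow`/`applySpread` names; assumes rows arrive sorted by basket total ascending, as the `ORDER BY basket_total ASC` guarantees):

```typescript
// Illustrative spread computation: mutate each row with its delta vs the
// cheapest retailer, and return the overall cheapest-to-priciest spread pct.
interface SpreadRow {
  slug: string;
  basketTotal: number;
  deltaVsCheapest: number;
  deltaVsCheapestPct: number;
}

function applySpread(rows: SpreadRow[]): number {
  if (rows.length === 0) return 0;
  const cheapest = rows[0].basketTotal; // rows pre-sorted ascending
  for (const r of rows) {
    r.deltaVsCheapest = Math.round((r.basketTotal - cheapest) * 100) / 100;
    r.deltaVsCheapestPct =
      cheapest > 0 ? Math.round(((r.basketTotal - cheapest) / cheapest) * 100 * 10) / 10 : 0;
  }
  // Spread needs at least two retailers to be meaningful.
  return rows.length >= 2
    ? Math.round(((rows[rows.length - 1].basketTotal - cheapest) / cheapest) * 100 * 10) / 10
    : 0;
}
```

For example, totals of 100 and 112 yield a 12.0% spread and a 12.00 delta for the pricier retailer.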

export async function buildFreshnessSnapshot(marketCode: string): Promise<WMFreshnessSnapshot> {
  const now = Date.now();
  const result = await query<{
    slug: string;
    name: string;
    last_run_at: Date | null;
    last_run_status: string | null;
    parse_success_rate: string | null;
    freshness_min: string | null;
  }>(
    `SELECT r.slug, r.name,
            dsh.last_successful_run_at AS last_run_at,
            dsh.last_run_status,
            dsh.parse_success_rate,
            EXTRACT(EPOCH FROM (NOW() - dsh.last_successful_run_at)) / 60 AS freshness_min
     FROM retailers r
     LEFT JOIN data_source_health dsh ON dsh.retailer_id = r.id
     WHERE r.market_code = $1 AND r.active = true`,
    [marketCode],
  );
  const retailers: WMRetailerFreshness[] = result.rows.map((r) => ({
    slug: r.slug,
    name: r.name,
    lastRunAt: r.last_run_at ? r.last_run_at.toISOString() : '',
    status: r.last_run_status ?? 'unknown',
    parseSuccessRate: r.parse_success_rate ? parseFloat(r.parse_success_rate) : 0,
    freshnessMin: r.freshness_min ? parseInt(r.freshness_min, 10) : 0,
  }));
  const freshnessValues = retailers.map((r) => r.freshnessMin).filter((v) => v > 0);
  const overallFreshnessMin =
    freshnessValues.length > 0
      ? Math.round(freshnessValues.reduce((a, b) => a + b, 0) / freshnessValues.length)
      : 0;
  // freshnessMin === 0 generally means no successful run on record;
  // > 240 means the newest data is more than four hours old.
  const stalledCount = retailers.filter((r) => r.freshnessMin === 0 || r.freshnessMin > 240).length;
  return {
    marketCode,
    asOf: String(now),
    retailers,
    overallFreshnessMin,
    stalledCount,
    upstreamUnavailable: false,
  };
}
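The roll-up at the end of this builder is easy to exercise in isolation; a sketch under assumed names (`rollUpFreshness` is illustrative, not exported by the module):

```typescript
// Illustrative freshness roll-up: average lag across retailers that have
// reported at least once (lag > 0), plus a stalled count for retailers
// with no recorded run (0) or data older than four hours (> 240 min).
function rollUpFreshness(lagsMin: number[]): { overallMin: number; stalled: number } {
  const reported = lagsMin.filter((v) => v > 0);
  const overallMin =
    reported.length > 0
      ? Math.round(reported.reduce((a, b) => a + b, 0) / reported.length)
      : 0;
  const stalled = lagsMin.filter((v) => v === 0 || v > 240).length;
  return { overallMin, stalled };
}

console.log(rollUpFreshness([30, 90, 0, 300])); // { overallMin: 140, stalled: 2 }
```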

export async function buildBasketSeriesSnapshot(
  marketCode: string,
  basketSlug: string,
  range: string,
): Promise<WMBasketSeriesSnapshot> {
  const now = Date.now();
  const days = parseInt(range.replace('d', ''), 10) || 30;
  const [essResult, valResult, currencyResult] = await Promise.all([
    query<{ metric_date: Date; metric_value: string }>(
      `SELECT ci.metric_date, ci.metric_value
       FROM computed_indices ci
       JOIN baskets b ON b.id = ci.basket_id
       WHERE b.slug = $1 AND b.market_code = $2
         AND ci.metric_key = 'essentials_index'
         AND ci.retailer_id IS NULL AND ci.category IS NULL
         AND ci.metric_date >= CURRENT_DATE - ($3 || ' days')::INTERVAL
       ORDER BY ci.metric_date ASC`,
      [basketSlug, marketCode, days],
    ),
    query<{ metric_date: Date; metric_value: string }>(
      `SELECT ci.metric_date, ci.metric_value
       FROM computed_indices ci
       JOIN baskets b ON b.id = ci.basket_id
       WHERE b.slug = $1 AND b.market_code = $2
         AND ci.metric_key = 'value_index'
         AND ci.retailer_id IS NULL AND ci.category IS NULL
         AND ci.metric_date >= CURRENT_DATE - ($3 || ' days')::INTERVAL
       ORDER BY ci.metric_date ASC`,
      [basketSlug, marketCode, days],
    ),
    query<{ currency_code: string }>(
      `SELECT currency_code FROM retailers WHERE market_code = $1 AND active = true LIMIT 1`,
      [marketCode],
    ),
  ]);
  return {
    marketCode,
    basketSlug,
    asOf: String(now),
    currencyCode: currencyResult.rows[0]?.currency_code ?? 'USD',
    range,
    essentialsSeries: essResult.rows.map((r) => ({
      date: r.metric_date.toISOString().slice(0, 10),
      index: Math.round(parseFloat(r.metric_value) * 10) / 10,
    })),
    valueSeries: valResult.rows.map((r) => ({
      date: r.metric_date.toISOString().slice(0, 10),
      index: Math.round(parseFloat(r.metric_value) * 10) / 10,
    })),
    upstreamUnavailable: false,
  };
}
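The range-string handling mirrors this one-liner (hypothetical `parseRangeDays` name, shown for clarity): `"7d"`, `"30d"`, `"90d"` become day counts, and anything unparseable falls back to 30 days.

```typescript
// Illustrative range parser matching the `parseInt(range.replace('d', ''), 10) || 30`
// expression above: NaN from a malformed range coerces to the 30-day default.
function parseRangeDays(range: string): number {
  return parseInt(range.replace('d', ''), 10) || 30;
}

console.log(parseRangeDays('7d')); // 7
console.log(parseRangeDays('oops')); // 30
```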

export interface WMCategoriesSnapshot {
  marketCode: string;
  asOf: string;
  range: string;
  categories: WMCategorySnapshot[];
  upstreamUnavailable: boolean;
}

export async function buildCategoriesSnapshot(marketCode: string, range: string): Promise<WMCategoriesSnapshot> {
  const now = Date.now();
  const basketIdResult = await query<{ id: string }>(
    `SELECT b.id FROM baskets b WHERE b.market_code = $1 LIMIT 1`,
    [marketCode],
  );
  const basketId = basketIdResult.rows[0]?.id ?? null;
  const categories = basketId ? await buildTopCategories(basketId) : [];
  return {
    marketCode,
    asOf: String(now),
    range,
    categories,
    upstreamUnavailable: false,
  };
}

import { describe, it, expect } from 'vitest';
import { scoreMatch, bestMatch } from '../../src/matchers/canonical.js';
import type { CanonicalProduct } from '../../src/db/models.js';

const baseCanonical: CanonicalProduct = {
  id: 'c1',
  canonicalName: 'Basmati Rice 1kg',
  brandNorm: 'Tilda',
  category: 'rice',
  variantNorm: null,
  sizeValue: 1000,
  sizeUnit: 'g',
  baseQuantity: 1000,
  baseUnit: 'g',
  active: true,
  createdAt: new Date(),
};

describe('scoreMatch', () => {
  it('gives high score for exact brand+category+size match', () => {
    const result = scoreMatch(
      { rawTitle: 'Tilda Basmati Rice 1kg', rawBrand: 'Tilda', rawSizeText: '1kg', categoryText: 'rice' },
      baseCanonical,
    );
    expect(result.score).toBeGreaterThanOrEqual(85);
    expect(result.status).toBe('auto');
  });
  it('gives review score for partial match (no brand)', () => {
    const result = scoreMatch(
      { rawTitle: 'Basmati Rice 1kg', rawBrand: null, rawSizeText: '1kg', categoryText: 'rice' },
      baseCanonical,
    );
    expect(result.score).toBeGreaterThanOrEqual(50);
  });
  it('rejects clearly wrong product', () => {
    const result = scoreMatch(
      { rawTitle: 'Sunflower Oil 2L', rawBrand: 'Generic', rawSizeText: '2L', categoryText: 'oil' },
      baseCanonical,
    );
    expect(result.status).toBe('reject');
  });
});

describe('bestMatch', () => {
  it('returns null when no candidates', () => {
    expect(bestMatch({ rawTitle: 'Eggs 12 Pack' }, [])).toBeNull();
  });
  it('returns best scoring non-reject match', () => {
    const result = bestMatch(
      { rawTitle: 'Tilda Basmati Rice 1kg', rawBrand: 'Tilda', rawSizeText: '1kg', categoryText: 'rice' },
      [baseCanonical],
    );
    expect(result?.canonicalProductId).toBe('c1');
    expect(result?.score).toBeGreaterThanOrEqual(85);
  });
});
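These tests pin down the score-to-status bands without showing the scorer itself; the implied thresholding (a sketch inferred from the assertions, not the actual `scoreMatch` internals) looks like:

```typescript
// Inferred status bands from the test expectations above:
// score >= 85 auto-matches, >= 50 goes to review, anything lower is rejected.
type MatchStatus = 'auto' | 'review' | 'reject';

function statusForScore(score: number): MatchStatus {
  if (score >= 85) return 'auto';
  if (score >= 50) return 'review';
  return 'reject';
}
```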

import { describe, it, expect } from 'vitest';
import { parseSize, unitPrice } from '../../src/normalizers/size.js';

describe('parseSize', () => {
  it('parses simple gram weights', () => {
    const r = parseSize('500g');
    expect(r?.baseQuantity).toBe(500);
    expect(r?.baseUnit).toBe('g');
    expect(r?.packCount).toBe(1);
  });
  it('parses kilograms and converts to grams', () => {
    const r = parseSize('1kg');
    expect(r?.baseQuantity).toBe(1000);
    expect(r?.baseUnit).toBe('g');
  });
  it('parses multi-pack patterns (2x200g)', () => {
    const r = parseSize('2x200g');
    expect(r?.packCount).toBe(2);
    expect(r?.sizeValue).toBe(200);
    expect(r?.baseQuantity).toBe(400);
  });
  it('parses multi-pack with × symbol', () => {
    const r = parseSize('6×1L');
    expect(r?.packCount).toBe(6);
    expect(r?.baseQuantity).toBe(6000);
    expect(r?.baseUnit).toBe('ml');
  });
  it('parses litre variants', () => {
    expect(parseSize('1L')?.baseQuantity).toBe(1000);
    expect(parseSize('1.5l')?.baseQuantity).toBe(1500);
    expect(parseSize('500ml')?.baseQuantity).toBe(500);
  });
  it('parses count units', () => {
    const r = parseSize('12 rolls');
    expect(r?.baseQuantity).toBe(12);
    expect(r?.baseUnit).toBe('ct');
  });
  it('parses piece counts', () => {
    const r = parseSize('24 pcs');
    expect(r?.baseQuantity).toBe(24);
  });
  it('returns null for unparseable text', () => {
    expect(parseSize('large')).toBeNull();
    expect(parseSize(null)).toBeNull();
    expect(parseSize('')).toBeNull();
  });
  it('computes unit price correctly', () => {
    const size = parseSize('1kg')!;
    const up = unitPrice(10, size);
    expect(up).toBeCloseTo(0.01); // 10 AED per 1000g = 0.01 per g
  });
});
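For readers without the source of `src/normalizers/size.ts` at hand, here is a plausible regex-based sketch inferred from these tests — the exported names match the tested API, but the internals are guessed, not the actual implementation:

```typescript
// Assumed shape of a parsed size, inferred from the test assertions.
interface ParsedSize {
  packCount: number;
  sizeValue: number;
  sizeUnit: string;
  baseQuantity: number;
  baseUnit: 'g' | 'ml' | 'ct';
}

// Sketch: optional "Nx" multi-pack prefix, a number, then a unit token.
function parseSize(text: string | null): ParsedSize | null {
  if (!text) return null;
  const t = text.trim().toLowerCase().replace('×', 'x');
  const m = t.match(/^(?:(\d+)\s*x\s*)?(\d+(?:\.\d+)?)\s*(kg|g|l|ml|rolls?|pcs?)$/);
  if (!m) return null;
  const packCount = m[1] ? parseInt(m[1], 10) : 1;
  const value = parseFloat(m[2]);
  const unit = m[3];
  let baseUnit: 'g' | 'ml' | 'ct';
  let perPack: number;
  if (unit === 'kg') { baseUnit = 'g'; perPack = value * 1000; }
  else if (unit === 'g') { baseUnit = 'g'; perPack = value; }
  else if (unit === 'l') { baseUnit = 'ml'; perPack = value * 1000; }
  else if (unit === 'ml') { baseUnit = 'ml'; perPack = value; }
  else { baseUnit = 'ct'; perPack = value; } // rolls / pcs
  return { packCount, sizeValue: value, sizeUnit: unit, baseQuantity: packCount * perPack, baseUnit };
}

// Price per base unit: e.g. 10 AED for 1kg → 0.01 per gram.
function unitPrice(price: number, size: ParsedSize): number {
  return price / size.baseQuantity;
}
```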

import { describe, it, expect } from 'vitest';
import { cleanTitle, tokenOverlap } from '../../src/normalizers/title.js';

describe('cleanTitle', () => {
  it('strips promo tokens', () => {
    const r = cleanTitle('Fresh Organic Eggs - SAVE NOW!');
    expect(r).not.toContain('fresh');
    expect(r).not.toContain('save');
    expect(r).toContain('organic');
    expect(r).toContain('eggs');
  });
  it('lowercases and normalizes whitespace', () => {
    const r = cleanTitle('  Basmati   Rice  ');
    expect(r).toBe('basmati rice');
  });
});

describe('tokenOverlap', () => {
  it('returns 1 for identical titles', () => {
    expect(tokenOverlap('Basmati Rice 1kg', 'Basmati Rice 1kg')).toBe(1);
  });
  it('returns 0 for completely different titles', () => {
    expect(tokenOverlap('Eggs 12 Pack', 'Sunflower Oil 1L')).toBe(0);
  });
  it('returns partial overlap for partial matches', () => {
    const r = tokenOverlap('Basmati Rice 1kg Pack', 'Basmati Rice Premium');
    expect(r).toBeGreaterThan(0.5);
  });
});
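A plausible sketch of `tokenOverlap` inferred from these tests (the real `src/normalizers/title.ts` may differ): an overlap coefficient — shared tokens divided by the smaller token set — which satisfies the partial-match assertion (2 shared of min 3 tokens ≈ 0.67 > 0.5) where plain Jaccard over the union (2 of 5) would not.

```typescript
// Illustrative overlap coefficient over lowercase whitespace tokens:
// |A ∩ B| / min(|A|, |B|), yielding 1 for identical titles and 0 for disjoint ones.
function tokenOverlap(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (ta.size === 0 || tb.size === 0) return 0;
  let common = 0;
  for (const t of ta) if (tb.has(t)) common++;
  return common / Math.min(ta.size, tb.size);
}
```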

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ESNext",
    "moduleResolution": "bundler",
    "lib": ["ES2022"],
    "outDir": "dist",
    "rootDir": "src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "resolveJsonModule": true,
    "declaration": true,
    "sourceMap": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules", "dist", "tests"]
}