Building a Price Comparison Engine with a Structured Product API

Why price comparison is mostly an identification problem, not a pricing one — and how to build a comparison engine on top of a structured product API instead of a scraping fleet.

Why price comparison is harder than it looks

A price comparison engine sounds like a weekend project. Pull prices from a few retailers, sort ascending, ship.

Anyone who has actually tried this knows the truth: the hard part isn't the sort. It's matching the same product across retailers that name it differently, photograph it differently, and bury its specs three clicks deep. It's deciding what counts as "the same" when one retailer sells a bundle and another sells the bare unit. It's keeping the data current without operating a scraping fleet on the side.

This article walks through how to build a price comparison engine backed by a structured product API: what to retrieve, how to match products, where structured data buys you the most leverage, and what you should not try to build yourself.

The four problems of any comparison engine

Strip away the UI and a price comparison product reduces to four jobs:

Product identification — given a user query, decide which product they actually mean.
Reference pricing — know roughly what this product costs on the market.
Retailer linking — surface where it's sold and let the user click through.
Differentiation — explain why two listings that look the same are actually different (size, color, bundle, generation).

A naive scraper-based stack tries to solve all four at once by crawling every retailer. An approach built on a structured product API separates them: use a typed product API for identification, reference pricing, and differentiation, and layer retailer-specific feeds on top only for the listings that matter most to you.

Start with product identification, not pricing

Most teams start by scraping prices. That's backwards.

If you can't reliably answer "is this the same product?", every other piece falls apart. Two listings titled Sony WH-1000XM5 Headphones and Sony WH-1000XM5 Wireless Noise Cancelling, Black are the same product. Sony WH-1000XM4 is not. Sony WH-1000XM5 Refurbished + Case Bundle is debatable.

A structured product API gives you a canonical product identity — brand, model, category, specs — that you can hash, dedupe, and match against. Once you have that, retailer listings are just pointers to the same canonical entity.

A typical identification request:

curl "https://productapi.dev/api?search=Sony+WH-1000XM5&fields=brand,model,category,bluetooth_version,anc,battery_life_hours,weight_grams" \
  -H "X-API-Key: your-api-key"

You get back a typed object you can store as the canonical record. Every retailer listing your engine ingests later links to this record, not to its own scraped name.
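One way to make that canonical record hashable is to derive a stable key from brand and model. The sketch below is an illustration, not the API's own identifier scheme: the `CanonicalProduct` shape and the `canonicalKey` helper are assumptions introduced here, and the normalization is ASCII-only, so brands with accented characters would need a Unicode-aware variant.

```typescript
// Hypothetical shape of the canonical record. Field names mirror the
// `fields` parameter in the request above; the real response may differ.
interface CanonicalProduct {
  brand: string;
  model: string;
  category?: string;
}

// Lowercase and strip everything except ASCII letters and digits, so that
// "WH-1000XM5", "wh 1000xm5" and "WH1000XM5" collapse to the same token.
function normalizeToken(value: string): string {
  return value.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Build a stable key that can serve as a dedupe key or a join key
// between retailer listings and the canonical record.
function canonicalKey(product: Pick<CanonicalProduct, "brand" | "model">): string {
  return `${normalizeToken(product.brand)}:${normalizeToken(product.model)}`;
}

const keyFromApi = canonicalKey({ brand: "Sony", model: "WH-1000XM5" });
const keyFromListing = canonicalKey({ brand: " SONY ", model: "wh 1000xm5" });
const keyOlderModel = canonicalKey({ brand: "Sony", model: "WH-1000XM4" });
// keyFromApi and keyFromListing are both "sony:wh1000xm5";
// keyOlderModel is "sony:wh1000xm4", so the XM4 stays a separate product.
```

Storing this key next to every ingested listing turns "is this the same product?" into a string comparison for the listings that carry clean brand and model fields.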

Reference prices vs live prices

Be honest with yourself about what kind of price comparison you're building.

A reference price answers "what does this typically cost?" It's a market-level number — useful for sorting, filtering by budget, detecting outliers, building "compare to street price" UIs, and giving users a sanity check before they click out.

A live price answers "what is retailer X charging right now?" That's a different problem — you need direct retailer feeds, affiliate APIs, or a maintained scraping pipeline, with the cache invalidation and proxy budget that implies.

Most comparison engines need both, but in different proportions than teams initially assume. The reference price is the heavy-lifting layer: it drives most of the UX (sorting, filtering, context). Live prices are the thin top layer for the retailers you've genuinely partnered with.

A product pricing API that returns a reference price covers the heavy-lifting layer with a single call:

curl "https://productapi.dev/api?search=Dyson+V15+Detect&country=FR&currency=EUR&fields=brand,model,price_eur,category" \
  -H "X-API-Key: your-api-key"

That number is what powers "sort by price" across your entire catalog. Add live retailer prices on top only where the volume justifies the integration cost.
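The reference price is also what makes outlier detection cheap. As a rough sketch, each listing can be flagged by how far it sits from the reference; the function names and the threshold values below are illustrative assumptions and would need tuning per category.

```typescript
type PriceFlag = "suspiciously_low" | "in_range" | "above_market";

// Relative distance from the reference price: -0.25 means 25% below it.
function deltaFromReference(listingPrice: number, referencePrice: number): number {
  if (referencePrice <= 0) {
    throw new Error("referencePrice must be positive");
  }
  return (listingPrice - referencePrice) / referencePrice;
}

// Default thresholds are placeholders, not recommendations:
// more than 40% below reference is flagged, more than 25% above is flagged.
function flagListing(
  listingPrice: number,
  referencePrice: number,
  lowThreshold = -0.4,
  highThreshold = 0.25,
): PriceFlag {
  const delta = deltaFromReference(listingPrice, referencePrice);
  if (delta < lowThreshold) return "suspiciously_low";
  if (delta > highThreshold) return "above_market";
  return "in_range";
}

// With a reference price of 379:
const typicalListing = flagListing(349, 379);  // "in_range"
const tooCheapListing = flagListing(150, 379); // "suspiciously_low"
const priceyListing = flagListing(520, 379);   // "above_market"
```

A "suspiciously low" flag is often a matching error rather than a bargain (an accessory matched to the main product, for instance), so it is worth routing to review instead of straight to the top of the sort.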

Matching across retailer listings

Once you have canonical products, ingesting retailer listings becomes a matching problem. For each ingested listing, you need to decide: which canonical product does this point to?

A few patterns that work:

Match by brand + model

The most widely available signal. If a listing says "Sony" and "WH-1000XM5", and your canonical record says the same, you have a match. This typically handles 60–70% of retailer listings without any fuzzy logic.

Match by GTIN/UPC/EAN when available

Some retailers expose barcodes; most don't. When they do, it's the strongest signal you'll get. Store the barcode on the canonical record once you've seen it from any trusted source, and use it as a fast-path lookup for future listings.
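Before storing a barcode on the canonical record, it is worth validating it, since retailer feeds sometimes carry truncated or mistyped codes. The sketch below applies the standard GS1 mod-10 check digit to GTIN-8, UPC-A, EAN-13, and GTIN-14 values; the helper names `isValidGtin` and `toGtin14` are introduced here for illustration.

```typescript
// Validate a GTIN-8, UPC-A (12 digits), EAN-13 or GTIN-14 with the GS1
// mod-10 check: walking right to left, the check digit has weight 1 and
// the weights then alternate 3, 1, 3, ... The weighted sum must be a
// multiple of 10.
function isValidGtin(raw: string): boolean {
  const digits = raw.replace(/[\s-]/g, "");
  if (!/^(\d{8}|\d{12}|\d{13}|\d{14})$/.test(digits)) return false;

  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    const digit = digits.charCodeAt(digits.length - 1 - i) - 48; // "0" is char code 48
    sum += digit * (i % 2 === 0 ? 1 : 3);
  }
  return sum % 10 === 0;
}

// Left-pad to 14 digits so the UPC-A and EAN-13 forms of the same code
// compare equal when used as a lookup key. Returns null for invalid input.
function toGtin14(raw: string): string | null {
  if (!isValidGtin(raw)) return null;
  const digits = raw.replace(/[\s-]/g, "");
  return ("00000000000000" + digits).slice(-14);
}

const upcForm = toGtin14("036000291452");  // 12-digit UPC-A
const eanForm = toGtin14("0036000291452"); // same code written as EAN-13
// Both normalize to "00036000291452".
```

Normalizing to 14 digits before the lookup means a listing that exposes the 12-digit form still hits a record that was stored from a 13-digit source.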

Fall back to embeddings or fuzzy match

For the long tail (listings without clean brand/model strings), compute an embedding from the listing title and match against canonical product names. This is where you'll spend most of your match-quality budget.

A product matching API that returns canonical brand and model already structured saves you the embedding step for the easy 60–70%; you only fall back to fuzzy matching for the residual.
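The three patterns can be chained into one cascade. The sketch below tries the barcode first (it is the strongest signal when present), then normalized brand + model, then a fuzzy fallback. To stay self-contained it swaps the embedding step for plain token-overlap (Jaccard) similarity, which is a much cruder technique; all type names, the sample catalog, and the 0.5 threshold are assumptions made for illustration.

```typescript
interface CanonicalRecord {
  id: string;
  brand: string;
  model: string;
  name: string;
  gtin?: string;
}

interface RetailerListing {
  title: string;
  brand?: string;
  model?: string;
  gtin?: string;
}

type MatchResult =
  | { id: string; method: "gtin" | "brand_model" | "fuzzy"; score: number }
  | null;

const squash = (s: string): string => s.toLowerCase().replace(/[^a-z0-9]/g, "");

const tokenize = (s: string): Set<string> =>
  new Set(s.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0));

// Jaccard similarity over title tokens: |A ∩ B| / |A ∪ B|.
// A stand-in for embedding similarity, not an equivalent of it.
function jaccard(a: Set<string>, b: Set<string>): number {
  let shared = 0;
  a.forEach((t) => {
    if (b.has(t)) shared++;
  });
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

function matchListing(
  listing: RetailerListing,
  catalog: CanonicalRecord[],
  minFuzzyScore = 0.5,
): MatchResult {
  // 1. Barcode fast path.
  if (listing.gtin) {
    const hit = catalog.find((p) => p.gtin === listing.gtin);
    if (hit) return { id: hit.id, method: "gtin", score: 1 };
  }
  // 2. Exact brand + model after normalization.
  if (listing.brand && listing.model) {
    const key = squash(listing.brand) + ":" + squash(listing.model);
    const hit = catalog.find((p) => squash(p.brand) + ":" + squash(p.model) === key);
    if (hit) return { id: hit.id, method: "brand_model", score: 1 };
  }
  // 3. Fuzzy fallback: best-scoring canonical name above the threshold.
  const titleTokens = tokenize(listing.title);
  let best: MatchResult = null;
  for (const p of catalog) {
    const score = jaccard(titleTokens, tokenize(p.name));
    if (score >= minFuzzyScore && (best === null || score > best.score)) {
      best = { id: p.id, method: "fuzzy", score };
    }
  }
  return best;
}

const sampleCatalog: CanonicalRecord[] = [
  {
    id: "p1",
    brand: "Sony",
    model: "WH-1000XM5",
    name: "Sony WH-1000XM5 Wireless Noise Cancelling Headphones",
    gtin: "0000000000000", // placeholder value, not a real barcode
  },
  {
    id: "p2",
    brand: "Sony",
    model: "WH-1000XM4",
    name: "Sony WH-1000XM4 Wireless Noise Cancelling Headphones",
  },
];

const fuzzyHit = matchListing(
  { title: "Sony WH-1000XM5 Wireless Noise Cancelling, Black" },
  sampleCatalog,
);
// fuzzyHit is { id: "p1", method: "fuzzy", score: 0.75 }
```

On that same title the XM4 record still scores about 0.56, above the threshold. Only the best-score rule keeps it out, which is why a production matcher would likely also require a margin between the top two candidates, or treat the model token as mandatory.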

What goes in a comparison row

A typical comparison engine row needs more than a price. Done well, it shows:

Canonical name — the same name across retailers, not whatever each retailer typed
Canonical hero image — same product, same shot, so scanning is fast
Reference price + retailer-specific prices — context and current cost
Key differentiating specs — the 3–5 specs that matter for this category (battery life for headphones, screen size for laptops, etc.)
Variant warnings — "this is the 512GB version, the £899 listing is 256GB"

The canonical name, hero image, reference price, and key specs come straight from a structured product API; retailer-specific prices come from your own feeds. The variant warning is what separates a serious comparison engine from a list of links, and it requires that your canonical record carries the variant-relevant fields (storage, color, bundle status) explicitly.
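A variant warning is essentially a field-by-field diff between the canonical record and the listing. Here is a minimal sketch, assuming both sides expose the variant fields under the hypothetical names below (`storage_gb`, `color`, `is_bundle`, `is_refurbished`); a real schema may name or type them differently.

```typescript
// Variant-relevant fields. The names are illustrative, not a documented schema.
interface VariantFields {
  storage_gb?: number;
  color?: string;
  is_bundle?: boolean;
  is_refurbished?: boolean;
}

// Compare a retailer listing against the canonical record and return
// human-readable warnings for every variant field that differs.
// Fields missing on either side are skipped rather than guessed.
function variantWarnings(canonical: VariantFields, listing: VariantFields): string[] {
  const warnings: string[] = [];

  if (
    canonical.storage_gb !== undefined &&
    listing.storage_gb !== undefined &&
    canonical.storage_gb !== listing.storage_gb
  ) {
    warnings.push(
      `Storage differs: this listing is ${listing.storage_gb}GB, the compared product is ${canonical.storage_gb}GB`,
    );
  }
  if (
    canonical.color &&
    listing.color &&
    canonical.color.toLowerCase() !== listing.color.toLowerCase()
  ) {
    warnings.push(`Color differs: ${listing.color} vs ${canonical.color}`);
  }
  if (listing.is_bundle && !canonical.is_bundle) {
    warnings.push("This listing is a bundle, not the bare unit");
  }
  if (listing.is_refurbished && !canonical.is_refurbished) {
    warnings.push("This listing is refurbished");
  }
  return warnings;
}

const sampleWarnings = variantWarnings(
  { storage_gb: 512, color: "Black" },
  { storage_gb: 256, color: "black", is_bundle: true },
);
// Two warnings: the storage mismatch and the bundle notice.
// The color matches once case is ignored, so it produces no warning.
```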

Localized comparison

Price comparison is locale-sensitive in ways that catch teams off guard:

Currency — €379 and £329 are not directly comparable; you need both displayed and ideally one normalized to the user's currency.
Tax inclusion — EU consumer prices are normally shown with VAT included; U.S. prices are usually shown before sales tax. Comparing across these is a UX trap.
Availability — a great price from a retailer that doesn't ship to the user's country is useless.
Language — product names should display in the user's language, not whatever the retailer used.

A localized product API handles the first and last natively — pass country, lang, and currency on every request and the canonical record comes back already in the user's locale. Tax convention and availability are still on you, but you've eliminated the two most embarrassing failure modes.

curl "https://productapi.dev/api?search=lave-vaisselle+encastrable+Bosch&country=FR&lang=fr&currency=EUR" \
  -H "X-API-Key: your-api-key"
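For the parts that stay on you, one option is to reduce every price to a pre-tax amount in the user's currency before comparing. The sketch below assumes you supply the exchange rates and the tax rate baked into each price yourself; the `LocalPrice` shape, the `toComparable` helper, and the rate used in the example are placeholders, not live data.

```typescript
interface LocalPrice {
  amount: number;
  currency: string;        // ISO 4217 code, e.g. "EUR"
  includedTaxRate: number; // tax already inside `amount`: 0.2 for 20% VAT, 0 if shown pre-tax
}

// Convert a displayed price to a pre-tax amount in the target currency.
// `ratesToTarget[code]` is how many target-currency units one unit of
// `code` buys; the caller is responsible for keeping these rates fresh.
function toComparable(
  price: LocalPrice,
  targetCurrency: string,
  ratesToTarget: Record<string, number>,
): number {
  const rate = price.currency === targetCurrency ? 1 : ratesToTarget[price.currency];
  if (rate === undefined) {
    throw new Error(`No exchange rate for ${price.currency}`);
  }
  const net = price.amount / (1 + price.includedTaxRate);
  return Math.round(net * rate * 100) / 100; // round to cents
}

// A VAT-inclusive 120.00 EUR price and a pre-tax 200.00 USD price,
// compared in EUR with a made-up rate of 0.5 EUR per USD.
const euNet = toComparable({ amount: 120, currency: "EUR", includedTaxRate: 0.2 }, "EUR", {});
const usNet = toComparable({ amount: 200, currency: "USD", includedTaxRate: 0 }, "EUR", { USD: 0.5 });
// euNet is 100 and usNet is 100: equal once tax and currency are normalized.
```

Whether to compare pre-tax or tax-inclusive amounts is a product decision; the point is to pick one convention and apply it to every row before sorting.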

What not to build yourself

A few categories of work a comparison engine should outsource ruthlessly:

Product canonicalization

Don't build a brand/model normalizer. Names, capitalizations, and SKU formats are messier than you think, and the long tail is endless. Use a structured product API that returns canonical fields.

Hero image selection

Don't write a "best image" picker. Use the canonical image from the product API. Retailer-specific images can be displayed next to the canonical one, but the canonical version is what users see in scan view.

Category taxonomy

Don't invent your own product taxonomy if a product API already exposes one that's good enough for your domain. Map into a third-party taxonomy if you must; don't build a parallel one.

Specs extraction

Don't write per-retailer spec extractors. They rot. Use a product specs API that returns typed specs for the canonical product, and only fall back to retailer-specific extraction for fields the canonical record doesn't cover.
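The fallback rule can be expressed as a small merge in which canonical values always win and retailer-extracted values only fill the gaps. This is a sketch under assumed names (`Specs`, `mergeSpecs`, and the spec keys in the example), not a prescribed schema.

```typescript
type SpecValue = string | number | boolean;
type Specs = Record<string, SpecValue | null | undefined>;

// Merge typed canonical specs with retailer-extracted ones.
// Retailer values are written first, then canonical values overwrite them,
// so a retailer value survives only where the canonical record has nothing.
function mergeSpecs(canonical: Specs, retailerExtracted: Specs): Record<string, SpecValue> {
  const merged: Record<string, SpecValue> = {};
  for (const key of Object.keys(retailerExtracted)) {
    const value = retailerExtracted[key];
    if (value !== null && value !== undefined) merged[key] = value;
  }
  for (const key of Object.keys(canonical)) {
    const value = canonical[key];
    if (value !== null && value !== undefined) merged[key] = value;
  }
  return merged;
}

const mergedSpecs = mergeSpecs(
  { battery_life_hours: 30, weight_grams: null },
  { battery_life_hours: 28, weight_grams: 250, case_included: true },
);
// battery_life_hours stays 30 (canonical wins); weight_grams and
// case_included are filled in from the retailer side.
```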

What you do build

The defensible parts of a comparison engine are:

Your retailer partnerships and live-price coverage — that's leverage you own.
Your UX — how comparison rows are laid out, what specs you surface, how you handle variants.
Your trust and editorial layer — review aggregation, fraud detection, your own benchmarks.
Your audience and SEO surface — the category pages, the buying guides, the freshness of your catalog.

Spend your engineering budget on those. Spend none of it on yet another normalizer for headphone names.

A concrete data flow

For a comparison page, a clean flow looks like:

User lands on /compare/sony-wh-1000xm5/.
Your route fetches the canonical product record (structured product API).
Your route joins against your retailer listings table on canonical product ID.
Live prices are refreshed from the retailers you have direct integration with (your own job, your own cache).
The page renders: canonical name + image + specs + reference price + retailer-by-retailer live prices.

The product API call is one request. The rest is your business logic. That's the split you want.

A practical example

A minimal comparison endpoint in TypeScript:

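// `db` stands for your own data layer; it is declared here only so the example type-checks.
declare const db: {
  listings: { findByCanonicalKey(key: { brand: string; model: string }): Promise<unknown[]> };
};

type Locale = { country: string; lang: string; currency: string };

async function getComparisonRow(query: string, locale: Locale) {
  // Assumption: price fields follow the price_<currency> pattern used in the
  // requests above (price_eur, price_usd), so derive the name from the locale.
  const priceField = `price_${locale.currency.toLowerCase()}`;

  const params = new URLSearchParams({
    search: query,
    country: locale.country,
    lang: locale.lang,
    currency: locale.currency,
    fields: `brand,model,category,image,${priceField},anc,battery_life_hours,weight_grams`,
  });

  const res = await fetch(`https://productapi.dev/api?${params}`, {
    headers: { "X-API-Key": process.env.PRODUCT_API_KEY! },
  });
  if (!res.ok) throw new Error(`Product API request failed: ${res.status}`);

  const { products } = await res.json();
  const canonical = products?.[0];
  if (!canonical) return null; // nothing matched this query

  // Join against your own listings table on the canonical brand + model.
  const liveListings = await db.listings.findByCanonicalKey({
    brand: canonical.brand,
    model: canonical.model,
  });

  return {
    canonical,
    reference_price: canonical[priceField],
    listings: liveListings,
  };
}

A few dozen lines. No scraping. No name normalizer. You own the listings table and the UI; everything else is a typed response from a single endpoint.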

Common pitfalls

A few traps that bite teams the first time they ship:

Treating reference price as live price

The reference price is a market-level number. Don't show it as "today's price at MegaShop." Label it as a reference, a typical price, or a market average. Users understand the distinction; lawyers do too.

Not caching by locale

compare:wh-1000xm5 is not a valid cache key. compare:wh-1000xm5:FR:fr:EUR is. Same product, different locales, different responses.
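A small helper keeps the key format in one place, so no call site can forget a locale component. The function name and the casing rules are choices made for this sketch.

```typescript
interface Locale {
  country: string;
  lang: string;
  currency: string;
}

// Build a cache key that varies with every input that changes the response.
// Casing is normalized so "fr" and "FR" cannot produce two cache entries.
function comparisonCacheKey(productSlug: string, locale: Locale): string {
  return [
    "compare",
    productSlug.toLowerCase(),
    locale.country.toUpperCase(),
    locale.lang.toLowerCase(),
    locale.currency.toUpperCase(),
  ].join(":");
}

const frenchKey = comparisonCacheKey("WH-1000XM5", { country: "fr", lang: "FR", currency: "eur" });
// frenchKey is "compare:wh-1000xm5:FR:fr:EUR"
```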

Letting the long tail rot

A comparison engine is judged by its breadth. A page that proudly shows three top products but 404s on the fourth most-searched item in its category is a worse experience than no comparison engine at all. Use a product API that retrieves on demand so your catalog isn't gated by a crawl that happened last quarter.

Sorting by price without dedup

If you have three listings for the same canonical product, "sort by price ascending" should show the cheapest of the three, not all three. Dedup before sort, on canonical product ID.
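Dedup-then-sort fits in a few lines: keep the cheapest listing per canonical product ID, then sort the survivors. The `PricedListing` shape and the retailer names below are placeholders for illustration.

```typescript
interface PricedListing {
  canonicalId: string;
  retailer: string;
  price: number;
}

// Keep only the cheapest listing for each canonical product,
// then sort those winners by ascending price.
function cheapestPerProduct(listings: PricedListing[]): PricedListing[] {
  const best = new Map<string, PricedListing>();
  for (const listing of listings) {
    const current = best.get(listing.canonicalId);
    if (current === undefined || listing.price < current.price) {
      best.set(listing.canonicalId, listing);
    }
  }
  return Array.from(best.values()).sort((x, y) => x.price - y.price);
}

const sortedRows = cheapestPerProduct([
  { canonicalId: "p1", retailer: "ShopA", price: 349 },
  { canonicalId: "p1", retailer: "ShopB", price: 329 },
  { canonicalId: "p2", retailer: "ShopC", price: 299 },
  { canonicalId: "p1", retailer: "ShopD", price: 379 },
]);
// Two rows: p2 at 299 (ShopC), then p1 at 329 (ShopB).
```

The other listings for the same product are not lost; they belong inside the row as the retailer-by-retailer breakdown rather than as separate rows in the sorted list.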

Try it

A one-request comparison row, end to end:

curl "https://productapi.dev/api?search=Apple+MacBook+Air+M3+13&country=US&currency=USD&fields=brand,model,price_usd,ram_gb,storage_gb,screen_inches,weight_kg" \
  -H "X-API-Key: your-api-key"

Get an API key — 20 free credits, no card required.

TL;DR

A price comparison engine is mostly an identification problem, not a pricing problem.
Use a structured product API for canonical names, images, specs, and reference prices; layer live retailer feeds on top only where you have real partnerships.
Match retailer listings to canonical products by brand + model, by GTIN wherever a barcode is available, and by fuzzy match for the long tail.
Localize country, language, and currency on every request — and cache by all three.
Spend your engineering budget on the parts that differentiate you: UX, partnerships, editorial. Outsource the parts that don't.