Automating Product Catalog Enrichment with a Schema-Defined API

How to turn sparse product records into complete, typed catalog entries — descriptions, images, specs, and custom fields — using a single JSON product API instead of a homegrown stack.

The catalog enrichment problem

Every e-commerce team eventually faces the same wall: you have a list of products that need to go live, and you do not have the data to make their pages useful.

You have a name, maybe a brand, maybe a category. What you don't have is:

A clean, marketing-grade description
A consistent set of technical specs
High-resolution images
A reliable price benchmark
The structured fields your PIM, search index, or recommendation engine actually needs

Filling this gap is product catalog enrichment, and historically it has involved one of three painful options: typing it by hand, scraping competitors' product pages, or buying a flat-file dump from a data vendor and hoping it matches your SKUs. All three are slow and expensive, and all three produce data that goes stale the moment products change.

A schema-defined product API changes the economics. Instead of paying for a batch dump or maintaining a scraping fleet, you call one endpoint with the schema you want and get the data back, typed, in JSON, ready to write to your database.

This article walks through how to actually do that — what to enrich, how to define the schema, and how to integrate it into a PIM or catalog workflow.

What "enrichment" actually means

Enrichment is the process of going from a sparse product record to a complete one. A sparse record might look like this:

{
  "sku": "INTERNAL-9921",
  "name": "Sony WH-1000XM5",
  "category": "headphones"
}

After enrichment, the same record should look something like:

{
  "sku": "INTERNAL-9921",
  "name": "Sony WH-1000XM5",
  "brand": "Sony",
  "category": "headphones",
  "description": "Premium over-ear wireless headphones with industry-leading...",
  "image": "https://...",
  "images": ["https://...", "https://...", "https://..."],
  "specs": {
    "driver_size_mm": 30,
    "anc": true,
    "battery_life_hours": 30,
    "weight_grams": 250,
    "bluetooth_version": "5.2",
    "codecs": ["LDAC", "AAC", "SBC"]
  },
  "price_eur": 379
}

The hard part isn't agreeing that you want this. It's getting it consistently, for every product, across thousands of SKUs, in a shape your system actually expects.

Define the shape first

The biggest mistake teams make when building enrichment pipelines is starting from "what data exists?" instead of "what data do I need?"

A PIM enrichment API that returns whatever happens to be on the product's web page leaves you with the same problem you started with — heterogeneous, untyped, surprising data. You have to write a normalization layer per category, and you have to keep rewriting it as new categories appear.

A schema-defined product data workflow inverts this. You write the schema once, per category, and the API conforms to it:

{
  "headphone_schema": {
    "name": "string",
    "brand": "string",
    "anc": "boolean",
    "driver_size_mm": "number",
    "battery_life_hours": "number",
    "weight_grams": "number",
    "bluetooth_version": "string",
    "price_eur": "number"
  }
}

Every record comes back matching the schema. If a field can't be determined, it's null, not invented. Your database mapping stays stable, your downstream consumers see the same shape every time, and the enrichment job itself shrinks to a single request.
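To make the "typed or null" contract concrete, here is a minimal sketch of a conformance check. The field types mirror the headphone schema above; `schemaViolations` is a hypothetical helper written for this article, not part of any API client.

```typescript
// Field types used in the headphone schema above.
type FieldType = "string" | "number" | "boolean";
type Schema = Record<string, FieldType>;

const headphoneSchema: Schema = {
  name: "string",
  brand: "string",
  anc: "boolean",
  driver_size_mm: "number",
  battery_life_hours: "number",
  weight_grams: "number",
  bluetooth_version: "string",
  price_eur: "number",
};

// Hypothetical helper: returns the names of fields that violate the schema.
// A field passes if it is null (undetermined) or has the declared type.
// Missing fields (undefined) count as violations.
function schemaViolations(record: Record<string, unknown>, schema: Schema): string[] {
  const bad: string[] = [];
  for (const [field, type] of Object.entries(schema)) {
    const value = record[field];
    if (value === null) continue; // null means "could not be determined"
    if (typeof value !== type || (type === "number" && !Number.isFinite(value))) {
      bad.push(field);
    }
  }
  return bad;
}
```

An empty result means the record is safe to map onto your columns; anything else names the exact fields to investigate.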

Custom fields per category

Different products care about different things. A laptop has cores and RAM; a sofa has dimensions and material; a supplement has dosage and ingredient lists. A custom fields product API lets you ask for any of these without writing a new extractor each time.

In practice this means you maintain a small library of category schemas in your repo:

const schemas = {
  laptop: ["cpu", "ram_gb", "storage_gb", "screen_inches", "weight_kg"],
  sofa: ["material", "seat_count", "width_cm", "depth_cm", "color"],
  supplement: ["form", "dosage_mg", "servings", "ingredients[]"],
};

When a new product comes in, you look up the schema for its category, fire the request, and write the typed response straight to your database. No per-category code. No "we'll handle this category next quarter."

Component pieces of enrichment

In practice, "enrichment" decomposes into several smaller jobs. Some teams want all of them; some just want one.

Descriptions

A product description generator API produces marketing-grade prose, grounded in real product information. The important word is "grounded" — generated descriptions that quietly invent features are worse than no description at all. A schema-defined approach gives you the description and the facts behind it, so you can verify the prose against the structured data before it goes live.
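One coarse way to verify prose against structured data is to check that every number the description mentions also appears in the specs. The sketch below assumes specs arrive as a flat object; `unsupportedNumbers` is a hypothetical helper, and it only compares numeric values, so treat it as a first filter rather than a full fact check.

```typescript
// Hypothetical check: every standalone number mentioned in the prose should
// appear among the numeric values in the structured specs.
function unsupportedNumbers(description: string, specs: Record<string, unknown>): number[] {
  const known = new Set<number>();
  for (const value of Object.values(specs)) {
    if (typeof value === "number") known.add(value);
  }
  // \b keeps digits embedded in model names (e.g. "WH-1000XM5") from matching.
  const mentioned: string[] = description.match(/\b\d+(?:\.\d+)?\b/g) ?? [];
  return mentioned.map(Number).filter((n) => !known.has(n));
}
```

A non-empty result does not prove the description is wrong, but it is a cheap signal for routing that description to a human before it goes live.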

Images

A product image API returns one canonical image plus a gallery of additional images. Most product pages render correctly with a single hero shot, but conversion-oriented teams want the full gallery to let users explore. Image URLs come back ready to write straight into your CDN-fronted catalog.

Specifications

A product specs API is the bread-and-butter of category pages. Specs are what users filter on, what comparison tables are made of, and what your search index needs to be useful. Specs are also where consistency matters most: a "battery_life_hours" field that's a string in 80% of records and a number in 20% breaks your filters silently. Typed responses close off this category of bug at the source.
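If you want to know whether an existing catalog already has this problem, a small audit can find it. The following is a sketch, assuming records are plain JSON objects; `mixedTypeFields` is a hypothetical name.

```typescript
// Hypothetical audit: for each field, collect the JavaScript types seen across
// records (null is ignored) and report the fields with more than one type.
function mixedTypeFields(records: Array<Record<string, unknown>>): string[] {
  const seen = new Map<string, Set<string>>();
  for (const record of records) {
    for (const [field, value] of Object.entries(record)) {
      if (value === null) continue;
      const types = seen.get(field) ?? new Set<string>();
      types.add(Array.isArray(value) ? "array" : typeof value);
      seen.set(field, types);
    }
  }
  return Array.from(seen.entries())
    .filter(([, types]) => types.size > 1)
    .map(([field]) => field);
}
```

Run it over a sample of existing records before migrating: any field it reports is one your filters are probably already mishandling.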

Pricing & availability

A reference price is often enough. Exact real-time pricing from a specific retailer is a different problem (and a different API). A typical street price is still valuable for filtering by budget, sorting by price, detecting outliers in your own pricing, and powering "list price" UIs.
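Outlier detection is the easiest of these to sketch. The helper below is hypothetical, and the 25% tolerance is an arbitrary default chosen for illustration; pick a value that suits your margins.

```typescript
// Hypothetical check: flag an own price that deviates from the reference
// price by more than `tolerance` (a fraction: 0.25 means 25%).
function isPriceOutlier(
  ownPrice: number,
  referencePrice: number | null,
  tolerance = 0.25
): boolean {
  // Without a usable reference price there is nothing to compare against.
  if (referencePrice === null || referencePrice <= 0) return false;
  return Math.abs(ownPrice - referencePrice) / referencePrice > tolerance;
}
```

With a reference price of 379, an own price of 399 passes, while 199 is flagged for review.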

Why typed product data matters

Untyped data is the silent killer of catalog pipelines.

Consider a "weight" field. Across sources, you'll see:

"3.4 lbs"
"1.54 kg"
1540 (grams)
"3.4" (string, unit unknown)
null
"approx. 3 to 3.5 kg"

A scraping pipeline that just passes these through has shipped you a bug. Your filter UI says "weight under 2 kg" and silently includes products that don't match, or drops products that do.

Typed product data means the API commits to a contract: weight_kg is a number representing kilograms, or it's null. Whatever conversions, range parsing, or unit harmonization are needed happen on the API side, not in your code. Your database column type is correct. Your filters work. Your sorts are stable.
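For a sense of what that contract saves you, here is a sketch of the normalizer you would otherwise write and maintain yourself. It is illustrative only: how the API performs its conversions is not described here, and `toWeightKg` is a hypothetical helper that deliberately returns null for anything ambiguous.

```typescript
// Conversion factors to kilograms for the units this sketch understands.
const KG_PER_UNIT: Record<string, number> = {
  kg: 1,
  g: 0.001,
  lb: 0.45359237,
  lbs: 0.45359237,
};

// Illustrative normalizer: returns kilograms, or null when the input is
// missing, unitless, or not a single value.
function toWeightKg(raw: unknown): number | null {
  if (typeof raw !== "string") return null; // bare numbers carry no unit
  const match = raw.trim().toLowerCase().match(/^(\d+(?:\.\d+)?)\s*(kg|g|lbs?)$/);
  if (!match) return null; // ranges, "approx.", or a missing unit
  const kg = Number(match[1]) * KG_PER_UNIT[match[2]];
  return Math.round(kg * 1000) / 1000; // round to the nearest gram
}
```

Of the six inputs listed above, only "3.4 lbs" and "1.54 kg" produce a number; the rest come back as null rather than a guess.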

A practical PIM workflow

If you have a PIM (Akeneo, Pimcore, Salsify, or a homegrown system), enrichment usually slots in at one of three points:

Onboarding. New SKU enters the system → enrichment runs → record is created complete.
Backfill. Existing sparse records get re-enriched on a schedule.
Audit. Periodically compare PIM values against fresh enrichment to catch drift.

A typical onboarding handler:

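async function onboardProduct(input: { name: string; category: string }) {
  const fields: string[] | undefined = (schemas as Record<string, string[]>)[input.category];
  if (!fields) throw new Error(`No schema for category "${input.category}"`);
  // Encode each field name; commas stay literal, as in the fields= list.
  const query = `search=${encodeURIComponent(input.name)}&fields=${fields.map(encodeURIComponent).join(",")}`;
  const res = await fetch(`https://productapi.dev/api?${query}`, {
    headers: { "X-API-Key": process.env.PRODUCT_API_KEY! },
  });
  if (!res.ok) throw new Error(`Enrichment request failed: HTTP ${res.status}`);
  const { products } = await res.json();
  return products?.[0] ?? null; // typed, ready for INSERT; null when nothing matched
}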

A few lines. No scraping fleet. No vendor SFTP feed.
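The audit step can be sketched in a similar spirit. This example assumes both the stored PIM record and the fresh enrichment are flat JSON objects; `findDrift` and the `Drift` shape are hypothetical names for illustration.

```typescript
// One drifted field: what the PIM holds versus what fresh enrichment returned.
interface Drift {
  field: string;
  stored: unknown;
  fresh: unknown;
}

// Hypothetical audit helper: list fields whose stored value differs from a
// fresh enrichment. Fields that came back null are skipped, since null means
// "could not be determined", not "changed".
function findDrift(
  stored: Record<string, unknown>,
  fresh: Record<string, unknown>
): Drift[] {
  const drift: Drift[] = [];
  for (const [field, freshValue] of Object.entries(fresh)) {
    if (freshValue === null) continue;
    // JSON comparison keeps the sketch short; it is order-sensitive for arrays.
    if (JSON.stringify(stored[field]) !== JSON.stringify(freshValue)) {
      drift.push({ field, stored: stored[field], fresh: freshValue });
    }
  }
  return drift;
}
```

Whether drift should overwrite the PIM value or open a review task is a policy decision; the sketch only surfaces the differences.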

Why one endpoint beats five tools

A typical homegrown enrichment stack looks like:

A scraping service (proxies, browser farm, retries)
An extractor service (CSS selectors, ML extractors, per-source code)
A normalizer (unit conversion, currency conversion, dedup)
An image processor (download, resize, CDN upload)
An LLM call to generate descriptions

That's five services, each with its own failure modes, monitoring, and pager rotation. A consolidated JSON product API collapses them into one. You make one request, you get a typed object, you write it to your store.

You can always add specialization on top — image post-processing, custom prose tone, internal taxonomy mapping — but the retrieval and structuring step is the part you should not be building yourself.

What to watch out for

Enrichment is not free of pitfalls. Things to check before you turn it on at scale:

Match confidence. If you have only a vague name, the API may return a product that's close but not the same. For high-stakes catalogs (regulated goods, medical, etc.), gate writes on a confidence check.
Update cadence. Enrichment freezes data in time. Re-run periodically so descriptions and prices don't go stale.
Schema versioning. When you change your category schema, version it. Records enriched under v1 should still be readable, even if v2 has more fields.
Validation on write. Even with a typed API, validate against your own schema on insert. Trust but verify.
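For the match-confidence check, one simple proxy is to compare the name you searched for with the name that came back. The sketch below uses token overlap (Jaccard similarity). It assumes the response does not carry a confidence score of its own, which this article does not establish either way; `nameSimilarity`, `shouldWrite`, and the 0.6 threshold are all hypothetical.

```typescript
// Split a product name into lowercase alphanumeric tokens.
function tokens(name: string): Set<string> {
  return new Set(name.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Hypothetical confidence proxy: Jaccard overlap between the tokens of the
// queried name and the returned name (0 = disjoint, 1 = identical token sets).
function nameSimilarity(query: string, returned: string): number {
  const a = tokens(query);
  const b = tokens(returned);
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((t) => {
    if (b.has(t)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Only write when the match clears a threshold chosen for your catalog.
function shouldWrite(query: string, returned: string, threshold = 0.6): boolean {
  return nameSimilarity(query, returned) >= threshold;
}
```

With these defaults, "Sony WH-1000XM5" against "Sony WH-1000XM5 Wireless Headphones" passes, while "Sony WH-1000XM4" (close, but a different product) does not.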

Try it

One request, with the schema you actually want:

curl "https://productapi.dev/api?search=Sony+WH-1000XM5&fields=brand,anc,battery_life_hours,weight_grams,price_eur" \
  -H "X-API-Key: your-api-key"

Get an API key — 20 free credits, no card required, live in two minutes.

TL;DR

Product catalog enrichment turns sparse records into complete ones.
A schema-defined product data workflow inverts the old "take what you can scrape" model — you declare the shape, the API conforms.
Custom fields per category, typed values, and a single JSON product API replace a stack of five homegrown services.
Slot it into PIM onboarding, backfill, or audit — pick one, ship it this week.