Stage 6 of the Product Data Refinery
In our last issue, we ended with an AI agent shopping for a buyer, picking out a 4-inch flap disc, 80 grit, for 300 series stainless. Let’s say it found yours, and the data was clean enough to win the pick. Then it checks availability. The flap disc is discontinued, or the lead time is three weeks and the buyer needs it Thursday. What happens next is not a Stage 5 question.
Stage 5 made the product findable. Stage 6 decides whether there is anything intelligent to say when the product the buyer wanted is the product they cannot have. This is the final station of the refinery and the culmination of all the work that came before it. Ironically, it is also the outcome practitioners generally want first.
Business framing: what a relationship actually is
Before we get too far in, let’s talk about what a relationship is. When I say my mother, everyone I say it to knows exactly what I mean. The word carries a defined relationship, with its own meaning, its own context, its own rules about how it behaves. My mother is not my brother, and neither one is my wife. Each is a relationship, and each means something specific and different. We would never drop all three into the same context and expect to use those relationships in the same way.
Product relationships are no different. An exact replacement is a different relationship than a substitute, which is a different relationship than a component or a complementary item. Each has a definition, and the definition is what tells every system and every person what the relationship is for. Within B2B procurement, those definitions matter: when they are missing or loose, the buyer’s experience and ability to make a decision both suffer.
Here is what makes this station different in kind from the five before it. Every prior station captured raw item data. SKU Definitions gave the product identity. Regulatory Data cleared it to move. Operational Data captured its physical reality. Fulfillment Data made its commercial terms honest. Marketing Data made it marketable. All of that lives on the record, as data about a single product.
Most companies treat Stage 6 no differently than the previous five stages. They maintain a table, enter relationships manually as they receive them, and then feed that table to downstream applications. That may be necessary at first, but the more scalable way to manage relationships is through rules and logic.
So what does Stage 6 look like? It is a structure adjacent to your products table with a specific set of definitions and prioritization logic. It is a governed rule-set layered across all five prior stations, logic that reasons over the data they already refined. It does not describe a product. It describes how products relate, and it does so by operating on everything beneath it. So even though this station holds no raw item data of its own, the rules and definitions are themselves a first-class asset, governed and maintained with the same seriousness as any field upstream. Here, the logic is the data.
So what should be the approach? The strategy can be simplified into two functions. The first is definition, which acts as a filter. A relationship definition narrows the entire catalog down to the subset of SKUs that legitimately qualify to connect to a given product. The second is prioritization, which is business strategy. Prioritization ranks that qualified subset by what the business is actually trying to optimize. Definition decides what is allowed in. Prioritization decides what comes first. Almost every failure at this station is a breakdown in one or the other. Later on in this blog we go deep into these two functions.
And this is the station that finally speaks to the outside world. It’s where B2B personalization really shines. Every prior stage refined data for its own sake. This one exists to tell downstream consumers how to convey and connect products, whether that consumer is the search engine, the product page, an internal customer service tool, or an agent shopping on a buyer’s behalf. Without this rule set, you can have five stations of pristine data and still no governed way to express how any of it relates inside an application anyone actually uses. The refined fuel just sits in the tank, unconnected.
Which is why it could only ever come last. A rule set that reasons across five stations cannot exist until those five stations produce something worth reasoning over. The capability everyone wanted first, the one we named back when the series started, was always going to be the one built on top of everything else.
What breaks when this stage is weak
The failures here are quieter than a missing image or a wrong price, which is exactly what makes them dangerous. The catalog looks like it has relationships. It just has undefined ones, or wrongly ranked ones, and nobody notices until a buyer does.
The substitution that sends the wrong part is the worst of them. In a B2B context, the wrong part is not a minor inconvenience. The wrong bearing stops a line. The wrong fitting fails a pressure test. The wrong grade of stainless corrodes in an application it was never rated for. When a substitution is offered without a defined relationship type behind it, the buyer cannot tell whether they are being handed an exact equivalent or something merely similar. A bad substitution is worse than no substitution, because no substitution sends the buyer looking and a bad one sends them shipping.
Then there is the subtler version, where the relationship is defined correctly and ranked wrong. The substitutes are all legitimate, but the system surfaces the premium competitor brand ahead of your own private label that carries better margin and ships tomorrow. Or it offers a like-for-like flavor when the buyer needed an exact match and nothing else will do. The family was right. The order was backwards. That is a prioritization failure, and it leaks margin quietly, order after order, without ever looking like a defect.
Relationships that point at dead SKUs are the signature failure of the one-to-one table. Somewhere a relationship was hand-entered, SKU to SKU, the day both were active. One of them gets discontinued upstream, and nothing tells the table. So the reference still resolves, technically, to a product that can no longer be bought, and the buyer clicks the alternate and lands on a dead end. The relationship was correct the day someone typed it. It rotted because the catalog changed underneath it and the table had no way to know.
Competitor cross-references that do not exist are a silent revenue leak. A buyer searches a competitor’s part number, because that is the number printed on the thing they are replacing. If you have mapped that number to your equivalent, you win an order you would never have otherwise seen. If you have not, you return nothing, and the buyer goes to whoever did the mapping. The order was available. The data was missing.
Repair and replacement parts that are not connected to their parent send your own customers back to the OEM. Someone bought a pump from you. A year later they need the seal kit, and if your data does not connect that kit to that pump, they cannot find it on your site, so they go looking somewhere that makes it easy. Often the root cause is that you never had the component list to begin with, because the manufacturer never shared the full bill of materials and half the parts were never marketable enough to publish. You sold the expensive thing and lost the recurring revenue on the parts that keep it running, which is frequently where the margin actually lives.
Then there is the noise problem, the one that looks like a relationship capability but is its opposite: a “customers also bought” list populated entirely from raw transaction coincidence. A buyer purchased a flap disc and a case of bottled water on the same order, so now the catalog recommends water to everyone buying abrasives. The relationship is real in the data and meaningless in the world. Ungoverned behavioral data produces confident nonsense at scale.
Underneath all of it is the failure that kills the capability for good. When relationships are maintained by hand, one row at a time, nobody can keep up, so they decay. Endpoints get discontinued, definitions stay fuzzy, the behavioral data drifts, and within a year the merchandising team has quietly learned not to trust the cross-references. A capability the business invested in becomes a feature nobody uses. The data is still there. The trust is gone.
In the agentic context, every one of these gets less forgiving. The agent does not hover over a questionable substitution and decide to call your branch. It either has a relationship it can trust, or it moves on to a catalog that does. When the primary is unavailable, your relationship data is the only thing standing between you and a silently lost order.
What “good” looks like at this stage
Healthy relationship data starts with a definition, not a connection. Before a single relationship is rendered to anyone, the business has decided what its relationship families are and what each one means: exact, substitute, complementary, component, and the specific rules that qualify a SKU for each. The definition is written down, it means the same thing to every system and every person, and it is the filter that decides what is even eligible to be connected.
Every relationship is typed against those families. A connection is never just related. It carries its family, so the buyer and the agent both know whether they are looking at an exact swap or a suggestion, and the display logic can treat the two differently because the data finally tells it which is which.
The qualified set is then prioritized on purpose. Where several substitutes legitimately exist, they are ranked by what the business is optimizing (margin, private brand, availability, customer relevance), and the ranking reads live data, so a discontinued or out-of-stock SKU drops on its own. Nobody suggests a dead product, not because someone caught it, but because the rule never selects it. The priority is governed, and it is governed as strategy.
Every relationship has a known source, and source determines trust. The business knows whether a connection came from a manufacturer’s cross-reference file, a competitor mapping the team built, an SME’s judgment, or transaction behavior, and a manufacturer-asserted exact carries more weight than a bought-together coincidence rather than landing in the same undifferentiated bucket.
The mature state surprises people: you are no longer maintaining a relationship table at all. The connections are generated by rules running over the governed data, so the table, where one still exists, is an output rather than the source of truth. When you started, you managed relationships one row at a time, SKU to SKU. As you matured, you realized the rows could be produced by the rules and definitions, and administering the relationships themselves became unnecessary. You manage the rules, and the relationships take care of themselves, because the data they stand on is clean enough to compute against.
When all of that is in place, the capability everyone wanted finally works. The substitution saves the sale instead of risking a return. The competitor cross-reference converts a search into an order. The repair part keeps the customer on your site for the life of the equipment. Product discovery and average order value move, measurably, and the rule set you have built becomes something a competitor cannot easily copy. We will come back to that, because it is the part that matters most to leadership.
Building the filter: definition and prioritization
Every station in this refinery has had a filter, shaped by the kind of data it refines. SKU Definitions used sequential governance layers. Regulatory Data used a category-driven rulebook. Operational Data used a structural filter where every product needed every layer. Fulfillment Data was bilateral, distributor-defined and supplier-confirmed. Marketing Data was distributor-standardized content normalization. Each of those filters checked data on its way onto a record.
This filter does not sit in front of a record. It is the rule set itself, and it has two jobs. It defines which products qualify to connect, and it prioritizes the ones that do. Definition is a filter. Prioritization is a business decision. Everything operational at this station lives inside one of those two.
| Step | Function | What it does |
|---|---|---|
| 1 | DEFINITION | Filters the catalog to the qualified set |
| 2 | PRIORITIZATION | Ranks that set by business strategy |
| 3 | AUTOMATION | Generates and maintains the result to scale |
Definition: deciding what qualifies
Definition is the discipline of deciding, precisely, what each relationship means and which SKUs are eligible for it. Before any of it works, both ends of a relationship have to be real. Both products must be active and fully refined through the five prior stations, because a relationship that points at a SKU with a blank attribute set, an unverified lead time, or a discontinued status is not a relationship. It is a guess with a link attached. This is where the sequence of the whole series pays off, or bites.
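As a rough illustration of that endpoint check, here is a minimal Python sketch. The `Sku` fields, the status values, and the SKU numbers are hypothetical stand-ins for whatever your upstream stations actually record; the point is only that a relationship is tested at both ends before it is allowed to exist.

```python
from dataclasses import dataclass


@dataclass
class Sku:
    sku_id: str
    status: str                # assumed values: "active" or "discontinued"
    attributes_complete: bool  # stand-in for "fully refined" through Stages 1 to 5
    lead_time_verified: bool


def is_qualified_endpoint(sku: Sku) -> bool:
    """An endpoint is usable only if it is active and fully refined upstream."""
    return sku.status == "active" and sku.attributes_complete and sku.lead_time_verified


def relationship_is_real(source: Sku, target: Sku) -> bool:
    """Both ends must qualify; otherwise the link is a guess."""
    return is_qualified_endpoint(source) and is_qualified_endpoint(target)


flap_disc = Sku("FD-480", "active", True, True)
alt_disc = Sku("FD-482", "active", True, True)
old_disc = Sku("FD-479", "discontinued", True, True)
draft_disc = Sku("FD-481", "active", False, True)

print(relationship_is_real(flap_disc, alt_disc))    # True
print(relationship_is_real(flap_disc, old_disc))    # False
print(relationship_is_real(flap_disc, draft_disc))  # False
```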
From there, every relationship has to belong to a family, and the families are not interchangeable.
Exacts. Products that are the same product, even when they are not the same SKU. A flap disc sold in a pack of five and the same disc in a single blister pack are two records in the ERP and one product in reality. Exacts also carry conditions underneath them. A unit made in the USA and the same unit sourced globally may be functionally identical and still not interchangeable for a buyer who requires domestic origin, and country of origin is becoming a sharper differentiator every year. Defining an exact already reaches back into Operational Data for packaging and into Regulatory Data for origin. The simplest-sounding family quietly depends on two stations upstream.
Substitutes. Where definition gets demanding, because substitute can mean several things and the business has to be disciplined about which. A substitute can be an exact from a different brand, fit, form, and function equal to the one the buyer searched for. It can also be a like-or-kind alternative, close enough for the job but not identical. The easiest way to see the difference is outside the warehouse: if a buyer wants raspberry and you are out, is the substitute another flavor, or a competing brand entirely? Both are defensible, neither is automatic, and the definition has to say which kinds of substitution are allowed before anything can be ranked.
Components. Conceptually simple and operationally stubborn. A component relationship says this part lives inside that product, the seal kit to the pump, the blade to the saw. The concept is a table of what belongs in what. The difficulty is that a distributor buying from a manufacturer often never receives the full bill of materials, and many of those parts were never marketable in the first place, a bare description and no image, serviced after the sale through a repair shop or an aftermarket channel. Easy to define, hard to populate, and the gap is usually upstream of you.
Complementary. The most subjective family and the most forgiving. A complementary relationship suggests what else goes with this, and done well it grows wallet share by putting capabilities in front of customers who would not have thought to ask, while signaling that you understand their work. It leans on personalization and on historical transaction data from the ERP, and it matures from simple co-purchase logic toward deliberately positioning the most profitable and most relevant products into a given customer’s experience. Forgiving does not mean careless. It means the cost of a miss is a shrug rather than a returned shipment.
| Family | What it is | What it depends on | Where it breaks |
|---|---|---|---|
| Exacts | Same product, different SKU. Pack variants and origin variants where all required marketing data is the same. | Operational Data for packaging. Regulatory Data for origin. | Treating a domestic-origin unit and a globally-sourced one as identical when origin matters to the buyer. |
| Substitutes | A different product that matches fit, form, and function. Direct brand-to-brand equivalents where the critical product requirements are the same and any variations are slight and low-risk. | Clear definitions of which kinds of substitution are allowed, plus the SKU attributes that prove equivalence. | Offering a like-or-kind when an exact was needed. In B2B, a wrong substitution can stop a production line. |
| Components | A part that lives inside a product. The seal kit to the pump, the blade to the saw. | Full bill of materials from the manufacturer, often not shared with the distributor. | No BOM means no connection. The customer goes back to the OEM for the recurring parts you should have sold them. |
| Complementary | What else goes with this. Grows wallet share and signals you understand the customer’s work. | Personalization, historical ERP transaction data, and profitability signals. | Raw co-purchase coincidence. Recommending bottled water to a buyer of abrasives because they once bought both. |
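To make "definition as a filter" concrete, here is a small Python sketch of one family, exacts, under stated assumptions. The `product_key` field (a shared key for pack variants of one product), the SKU numbers, and the `require_same_origin` flag are all hypothetical; your own definition of an exact will have its own conditions.

```python
from dataclasses import dataclass
from enum import Enum


class Family(Enum):
    EXACT = "exact"
    SUBSTITUTE = "substitute"
    COMPONENT = "component"
    COMPLEMENTARY = "complementary"


@dataclass(frozen=True)
class Product:
    sku_id: str
    product_key: str        # hypothetical key shared by pack variants of one product
    country_of_origin: str


def qualifies_as_exact(anchor, candidate, require_same_origin):
    """Same underlying product, different SKU; origin counts only if the buyer requires it."""
    if anchor.sku_id == candidate.sku_id:
        return False
    if anchor.product_key != candidate.product_key:
        return False
    if require_same_origin and anchor.country_of_origin != candidate.country_of_origin:
        return False
    return True


def exacts_for(anchor, catalog, require_same_origin=False):
    """Narrow the catalog to typed, qualified connections for one anchor product."""
    return [
        (Family.EXACT, c.sku_id)
        for c in catalog
        if qualifies_as_exact(anchor, c, require_same_origin)
    ]


catalog = [
    Product("FD-480-1", "flap-disc-4in-80g", "US"),  # single blister pack
    Product("FD-480-5", "flap-disc-4in-80g", "US"),  # pack of five
    Product("FD-480-G", "flap-disc-4in-80g", "MX"),  # globally sourced variant
    Product("FD-460-1", "flap-disc-4in-60g", "US"),  # different grit, not an exact
]
anchor = catalog[0]

print([sku for _, sku in exacts_for(anchor, catalog)])
# ['FD-480-5', 'FD-480-G']
print([sku for _, sku in exacts_for(anchor, catalog, require_same_origin=True)])
# ['FD-480-5']
```

Every result carries its family, so a downstream display or agent never has to guess whether it is looking at an exact swap or a suggestion.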
Finally, every relationship records where it came from, because source determines trust. Manufacturer cross-reference files, in-house competitor mappings, SME judgment, and transaction-derived behavior are not equal, and the definition treats them by trust level rather than by whoever loaded their file last. A relationship you cannot trace is a relationship you cannot defend.
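One way to sketch "trust level rather than load order" in Python is below. The numeric ordering of the four sources is an illustrative assumption, not a recommendation (the only ordering taken from the text is that a manufacturer-asserted exact outranks a bought-together coincidence), and the dictionary field names are hypothetical.

```python
from enum import IntEnum


class SourceTrust(IntEnum):
    # Illustrative ordering only; each business sets its own.
    TRANSACTION_BEHAVIOR = 1
    SME_JUDGMENT = 2
    COMPETITOR_MAPPING = 3
    MANUFACTURER_XREF = 4


def most_trusted(assertions):
    """For each (from, to) pair, keep the assertion with the highest source trust."""
    best = {}
    for a in assertions:
        key = (a["from"], a["to"])
        if key not in best or a["source"] > best[key]["source"]:
            best[key] = a
    return list(best.values())


# The manufacturer file was loaded first and the behavioral feed last;
# load order should not decide which one wins.
assertions = [
    {"from": "FD-480", "to": "FD-481", "family": "exact",
     "source": SourceTrust.MANUFACTURER_XREF},
    {"from": "FD-480", "to": "FD-481", "family": "complementary",
     "source": SourceTrust.TRANSACTION_BEHAVIOR},
]
resolved = most_trusted(assertions)
print(resolved[0]["family"], resolved[0]["source"].name)
# exact MANUFACTURER_XREF
```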
Owned by: Category and Product Data SMEs, dependent on Stages 1 through 5.
Prioritization: deciding what comes first
Definition produces a set of products that legitimately qualify. Prioritization decides the order they appear in, and that order is a business decision, not a technical default. This is the single most complex part of the station, and it is where data discipline hands the wheel to strategy.
Some of it is structural. Relationships have direction, because a premium part substitutes down to a standard one more safely than the reverse, and an exact in one direction is not always an exact in the other. Direction is the first thing prioritization gets right.
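Direction can be modeled by storing relationships as ordered pairs rather than symmetric links. The sketch below assumes one reading of the paragraph above, namely that a premium part may be offered in place of a standard one but not the reverse; the SKU names are hypothetical.

```python
# Each pair reads: (what the buyer wanted, what may be offered instead).
# Illustrative data only.
SUBSTITUTE_EDGES = {
    ("BRG-STD", "BRG-PREM"),  # premium may stand in for standard
}


def can_substitute(wanted, offered, edges=SUBSTITUTE_EDGES):
    """A substitution is allowed only in the direction that was explicitly defined."""
    return (wanted, offered) in edges


print(can_substitute("BRG-STD", "BRG-PREM"))  # True
print(can_substitute("BRG-PREM", "BRG-STD"))  # False
```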
Most of it is strategy. When several substitutes qualify, what ranks first? The brand you hold a margin agreement with? Your own private label, cheaper to you and ready to ship from your shelf? The item that is actually in stock today? Those are P&L and KPI questions, answered by executives and category leaders who decide what the business is optimizing and then encode it as rules the system applies every time. There is usually a brand table behind it, or a cost-based logic, or both, feeding the boost-and-bury decisions that determine what a buyer sees first. Stage 6 consumes the pricing from Stage 4 and feeds the search logic Stage 5 lives in, but the decision about what to favor is made here, deliberately, as relationship strategy.
And prioritization reads live data, which is what makes the decay problem disappear. A discontinued status is just another input. Done right, the rule that ranks substitutes already knows a discontinued SKU should fall to the bottom or out entirely, because you would never want to suggest something a buyer cannot purchase. Nobody runs a cleanup project to bury the dead product. The rule buried it the moment the status changed.
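Here is a minimal Python sketch of a ranking rule that treats status as one more input. The strategy encoded in the sort key (in stock first, then private label, then higher margin) is an assumption made up for illustration, as are the field names and SKUs; the real ordering is whatever leadership decides.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    sku_id: str
    status: str        # read live; assumed values "active" or "discontinued"
    in_stock: bool
    private_label: bool
    margin_pct: float


def rank_substitutes(candidates):
    """Drop unbuyable SKUs, then order the rest by the assumed strategy:
    in stock first, then private label, then higher margin."""
    buyable = [c for c in candidates if c.status == "active"]
    return sorted(
        buyable,
        key=lambda c: (c.in_stock, c.private_label, c.margin_pct),
        reverse=True,
    )


candidates = [
    Candidate("BRAND-A", "active", True, False, 22.0),
    Candidate("HOUSE-1", "active", True, True, 31.0),
    Candidate("BRAND-B", "active", False, False, 40.0),
]
print([c.sku_id for c in rank_substitutes(candidates)])
# ['HOUSE-1', 'BRAND-A', 'BRAND-B']

candidates[1].status = "discontinued"  # the status changes upstream
print([c.sku_id for c in rank_substitutes(candidates)])
# ['BRAND-A', 'BRAND-B']
```

No row was edited between the two calls. The only thing that changed was the status on the record, and the ranking followed it.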
Owned by: Business leadership, encoded in the rules and executed by the system.
Put the two functions together and something changes about how the whole station operates. When definition and prioritization are both expressed as rules running over governed data, relationships stop being records you create and become outputs the rules generate. The relationship table, where one still exists, is a materialized result scoped to a particular application, a clean filter built by a script for a specific use case, not the master anyone hand-edits.
| Managing rows | Managing rules |
|---|---|
| Relationships are records you create one at a time, SKU to SKU. | Relationships are outputs generated by rules running over governed data. |
| Updates happen manually, one record at a time or in batches. | Rules read live data. The system updates itself. |
| A discontinued SKU triggers a cleanup project to find and re-point references. | Discontinued status is just another input the ranking logic already consumes. |
| What you maintain is the table. | What you maintain is the logic. |
| Scale is limited by how many rows people can keep current. | Scale is limited only by how cleanly you have defined the rules. |
That graduation is only possible because the five upstream stations gave you data clean enough to write rules against. The compounding value of the refinery shows up, finally and literally, as relationships that build and maintain themselves.
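A sketch of what "the table is an output" can look like in Python follows. The single substitute rule (same spec, different brand), the field names, and the SKUs are hypothetical; a real rule set would read the governed attributes from the five upstream stations.

```python
def generate_relationships(catalog, rules):
    """Rebuild the relationship rows from rules; nobody edits the rows by hand."""
    rows = []
    for anchor in catalog:
        for family, qualifies in rules.items():
            for candidate in catalog:
                if candidate["sku"] == anchor["sku"]:
                    continue
                if candidate["status"] != "active":
                    continue  # never point at a SKU that cannot be bought
                if qualifies(anchor, candidate):
                    rows.append((anchor["sku"], family, candidate["sku"]))
    return rows


# Hypothetical rule: same spec from a different brand counts as a substitute.
rules = {
    "substitute": lambda a, c: a["spec"] == c["spec"] and a["brand"] != c["brand"],
}

catalog = [
    {"sku": "A-100", "brand": "Acme", "spec": "4in-80g", "status": "active"},
    {"sku": "H-100", "brand": "House", "spec": "4in-80g", "status": "active"},
    {"sku": "Z-100", "brand": "Zeta", "spec": "4in-80g", "status": "discontinued"},
]

table = generate_relationships(catalog, rules)
for row in table:
    print(row)
# ('A-100', 'substitute', 'H-100')
# ('H-100', 'substitute', 'A-100')
# ('Z-100', 'substitute', 'A-100')
# ('Z-100', 'substitute', 'H-100')
```

The discontinued SKU still has substitutes pointing away from it, which is what a buyer landing on it needs, but nothing ever points toward it.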
The leadership mindset for this stage
Leadership has wanted this capability since the beginning of the series. That is not a criticism. The desire is correct. This is where product data stops being infrastructure and starts looking like intelligence, and any leader paying attention wants to get there. The mistake was never wanting it. The mistake was wanting to start there, before there was anything clean enough to build rules on.
Two responsibilities sit squarely with leadership at this station, and both are easy to skip.
The first is the definition. The recurring dysfunction I see is leadership asking for the relationship capability over and over while never once requiring anyone to define what the relationships mean. They want the output and skip the governance, then wonder why the output cannot be trusted. A merchandiser can apply relationship families all day. A merchandiser cannot be the one to decide, unilaterally and without authority, that an exact carries a shipping guarantee and a substitute does not. That is a business definition with real liability behind it, and it has to be owned where the accountability lives.
The second is the prioritization strategy. The ranking rules encode what the business is optimizing, and that is a leadership call, not an engineering one. Whether to lead with the private label, how much to favor a margin-rich brand, when availability should override preference: these are P&L decisions that get baked into the logic and then drive thousands of buyer interactions a day. If leadership does not set that strategy on purpose, the system will rank by accident, and accidental ranking is just someone else’s defaults running your revenue.
There is also the question of what to fund over time, and the answer has changed by the time you reach this station. The ongoing work here is not maintaining a table. It is tending the rules and evolving the definitions as the business changes: a new private brand to favor, a new competitor to map, a new origin requirement to honor. Fund this as a one-time build and the logic ossifies while the business moves. The investment that matters is in the rules, and the reward for making it is that the relationships stop needing babysitting.
One last point, because it is the argument that should change how leadership values the entire series. The moat here is not a big table of cross-references that a competitor could scrape or rebuild. The moat is the governed rule set and the refined data underneath it. Anyone can clean their dimensions or fix their images. A coherent body of definitions and prioritization logic, running over five stations of disciplined data and tuned over years to what the business actually optimizes, is hard to replicate and impossible to simply buy. It is the point where the compounding work of the refinery becomes a real advantage. The businesses that built the foundation first are the only ones who can build this at all. That is the whole argument of the series, arriving at its destination.
Where to start
The fastest way to see the state of your relationship data is to look at the connections you are already being asked for and check whether they hold.
Pull your most-requested competitor cross-references. Every distributor has them, the part numbers customers and salespeople ask you to cross most often, the ones that show up in quote requests and call notes. Take that list and check, one by one, how many resolve to an active, fully refined equivalent in your catalog, and whether the ones that do are both defined and ranked.
What you find will tell you more than the count. The matches you do have will be inconsistent, and when you ask why, you land on both halves of the problem at once. One match is an exact, another is a loose substitute nobody documented as such, a third is somebody’s guess from years ago. That is the definition gap. And where you have several legitimate options, you will usually find them in no particular order, or in an order nobody decided on purpose. That is the prioritization gap. You will have proven the whole thesis to yourself, in your own catalog, in an afternoon.
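If your cross-references already live in a table, that afternoon check can be roughed out in a few lines of Python. The data shapes here (a `xref` dictionary keyed by competitor part number, a `catalog` dictionary keyed by SKU) and the four buckets are assumptions for illustration, and the sketch covers only the definition gap (missing, dead, or untyped mappings); checking ranking still takes a human looking at the order.

```python
def audit_cross_references(requested_parts, xref, catalog):
    """Classify each requested competitor part number by how well it resolves."""
    report = {"missing": [], "dead_endpoint": [], "untyped": [], "ok": []}
    for part in requested_parts:
        entry = xref.get(part)
        if entry is None:
            report["missing"].append(part)        # no mapping at all
        elif catalog.get(entry["sku"], {}).get("status") != "active":
            report["dead_endpoint"].append(part)  # maps to a SKU nobody can buy
        elif entry.get("family") is None:
            report["untyped"].append(part)        # mapped, but nobody said what kind
        else:
            report["ok"].append(part)
    return report


catalog = {
    "H-100": {"status": "active"},
    "H-200": {"status": "discontinued"},
    "H-300": {"status": "active"},
}
xref = {
    "COMP-1": {"sku": "H-100", "family": "exact"},
    "COMP-2": {"sku": "H-200", "family": "substitute"},
    "COMP-3": {"sku": "H-300", "family": None},
}
report = audit_cross_references(["COMP-1", "COMP-2", "COMP-3", "COMP-4"], xref, catalog)
print(report)
# {'missing': ['COMP-4'], 'dead_endpoint': ['COMP-2'], 'untyped': ['COMP-3'], 'ok': ['COMP-1']}
```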
From there the work sequences itself. The families get defined, because you just felt the cost of leaving them fuzzy. The endpoints get qualified against the upstream stations. The ranking logic gets a real owner and a real strategy. And somewhere in the process you stop adding rows and start writing rules, which is the moment the capability begins to scale.
One list of part numbers. One honest check of whether they hold. Most businesses are surprised, and occasionally a little embarrassed, by how much of their relationship data turns out to be assertion rather than rule.
The refinery, complete
That is the sixth station, and with it the refinery is whole. Crude product data arrived at the intake valve as raw supplier identity and moved through regulatory clearance, operational reality, commercial honesty, and merchandising discipline. Five stations, each refining the fuel a little further.
The sixth station is not another tank of fuel. It is the control logic that decides how the fuel gets routed, ranked, and communicated to everyone downstream, the search engine, the product page, the service rep, the agent buying on a customer’s behalf. It holds no raw item data of its own. It is pure rule, layered across everything the other five produced, which is exactly why it could only come last and exactly why it is the payoff.
The station everyone wanted first turns out to be the one that could only ever come last, because it was always logic built on top of data, and the data had to be refined before the logic could be trusted. Get the five stations right, in order, and the sixth becomes close to effortless, rules quietly doing the work that used to take a team and a spreadsheet. Skip ahead, and you write rules over data that cannot support them, which is precisely how the capability everyone wanted most became the one so few of them could trust.
With every station mapped, the series has done its job. It has laid out the refinery end to end, in the order the work has to happen, and shown where data discipline finally hands the wheel to business strategy. What comes next is less about understanding the model and more about running it, station by station, in your own business. That is where the real refining starts.

