FP-Growth · Data Engineering · ML · RFM model

CartoGraphy — Retail Intelligence from 3.2M grocery orders

A full-stack analytics platform built on the Instacart Market Basket dataset. It mines association rules, visualises product co-purchases as a graph, and profiles customers with RFM segmentation, all of which is pre-computed and instantly queryable.

Problem

Retailers sit on enormous volumes of transaction, customer and product data but rarely use it in a form that drives decisions. The core business problem is converting raw data into information and insights that support better-informed decisions. Raw rule lists are hard to scan, and the network structure is invisible in a table. Segmenting customers by loyalty and shopping behaviour is an insight that can directly support targeted marketing. The goal was to build a single interface that takes you from a high-level KPI to a specific product pair to the customer segments that drive it, in seconds, visually and without writing SQL.

Dataset

3.2M

Orders

33M

Line items

49.7k

Products

206k

Customers

The dataset covers over 3.2 million anonymised Instacart orders from ~200k customers over several months. Each order contains a sequence of product IDs with aisle and department metadata. My offline pipeline converts the raw data to Parquet for ~10× faster reads and then produces a set of pre-computed product artifacts that the API loads at startup.

Methodology

1. Association rule mining by FP-Growth

Frequent Pattern Growth (FP-Growth) was chosen over Apriori because its compact FP-tree, built in two scans of the data, avoids Apriori's candidate-generation bottleneck. Mining was run on a 500k-order sample, yielding 4,032 rules. The key challenge was memory: mlxtend's fpgrowth() forces a dense materialization of the one-hot matrix over the full item set. The fix was to pre-filter items to those meeting the minimum support threshold before encoding, reducing the matrix to a manageable size.
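The pre-filtering step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the function name `prefilter_baskets` and the toy orders are assumptions, and it uses only the standard library. It relies on the fact that an item below the support threshold cannot appear in any frequent itemset, so dropping it before encoding should not change the mined rules.

```python
from collections import Counter
from typing import Iterable


def prefilter_baskets(
    baskets: Iterable[Iterable[str]], min_support: float
) -> list[list[str]]:
    """Drop items whose single-item support is below min_support.

    Hypothetical helper: every basket is kept, even if it ends up empty,
    so the number of orders (the support denominator) is unchanged.
    """
    # Deduplicate within each basket so an item counts once per order.
    unique_baskets = [set(basket) for basket in baskets]
    n_orders = len(unique_baskets)
    if n_orders == 0:
        return []

    # Support of a single item = share of orders that contain it.
    counts = Counter(item for basket in unique_baskets for item in basket)
    frequent = {
        item for item, count in counts.items() if count / n_orders >= min_support
    }

    # Keep only frequent items; sort for a deterministic column order later.
    return [sorted(basket & frequent) for basket in unique_baskets]


# Toy data for illustration only (not taken from the Instacart dataset).
orders = [
    ["banana", "milk"],
    ["banana", "bread"],
    ["banana", "milk", "saffron"],
    ["milk"],
]
filtered = prefilter_baskets(orders, min_support=0.5)
print(filtered)
```

The filtered baskets could then be one-hot encoded (for example with mlxtend's TransactionEncoder) and passed to fpgrowth(). The encoded matrix then has one column per surviving item instead of one per product in the catalogue.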

2. Product network

Rules are projected as a directed graph where each antecedent→consequent pair becomes an edge weighted by lift. The top 500 rules by lift produce a 134-node, 437-edge graph, rendered using the cose-bilkent force-directed layout. Nodes are coloured by aisle so cross-category affinity clusters are immediately visible.
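As a rough sketch of this projection, the snippet below turns a list of rules into the nodes-and-edges element format that Cytoscape.js accepts. The `Rule` shape, the `rules_to_elements` name, the `aisle` and `weight` data fields, and the toy rules are all assumptions made for illustration. It also assumes each rule has a single product on each side, which the real rule set may not guarantee.

```python
from typing import TypedDict


class Rule(TypedDict):
    antecedent: str
    consequent: str
    lift: float


def rules_to_elements(
    rules: list[Rule], aisle_of: dict[str, str], top_n: int = 500
) -> dict[str, list[dict]]:
    """Project the top_n rules by lift onto a directed, lift-weighted graph."""
    top = sorted(rules, key=lambda rule: rule["lift"], reverse=True)[:top_n]

    nodes: dict[str, dict] = {}
    edges: list[dict] = []
    for rule in top:
        src, dst = rule["antecedent"], rule["consequent"]
        for product in (src, dst):
            # One node per product; the aisle is carried along for colouring.
            nodes.setdefault(
                product,
                {"data": {"id": product, "aisle": aisle_of.get(product, "unknown")}},
            )
        # Direction follows the rule: antecedent -> consequent.
        edges.append(
            {
                "data": {
                    "id": f"{src}->{dst}",
                    "source": src,
                    "target": dst,
                    "weight": rule["lift"],
                }
            }
        )
    return {"nodes": list(nodes.values()), "edges": edges}


# Toy rules for illustration only; lifts are made up.
toy_rules: list[Rule] = [
    {"antecedent": "tea", "consequent": "honey", "lift": 3.0},
    {"antecedent": "tea", "consequent": "lemon", "lift": 2.0},
    {"antecedent": "honey", "consequent": "lemon", "lift": 1.5},
]
toy_aisles = {"tea": "beverages", "honey": "pantry"}
elements = rules_to_elements(toy_rules, toy_aisles, top_n=2)
print(len(elements["nodes"]), len(elements["edges"]))
```

Because nodes are created only for products that appear in a retained rule, the node count falls out of the rule cut-off rather than being chosen separately.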

3. RFM segmentation

Each customer is scored on Recency (days since last order), Frequency (total orders) and Monetary (total items). Scores are quintile-bucketed, and customers are k-means clustered in RFM space to produce behaviourally coherent segments. Each segment's over- and under-indexed aisles are computed as the ratio of the segment's aisle frequency to the population baseline, giving the "signature basket" for each group.
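Two pieces of this step are easy to illustrate in isolation: quintile scoring and the aisle index. The sketch below is an assumption-laden illustration, not the project's implementation. The function names are hypothetical, ties are broken by input order, "aisle frequency" is taken to mean the share of line items falling in an aisle, and the k-means step is left out entirely.

```python
def quintile_scores(values: list[float], higher_is_better: bool = True) -> list[int]:
    """Score each value from 1 to 5 by its rank-based quintile (5 = best)."""
    n = len(values)
    # Sort indices from worst to best; ties keep their input order.
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    scores = [0] * n
    for rank, idx in enumerate(order):
        # rank 0 is the worst value; the top fifth of ranks scores 5.
        scores[idx] = 1 + (rank * 5) // n
    return scores


def aisle_index(
    segment_counts: dict[str, int], population_counts: dict[str, int]
) -> dict[str, float]:
    """Ratio of a segment's aisle share to the population's aisle share.

    Values above 1 mean the segment over-indexes on that aisle,
    values below 1 mean it under-indexes.
    """
    seg_total = sum(segment_counts.values())
    pop_total = sum(population_counts.values())
    if seg_total == 0 or pop_total == 0:
        return {}
    index: dict[str, float] = {}
    for aisle, pop_count in population_counts.items():
        if pop_count == 0:
            continue  # no baseline to compare against
        seg_share = segment_counts.get(aisle, 0) / seg_total
        pop_share = pop_count / pop_total
        index[aisle] = seg_share / pop_share
    return index


# Toy values for illustration only.
recency_days = [2, 30, 7, 14, 60]
recency_scores = quintile_scores(recency_days, higher_is_better=False)
signature = aisle_index(
    {"produce": 75, "snacks": 25}, {"produce": 500, "snacks": 500}
)
print(recency_scores, signature)
```

Recency is scored with `higher_is_better=False` because fewer days since the last order is the better outcome, while Frequency and Monetary would use the default.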

4. Live recompute

A POST /api/recompute endpoint runs FP-Growth on a 100k-order in-memory sample, returning fresh rules and a new network graph. Results are LRU-cached, so repeated slider positions are instant.

Results

4,032

association rules mined

at 0.3% support, 5% confidence

6.83×

highest lift rule

Grapefruit sparkling water → Lime sparkling water

customer segments

k-means in RFM space, ~206k users

Tech Stack

Frontend

·Next.js App Router
·TanStack Query
·Cytoscape.js
·Tailwind CSS

Backend

·FastAPI
·mlxtend FP-Growth
·pandas Parquet
·LRU-cached Python ETL