CartoGraphy — Retail Intelligence from 3.2M grocery orders
A full-stack analytics platform built on the Instacart Market Basket dataset. It mines association rules, visualises the product co-purchase graph, and profiles customers with RFM segmentation, all pre-computed and instantly queryable.
Problem
Retailers sit on enormous volumes of transaction, customer and product data but rarely have it in a form that drives decisions. The core challenge is turning that raw data into information and insights that support better-informed business decisions. Raw rule lists are hard to scan, and the network structure is invisible in a table. Segmenting customers by loyalty and shopping behaviour yields insight that can drive targeted marketing. The goal was to build a single interface that takes you from a high-level KPI to a specific product pair to the customer segments that drive it, visually and in seconds, without writing SQL.
Dataset
3.2M orders
33M line items
49.7k products
206k customers
The dataset covers over 3.2 million anonymised Instacart orders from ~206k customers over several months. Each order contains a sequence of product IDs with aisle and department metadata. My offline pipeline converts the raw data to Parquet for ~10× faster reads and then produces a set of pre-computed artifacts that the API loads at startup.
Methodology
1. Association rule mining by FP-Growth
Frequent Pattern Growth (FP-Growth) was chosen over Apriori because its compact prefix-tree structure avoids the candidate-generation bottleneck. Mining was run on a 500k-order sample, yielding 4,032 rules. The key challenge was memory: mlxtend's fpgrowth() forced a dense materialization of the one-hot matrix over the full item catalogue. The fix was to pre-filter items to those meeting the minimum support before encoding, which reduces the matrix to a manageable size.
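The pre-filtering step can be sketched in plain Python. This is a minimal illustration, not the project's actual pipeline: the function name `prefilter_transactions` and the toy orders are made up for the example. Dropping items below the support threshold is safe because any itemset containing an infrequent item is itself infrequent. Rows are kept even when they become empty, so supports are still measured against the full order count.

```python
from collections import Counter


def prefilter_transactions(transactions, min_support):
    """Drop items below min_support, then one-hot encode only the survivors.

    `transactions` is a list of orders, each a list of item names.
    Returns (column_names, boolean_matrix). Hypothetical helper.
    """
    n_orders = len(transactions)
    # Count each item at most once per order.
    counts = Counter(item for order in transactions for item in set(order))
    min_count = min_support * n_orders
    keep = sorted(item for item, c in counts.items() if c >= min_count)
    column = {item: j for j, item in enumerate(keep)}

    # The matrix has one column per surviving item, not per catalogue item.
    matrix = [[False] * len(keep) for _ in transactions]
    for i, order in enumerate(transactions):
        for item in order:
            j = column.get(item)
            if j is not None:
                matrix[i][j] = True
    return keep, matrix


# Toy data, made up for illustration.
orders = [
    ["banana", "lime water", "grapefruit water"],
    ["banana", "lime water"],
    ["banana", "saffron"],
    ["lime water", "grapefruit water"],
]
columns, onehot = prefilter_transactions(orders, min_support=0.5)
print(columns)  # "saffron" appears in 1 of 4 orders and is dropped
```

The resulting matrix and column names could then be wrapped in a boolean DataFrame and handed to the miner. With a catalogue of tens of thousands of products and a low support threshold, most columns never need to exist.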
2. Product network
Rules are projected as a directed graph in which each antecedent→consequent pair becomes an edge weighted by lift. The top 500 rules by lift produce a 134-node, 437-edge graph, rendered in Cytoscape.js with the cose-bilkent force-directed layout. Nodes are coloured by aisle, so cross-category affinity clusters are immediately visible.
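A rough sketch of that projection follows. It assumes rules arrive as dicts with `antecedents`, `consequents` and `lift` keys, and that multi-item rules are expanded into item-to-item pairs keeping the highest lift per pair. Both are assumptions, since the write-up does not say how multi-item rules are handled. The function name `rules_to_elements` and the toy rules are hypothetical. The output follows the Cytoscape.js elements JSON shape (`nodes` and `edges`, each with a `data` object).

```python
def rules_to_elements(rules, top_n=500):
    """Project the top_n rules by lift onto a directed product graph."""
    top = sorted(rules, key=lambda r: r["lift"], reverse=True)[:top_n]
    nodes, edges = {}, {}
    for rule in top:
        for src in rule["antecedents"]:
            for dst in rule["consequents"]:
                nodes.setdefault(src, {"data": {"id": src}})
                nodes.setdefault(dst, {"data": {"id": dst}})
                key = (src, dst)
                # Assumption: if several rules map to the same pair,
                # keep the edge with the highest lift.
                if key not in edges or rule["lift"] > edges[key]["data"]["weight"]:
                    edges[key] = {
                        "data": {
                            "id": f"{src}->{dst}",
                            "source": src,
                            "target": dst,
                            "weight": rule["lift"],
                        }
                    }
    return {"nodes": list(nodes.values()), "edges": list(edges.values())}


# Toy rules with made-up lifts.
rules = [
    {"antecedents": ["A"], "consequents": ["B"], "lift": 3.0},
    {"antecedents": ["A", "C"], "consequents": ["B"], "lift": 4.5},
    {"antecedents": ["B"], "consequents": ["C"], "lift": 1.2},
]
elements = rules_to_elements(rules, top_n=2)
print(len(elements["nodes"]), len(elements["edges"]))
```

Under this expansion, two rules can collapse onto the same product pair, which is one plausible way for a rule set to yield fewer edges than rules.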
3. RFM segmentation
Each customer is scored on Recency (days since last order), Frequency (total orders) and Monetary (total items). Scores are quintile-bucketed, and customers are k-means clustered in RFM space to produce behaviourally coherent segments. Each segment's over- and under-indexed aisles are computed as the ratio of the segment's aisle frequency to the population baseline, giving the "signature basket" for each group.
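The scoring and indexing steps can be illustrated with a small standard-library sketch. It makes several assumptions: quintiles are assigned by rank (ties broken by input order), "aisle frequency" is read as an aisle's share of items, and the helper names `quintile_scores` and `aisle_index` plus all numbers are invented for the example. The k-means step is omitted.

```python
def quintile_scores(values, higher_is_better=True):
    """Rank-based quintile score (1 = worst, 5 = best) for each value."""
    n = len(values)
    # Worst value first, so it receives rank 0 and therefore score 1.
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    scores = [0] * n
    for rank, i in enumerate(order):
        scores[i] = 1 + (rank * 5) // n
    return scores


def aisle_index(segment_counts, population_counts):
    """Segment aisle share divided by population aisle share.

    A value above 1 means the segment over-indexes on that aisle.
    """
    seg_total = sum(segment_counts.values())
    pop_total = sum(population_counts.values())
    return {
        aisle: (segment_counts.get(aisle, 0) / seg_total) / (pop_count / pop_total)
        for aisle, pop_count in population_counts.items()
    }


# Five made-up customers.
recency_days = [2, 30, 9, 60, 15]    # lower is better
total_orders = [40, 3, 12, 5, 20]    # higher is better
total_items = [300, 20, 90, 35, 150]  # higher is better

rfm = list(zip(
    quintile_scores(recency_days, higher_is_better=False),
    quintile_scores(total_orders),
    quintile_scores(total_items),
))

# Made-up item counts per aisle.
population = {"fresh fruits": 500, "sparkling water": 100, "baby food": 400}
segment = {"fresh fruits": 50, "sparkling water": 40, "baby food": 10}
index = aisle_index(segment, population)
print(rfm[0], index)
```

In this toy data the segment buys sparkling water at about four times the population rate and baby food at about a quarter of it, which is the kind of contrast a "signature basket" surfaces.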
4. Live recompute
A POST /api/recompute endpoint runs FP-Growth on a 100k-order in-memory sample and returns fresh rules and a new network graph. Results are LRU-cached, so repeated slider positions return instantly.
Results
4,032 association rules mined (at 0.3% support, 5% confidence)
6.83× highest-lift rule (Grapefruit sparkling water → Lime sparkling water)
9 customer segments (k-means in RFM space, ~206k users)
Tech Stack
Frontend
- Next.js App Router
- TanStack Query
- Cytoscape.js
- Tailwind CSS
Backend
- FastAPI
- mlxtend FP-Growth
- pandas + Parquet
- LRU-cached Python ETL