Instacart User Data: A Market Basket Analysis & Recommendation Engine
Published:
This post walks through a detailed analysis of Instacart user data, using it to study shopping trends and to build a machine-learning recommendation engine that suggests products to users as they fill their carts.
Instacart Market Basket Analysis – Predictive Deep Dive
This notebook provides an in-depth analysis of the Instacart dataset with a predictive focus.
Data sources:
orders.csv · order_products__prior.csv · order_products__train.csv · products.csv · aisles.csv · departments.csv
Prediction target: Given a user's full purchase history (prior orders) and optionally their current in-session cart, predict which products they will buy next, evaluated against the held-out train orders.
Table of Contents
| # | Section | Description |
|---|---|---|
| 1 | Setup & Data Loading | Import libraries, mount Google Drive, load all 6 CSV files from zip |
| 2 | Dataset Overview | Schema inspection, null checks, row counts, eval-set distribution |
| 3 | EDA – Orders & Timing | When do customers shop? Day-of-week, hour-of-day, order cadence |
| 4 | EDA – Products & Categories | Top products, department breakdown, aisle-level reorder rates |
| 5 | EDA – User Behavior | Cart-size distribution, reorder tendencies, user loyalty patterns |
| 6 | Feature Engineering | Product-, user-, and user-product interaction features for ML |
| 7 | Association Rule Mining | Market basket analysis with Apriori; support, confidence, and lift |
| 8 | Next-Item Prediction Strategies | 5 strategies, from a simple popularity baseline to LightGBM |
| 9 | Model Evaluation | Precision@K, Recall@K, F1@K, NDCG@K, Hit Rate@K benchmarks |
| 10 | Interactive Demo & Leaderboard | Live recommendation demos and final strategy leaderboard |
1. Setup & Data Loading
All six Instacart CSV files are read directly from a zip archive on Google Drive. Casting columns to compact int8/int16/int32 dtypes immediately after loading keeps RAM manageable for the 30 M+ row dataset.
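Before the setup cell, a quick illustration of why the downcasting matters (toy column, not the real data): int64 to int8 is an 8x reduction in per-column memory, which adds up quickly across a 32M-row table.

```python
import numpy as np
import pandas as pd

# Downcasting int64 -> int8 cuts per-column memory 8x; across the
# 32M-row order_products__prior table the savings are substantial.
col = pd.Series(np.zeros(1_000_000, dtype='int64'))
print(col.memory_usage(index=False, deep=True))                 # 8000000 bytes
print(col.astype('int8').memory_usage(index=False, deep=True))  # 1000000 bytes
```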
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import re
import zipfile
import os
from collections import defaultdict, Counter
from itertools import combinations
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
import warnings
from google.colab import drive
import gc
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (roc_auc_score, average_precision_score, classification_report)
from sklearn.model_selection import GroupShuffleSplit
import time
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.4f}'.format)
%matplotlib inline
print('✅ Libraries loaded')
✅ Libraries loaded
# Mount drive to access data
drive.mount('/content/drive/')
Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
data = {}
# Read files from zip
with zipfile.ZipFile('/content/drive/MyDrive/instacart/instacart.zip', 'r') as z:
    file_list = [f for f in z.namelist() if f.endswith('.csv')]
    print(f'Files in archive: {file_list}\n')
    for fn in file_list:
        key = fn.replace('.csv', '')
        # Load each CSV into the data dict, keyed by filename stem
        data[key] = pd.read_csv(z.open(fn))
        print(f'  ✓ {fn}: {data[key].shape[0]:,} rows × {data[key].shape[1]} cols')
print('\n✅ All datasets loaded')
Files in archive: ['aisles.csv', 'departments.csv', 'order_products__prior.csv', 'order_products__train.csv', 'orders.csv', 'products.csv']
  ✓ aisles.csv: 134 rows × 2 cols
  ✓ departments.csv: 21 rows × 2 cols
  ✓ order_products__prior.csv: 32,434,489 rows × 4 cols
  ✓ order_products__train.csv: 1,384,617 rows × 4 cols
  ✓ orders.csv: 3,421,083 rows × 7 cols
  ✓ products.csv: 49,688 rows × 4 cols
✅ All datasets loaded
2. Dataset Overview
The dataset ships as six files that join on shared keys:
| File | Key columns | Purpose |
|---|---|---|
| orders | order_id, user_id | One row per order; includes timing and eval_set label |
| order_products__prior | order_id, product_id | Every item in every prior order |
| order_products__train | order_id, product_id | Items in the single held-out train order per user |
| products | product_id, aisle_id, department_id | Product names and taxonomy |
| aisles | aisle_id | Aisle labels |
| departments | department_id | Department labels |
The eval_set column in orders partitions every user's orders into three splits:
- prior – all historical orders, used to build user profiles and train models
- train – one held-out order per user that acts as the ground truth for model evaluation
- test – the competition hold-out (no order-product rows provided)
# Inspect schema
print('=' * 65)
print('DATASET SCHEMA & NULL REPORT')
print('=' * 65)
for name, df in data.items():
    nulls = df.isnull().sum()
    null_info = {c: n for c, n in nulls.items() if n > 0}
    print(f'\n{name}')
    print(f'  Rows: {df.shape[0]:,} | Cols: {df.shape[1]}')
    print(f'  Columns: {list(df.columns)}')
    if null_info:
        print(f'  ⚠️ Nulls: {null_info}')
    else:
        print(f'  ✓ No nulls')
    display(df.head(3))
=================================================================
DATASET SCHEMA & NULL REPORT
=================================================================
aisles
  Rows: 134 | Cols: 2
  Columns: ['aisle_id', 'aisle']
  ✓ No nulls

| | aisle_id | aisle |
|---|---|---|
| 0 | 1 | prepared soups salads |
| 1 | 2 | specialty cheeses |
| 2 | 3 | energy granola bars |

departments
  Rows: 21 | Cols: 2
  Columns: ['department_id', 'department']
  ✓ No nulls

| | department_id | department |
|---|---|---|
| 0 | 1 | frozen |
| 1 | 2 | other |
| 2 | 3 | bakery |

order_products__prior
  Rows: 32,434,489 | Cols: 4
  Columns: ['order_id', 'product_id', 'add_to_cart_order', 'reordered']
  ✓ No nulls

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |

order_products__train
  Rows: 1,384,617 | Cols: 4
  Columns: ['order_id', 'product_id', 'add_to_cart_order', 'reordered']
  ✓ No nulls

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 1 | 49302 | 1 | 1 |
| 1 | 1 | 11109 | 2 | 1 |
| 2 | 1 | 10246 | 3 | 0 |

orders
  Rows: 3,421,083 | Cols: 7
  Columns: ['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow', 'order_hour_of_day', 'days_since_prior_order']
  ⚠️ Nulls: {'days_since_prior_order': 206209}

| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0000 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0000 |

products
  Rows: 49,688 | Cols: 4
  Columns: ['product_id', 'product_name', 'aisle_id', 'department_id']
  ✓ No nulls

| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
# Inspect the eval data
eval_counts = data['orders']['eval_set'].value_counts()
print('Orders by eval_set:')
for k, v in eval_counts.items():
    print(f'  {k:8s}: {v:>10,} orders ({v/len(data["orders"])*100:.1f}%)')
print(f'\nTotal unique users: {data["orders"]["user_id"].nunique():,}')
print(f'Total unique products: {data["products"].shape[0]:,}')
print(f'Total aisles: {data["aisles"].shape[0]} | Departments: {data["departments"].shape[0]}')
Orders by eval_set:
prior : 3,214,874 orders (94.0%)
train : 131,209 orders (3.8%)
test : 75,000 orders (2.2%)
Total unique users: 206,209
Total unique products: 49,688
Total aisles: 134 | Departments: 21
Note on the data split: Each user's most recent order is held out and labelled either train (131,209 users) or test (75,000 users); all remaining orders are labelled prior. Models are trained on prior interactions and evaluated by predicting the contents of the train order. The test split has no product labels and is not used in this notebook.
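A quick sanity check on this structure, sketched here with a toy frame that mirrors the Instacart schema (on the real data you would run the same groupby against data['orders']):

```python
import pandas as pd

# Toy orders frame: each user's final order is held out as 'train' or
# 'test'; all earlier orders are 'prior'.
orders = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'order_number': [1, 2, 3, 1, 2, 3],
    'eval_set': ['prior', 'prior', 'train', 'prior', 'prior', 'test'],
})

# Every user should have exactly one held-out (train or test) order.
held_out = (orders[orders['eval_set'].isin(['train', 'test'])]
            .groupby('user_id')
            .size())
print(held_out.eq(1).all())  # True
```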
3. EDA – Orders & Timing
Understanding when customers shop reveals weekly and intra-day demand rhythms that carry useful signals for models; e.g., a user who always shops on Sunday mornings is likely to continue that pattern. This section also examines how often users return and how their inter-order gap is distributed.
# Normalize product names: lowercase, strip punctuation and digits, collapse whitespace
data['products']['product_name'] = (
    data['products']['product_name']
    .str.lower()
    .str.replace(r'[^\w\s]', ' ', regex=True)
    .str.replace(r'\d+', ' ', regex=True)
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)
# Merge the aisle/department codes with their labels
products = (data['products']
            .merge(data['aisles'], on='aisle_id', how='left')
            .merge(data['departments'], on='department_id', how='left')
            .drop(['aisle_id', 'department_id'], axis=1)
            .rename(columns={'product_name': 'product'}))
print(f'Products enriched: {products.shape}')
products.head(3)
Products enriched: (49688, 4)
| product_id | product | aisle | department | |
|---|---|---|---|---|
| 0 | 1 | chocolate sandwich cookies | cookies cakes | snacks |
| 1 | 2 | all seasons salt | spices seasonings | pantry |
| 2 | 3 | robust golden unsweetened oolong tea | tea | beverages |
# Visualize order volumes
fig, axes = plt.subplots(1, 2, figsize=(20, 5))
dow_labels = {0: 'Sun', 1: 'Mon', 2: 'Tue', 3: 'Wed', 4: 'Thu', 5: 'Fri', 6: 'Sat'}
dow_counts = data['orders']['order_dow'].map(dow_labels).value_counts().reindex(['Sun','Mon','Tue','Wed','Thu','Fri','Sat'])
colors_dow = sns.color_palette('husl', 7)
bars = axes[0].bar(dow_counts.index, dow_counts.values, color = colors_dow, edgecolor = 'white', linewidth = 0.8)
axes[0].set_title('Orders by Day of Week', fontsize = 14, fontweight = 'bold')
axes[0].set_xlabel('Day of Week'); axes[0].set_ylabel('Number of Orders')
# Annotate each bar with its order count
for bar, v in zip(bars, dow_counts.values):
    axes[0].text(bar.get_x() + bar.get_width()/2, v + 1500, f'{v:,}', ha='center', va='bottom', fontsize=8)
hour_counts = data['orders']['order_hour_of_day'].value_counts().sort_index()
axes[1].plot(hour_counts.index, hour_counts.values, marker = 'o', linewidth = 2.5, color = '#2196F3', markersize = 5)
axes[1].fill_between(hour_counts.index, hour_counts.values, alpha=0.15, color='#2196F3')
axes[1].axvspan(8, 11, alpha=0.1, color='orange', label='Peak morning (8-11)')
axes[1].axvspan(14, 17, alpha=0.1, color='green', label='Peak afternoon (14-17)')
axes[1].set_title('Orders by Hour of Day', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Hour of Day'); axes[1].set_ylabel('Number of Orders')
axes[1].set_xticks(range(0, 24)); axes[1].legend()
plt.suptitle('When Do Customers Shop?', fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
print(f'Peak ordering day: {dow_counts.idxmax()} ({dow_counts.max():,} orders)')
print(f'Peak ordering hour: {hour_counts.idxmax()}:00 ({hour_counts.max():,} orders)')
Peak ordering day: Sun (600,905 orders)
Peak ordering hour: 10:00 (288,418 orders)
# Visualize user data
fig, axes = plt.subplots(1, 2, figsize = (20, 5))
# Days since prior order
dspo = data['orders']['days_since_prior_order'].dropna()
axes[0].hist(dspo, bins = 31, color = '#4CAF50', edgecolor = 'white', alpha = 0.85)
axes[0].axvline(dspo.mean(), color = 'red', linestyle = '--', lw = 2, label = f'Mean: {dspo.mean():.1f}d')
axes[0].axvline(dspo.median(), color = 'orange', linestyle = '--', lw = 2, label = f'Median: {dspo.median():.0f}d')
axes[0].axvline(7, color = 'gray', linestyle = ':', alpha = 0.7, label = 'Weekly')
axes[0].axvline(30, color = 'gray', linestyle = ':', alpha = 0.7, label = 'Monthly')
axes[0].set_title('Days Since Prior Order Distribution', fontsize = 14, fontweight = 'bold')
axes[0].set_xlabel('Days'); axes[0].set_ylabel('Frequency')
axes[0].legend()
# Orders per user
orders_per_user = data['orders'].groupby('user_id')['order_id'].count()
axes[1].hist(orders_per_user, bins=50, color='#9C27B0', edgecolor='white', alpha=0.85)
axes[1].axvline(orders_per_user.mean(), color='red', linestyle='--', lw=2, label=f'Mean: {orders_per_user.mean():.1f}')
axes[1].set_title('Orders per User (all-time)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Number of Orders'); axes[1].set_ylabel('Number of Users')
axes[1].legend()
plt.tight_layout()
plt.show()
print(f'Average days between orders: {dspo.mean():.1f}')
print(f'Orders placed ≤7d after the previous: {(dspo <= 7).mean()*100:.1f}%')
print(f'Orders placed ~monthly (25–35d gap): {(dspo.between(25,35)).mean()*100:.1f}%')
print(f'Users with 10+ orders: {(orders_per_user >= 10).sum():,} ')
print(f'({(orders_per_user >= 10).mean()*100:.1f}%)')
Average days between orders: 11.1
Orders placed ≤7d after the previous: 50.4%
Orders placed ~monthly (25–35d gap): 14.8%
Users with 10+ orders: 110,728
(53.7%)
# Truncate the data and track memory
all_user_ids = data['orders']['user_id'].unique()
sampled_uids = set(np.random.default_rng(100).choice(all_user_ids, size = min(150000, len(all_user_ids)), replace = False))
print(f'Working with {len(sampled_uids):,} sampled users')
# Filter the orders index to only sampled users (still cheap)
orders_slim = data['orders'][data['orders']['user_id'].isin(sampled_uids)][['order_id','user_id','eval_set','order_number', 'order_dow','order_hour_of_day','days_since_prior_order']]
# Cast to smaller dtypes immediately
orders_slim = orders_slim.astype({'order_id': 'int32',
'user_id': 'int32',
'order_number': 'int16',
'order_dow': 'int8',
'order_hour_of_day': 'int8',})
print(f'orders_slim: {orders_slim.shape} | {orders_slim.memory_usage(deep = True).sum()/1e6:.1f} MB')
# Get only the order_ids belonging to sampled users
sampled_order_ids = set(orders_slim['order_id'].values)
# Filter prior and train BEFORE any merge
prior_raw = data['order_products__prior'][data['order_products__prior']['order_id'].isin(sampled_order_ids)].astype({'order_id':'int32','product_id':'int32','add_to_cart_order':'int16','reordered':'int8'})
train_raw = data['order_products__train'][data['order_products__train']['order_id'].isin(sampled_order_ids)].astype({'order_id':'int32','product_id':'int32','add_to_cart_order':'int16','reordered':'int8'})
print(f'prior_raw: {prior_raw.shape[0]:,} rows | {prior_raw.memory_usage(deep = True).sum()/1e6:.1f} MB')
print(f'train_raw: {train_raw.shape[0]:,} rows | {train_raw.memory_usage(deep = True).sum()/1e6:.1f} MB')
# Downcast products lookup (tiny, but keeps dtypes consistent)
products_slim = products.copy()
products_slim['product_id'] = products_slim['product_id'].astype('int32')
# Enrich prior and train separately: small frames now, safe to merge
def enrich(raw_df, eval_label):
    return (raw_df.merge(products_slim, on='product_id', how='left')
                  .merge(orders_slim.drop(columns='eval_set'), on='order_id', how='left')
                  .assign(eval_set=eval_label))
prior_op = enrich(prior_raw, 'prior')
del prior_raw; gc.collect()
train_op = enrich(train_raw, 'train')
del train_raw; gc.collect()
orders_master = pd.concat([prior_op, train_op], ignore_index = True)
print(f'\norders_master: {orders_master.shape}')
print(f' prior rows: {prior_op.shape[0]:,}')
print(f' train rows: {train_op.shape[0]:,}')
print(f' total RAM: {orders_master.memory_usage(deep=True).sum()/1e6:.1f} MB')
Working with 150,000 sampled users
orders_slim: (2485904, 7) | 203.8 MB
prior_raw: 23,525,045 rows | 447.0 MB
train_raw: 1,007,429 rows | 19.1 MB
orders_master: (24532474, 13)
prior rows: 23,525,045
train rows: 1,007,429
total RAM: 6748.3 MB
4. EDA – Products & Categories
This section examines the product catalogue: which items are ordered most, how order volume breaks down by department, and, crucially, which categories command the highest reorder rates. High reorder rate is the strongest single predictor used in later models.
# Visualize top products
fig, axes = plt.subplots(1, 3, figsize = (20, 6))
# Top 20 products
top_prod = prior_op['product'].value_counts().head(20)
axes[0].barh(top_prod.index[::-1], top_prod.values[::-1], color = sns.color_palette('viridis', 20))
axes[0].set_title('Top 20 Products (Prior Orders)', fontsize = 12, fontweight = 'bold')
axes[0].set_xlabel('Order Count')
for i, v in enumerate(top_prod.values[::-1]):
    axes[0].text(v + 1000, i, f'{v:,}', va='center', fontsize=7)
# Department breakdown
dept_counts = prior_op['department'].value_counts().head(12)
axes[1].bar(range(len(dept_counts)), dept_counts.values, color = sns.color_palette('rocket', len(dept_counts)))
axes[1].set_xticks(range(len(dept_counts)))
axes[1].set_xticklabels(dept_counts.index, rotation=40, ha='right', fontsize=9)
axes[1].set_title('Orders by Department', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Order Count')
# Reorder rate by department
dept_reorder = prior_op.groupby('department')['reordered'].mean().sort_values(ascending=False).head(12)
bars = axes[2].barh(dept_reorder.index[::-1], dept_reorder.values[::-1], color = sns.color_palette('mako', len(dept_reorder)))
axes[2].axvline(0.5, color='red', linestyle='--', alpha=0.6, label='50% line')
axes[2].set_title('Reorder Rate by Department', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Reorder Rate')
axes[2].legend()
for bar in bars:
    w = bar.get_width()
    axes[2].text(w + 0.005, bar.get_y() + bar.get_height()/2, f'{w:.1%}', va='center', fontsize=8)
plt.suptitle('Product & Category Landscape', fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
print(f'Overall reorder rate: {prior_op["reordered"].mean()*100:.1f}%')
print(f'Highest reorder dept: {dept_reorder.idxmax()} ({dept_reorder.max():.1%})')
print(f'Lowest reorder dept: {dept_reorder.idxmin()} ({dept_reorder.min():.1%})')
Overall reorder rate: 58.9%
Highest reorder dept: dairy eggs (67.0%)
Lowest reorder dept: breakfast (56.1%)
# Visualize frequency of product categories
aisle_summary = (prior_op
                 .groupby('aisle')
                 .agg(order_count=('order_id', 'count'),
                      reorder_rate=('reordered', 'mean'))
                 .reset_index()
                 .sort_values('order_count', ascending=False)
                 .head(30))
fig, ax = plt.subplots(figsize=(20, 8))
scatter = ax.scatter(aisle_summary['order_count'],
aisle_summary['reorder_rate'],
c = aisle_summary['reorder_rate'],
cmap = 'RdYlGn',
s = 100,
alpha = 0.8,
edgecolors = 'gray',
linewidth = 0.5)
for _, row in aisle_summary.iterrows():
    ax.annotate(row['aisle'], (row['order_count'], row['reorder_rate']), fontsize=7, ha='left', va='bottom', alpha=0.85)
plt.colorbar(scatter, ax=ax, label='Reorder Rate')
ax.set_xlabel('Total Orders (Prior)', fontsize=12)
ax.set_ylabel('Reorder Rate', fontsize=12)
ax.set_title('Top 30 Aisles: Volume vs Reorder Rate\n(high-right = popular & habitual)', fontsize = 13, fontweight='bold')
ax.axhline(prior_op['reordered'].mean(), color='gray', linestyle='--', alpha=0.5, label='Avg reorder rate')
ax.legend()
plt.tight_layout()
plt.show()
5. EDA – User Behavior
Having explored the product side, we now profile the users: how large are their baskets, how habitual are they (do they keep buying the same items?), and how does reorder tendency vary across the user base? These patterns motivate several of the features engineered in Section 6.
# Visualize cart size and user behavior patterns
fig, axes = plt.subplots(1, 3, figsize=(20, 5))
# Cart size distribution
cart_sizes = prior_op.groupby('order_id')['product_id'].count()
axes[0].hist(cart_sizes.clip(upper=50), bins=50, color='#FF5722', edgecolor='white', alpha=0.85)
axes[0].axvline(cart_sizes.mean(), color='black', linestyle='--', lw = 2, label = f'Mean: {cart_sizes.mean():.1f}')
axes[0].set_title('Cart Size Distribution', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Items per Order (capped at 50)'); axes[0].set_ylabel('Orders')
axes[0].legend()
# Reorder rate by cart position
atco = prior_op.groupby('add_to_cart_order')['reordered'].mean()
atco = atco[atco.index <= 30]
axes[1].plot(atco.index, atco.values, marker='o', linewidth=2.5, color='#FF9800')
axes[1].fill_between(atco.index, atco.values, alpha=0.15, color='#FF9800')
axes[1].set_title('Reorder Rate by Cart Position', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Add-to-Cart Position'); axes[1].set_ylabel('Reorder Rate')
axes[1].set_ylim(0, 1)
axes[1].annotate('First items\nare most habitual', xy=(1, atco.iloc[0]), xytext=(5, 0.85), arrowprops=dict(arrowstyle='->', color='gray'), fontsize=9)
# User reorder rate distribution
user_reorder = prior_op.groupby('user_id')['reordered'].mean()
axes[2].hist(user_reorder, bins=40, color='#3F51B5', edgecolor='white', alpha=0.85)
axes[2].axvline(user_reorder.mean(), color='red', linestyle='--', lw=2,
label=f'Mean: {user_reorder.mean():.2f}')
axes[2].set_title('User Reorder Rate Distribution', fontsize=13, fontweight='bold')
axes[2].set_xlabel('Reorder Rate'); axes[2].set_ylabel('Number of Users')
axes[2].legend()
plt.suptitle('User Behavior Patterns', fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
print(f'Average cart size: {cart_sizes.mean():.1f} items')
print(f'Median cart size: {cart_sizes.median():.0f} items')
print(f'% users with reorder rate > 50%: {(user_reorder > 0.5).mean()*100:.1f}%')
print(f'First-position reorder rate: {atco.iloc[0]:.1%}')
Average cart size: 10.1 items
Median cart size: 8 items
% users with reorder rate > 50%: 38.4%
First-position reorder rate: 67.7%
6. Feature Engineering
We engineer three tiers of features that capture different aspects of purchasing behaviour:
| Tier | Features | Captures |
|---|---|---|
| Product | Order count, reorder rate, avg/std cart position, global popularity rank | How popular and habitual a product is across all users |
| User | Total orders, avg cart size, reorder rate, avg/std days between orders, variety ratio | A user's overall shopping style and cadence |
| User × Product | Times ordered, reorder rate, avg cart position, recency gap, reorder score | How strongly this user is attached to this product |
The reorder_score = (times_ordered × reorder_rate) / (orders_since_last + 1) acts as a recency-weighted loyalty signal and is used in several strategies.
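A concrete illustration of the formula with toy numbers (not taken from the dataset): frequency and habit push the score up, while a long gap since the last purchase pulls it down.

```python
def reorder_score(times_ordered, reorder_rate, orders_since_last):
    """Recency-weighted loyalty: frequent, habitual, recently-bought
    products score highest."""
    return (times_ordered * reorder_rate) / (orders_since_last + 1)

# A product bought 10 times, reordered 90% of the time, present in the
# user's most recent order (gap = 0):
print(reorder_score(10, 0.9, 0))   # 9.0
# The same product last seen 5 orders ago scores far lower:
print(reorder_score(10, 0.9, 5))   # 1.5
```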
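The code cell that builds the product-feature table below did not survive the export. Here is a minimal sketch of the aggregation it likely performs, with column names read off the output; the real cell reports 11 columns in total, while this sketch covers only the displayed ones. The toy frame and the helper name build_product_features are illustrative assumptions.

```python
import pandas as pd

def build_product_features(prior_op: pd.DataFrame) -> pd.DataFrame:
    """Aggregate product-level features from enriched prior order lines.

    Expects one row per (order, product) with columns: order_id,
    product_id, product, department, user_id, reordered,
    add_to_cart_order.
    """
    return (prior_op
            .groupby(['product_id', 'product', 'department'])
            .agg(order_count=('order_id', 'count'),
                 reorder_rate=('reordered', 'mean'),
                 unique_users=('user_id', 'nunique'),
                 avg_cart_position=('add_to_cart_order', 'mean'))
            .reset_index()
            .sort_values('order_count', ascending=False)
            .reset_index(drop=True))

# Toy demonstration (illustrative rows, not real Instacart data):
toy = pd.DataFrame({
    'order_id': [1, 1, 2, 2, 3],
    'product_id': [10, 20, 10, 20, 10],
    'product': ['banana', 'limes', 'banana', 'limes', 'banana'],
    'department': ['produce'] * 5,
    'user_id': [1, 1, 2, 2, 1],
    'reordered': [1, 0, 0, 1, 1],
    'add_to_cart_order': [1, 2, 1, 2, 1],
})
feats = build_product_features(toy)
print(feats)
```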
Product features shape: (49549, 11)
Top 10 most ordered products:
| | product | department | order_count | reorder_rate | unique_users | avg_cart_position |
|---|---|---|---|---|---|---|
| 0 | banana | produce | 342433 | 0.8432 | 53698 | 4.8967 |
| 1 | bag of organic bananas | produce | 274530 | 0.8316 | 46219 | 5.0874 |
| 2 | organic strawberries | produce | 191724 | 0.7767 | 42804 | 7.2159 |
| 3 | organic baby spinach | produce | 175904 | 0.7720 | 40100 | 7.3903 |
| 4 | organic hass avocado | produce | 155268 | 0.7971 | 31499 | 6.7534 |
| 5 | organic avocado | produce | 128942 | 0.7588 | 31102 | 6.4500 |
| 6 | large lemon | produce | 110732 | 0.6961 | 33648 | 7.9196 |
| 7 | strawberries | produce | 103918 | 0.6991 | 31274 | 7.1101 |
| 8 | limes | produce | 100932 | 0.6787 | 32429 | 8.5485 |
| 9 | organic raspberries | produce | 99192 | 0.7678 | 23036 | 7.2162 |
# Derive user features
user_order_enriched = prior_op.merge(data['orders'][['order_id','order_number','days_since_prior_order']], on = 'order_id', how = 'left', suffixes = ('','_o'))
user_stats = user_order_enriched.groupby('user_id').agg(total_orders = ('order_id', 'nunique'),
total_items_purchased = ('product_id', 'count'),
unique_products = ('product_id', 'nunique'),
avg_cart_size = ('order_id', lambda x: x.count() / x.nunique()),
reorder_rate = ('reordered', 'mean'),
avg_days_between = ('days_since_prior_order', 'mean'),
std_days_between = ('days_since_prior_order', 'std'),).reset_index()
user_stats['variety_ratio'] = (user_stats['unique_products'] / user_stats['total_items_purchased'])
print(f'User features shape: {user_stats.shape}')
print('\nUser statistics summary:')
display(user_stats[['total_orders','avg_cart_size','reorder_rate', 'avg_days_between','variety_ratio']].describe())
User features shape: (150000, 9)
User statistics summary:
| | total_orders | avg_cart_size | reorder_rate | avg_days_between | variety_ratio |
|---|---|---|---|---|---|
| count | 150000.0000 | 150000.0000 | 150000.0000 | 150000.0000 | 150000.0000 |
| mean | 15.5727 | 9.9431 | 0.4320 | 15.4821 | 0.5680 |
| std | 16.6764 | 5.8516 | 0.2121 | 7.2099 | 0.2121 |
| min | 3.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0105 |
| 25% | 5.0000 | 5.7500 | 0.2676 | 9.5582 | 0.4046 |
| 50% | 9.0000 | 8.9333 | 0.4286 | 14.7143 | 0.5714 |
| 75% | 19.0000 | 13.0000 | 0.5954 | 20.7500 | 0.7324 |
| max | 99.0000 | 70.2500 | 0.9895 | 30.0000 | 1.0000 |
# Find user/product interactions
up_stats = user_order_enriched.groupby(['user_id','product_id']).agg(times_ordered = ('order_id', 'count'),
times_reordered = ('reordered', 'sum'),
avg_cart_position = ('add_to_cart_order', 'mean'),
last_order_num = ('order_number', 'max')).reset_index()
user_max_order = (data['orders'][data['orders']['eval_set'] == 'prior']
                  .groupby('user_id')['order_number']
                  .max()
                  .reset_index()
                  .rename(columns={'order_number': 'max_order_num'}))
up_stats = up_stats.merge(user_max_order, on='user_id', how='left')
up_stats['orders_since_last'] = up_stats['max_order_num'] - up_stats['last_order_num']
up_stats['reorder_rate'] = up_stats['times_reordered'] / up_stats['times_ordered'].clip(lower=1)
# Recency-weighted score: high orders + high reorder rate + low gap
up_stats['reorder_score'] = (up_stats['times_ordered'] * up_stats['reorder_rate']) / (up_stats['orders_since_last'] + 1)
print(f'User-product interaction features: {up_stats.shape}')
print(f'Avg interactions per user: {len(up_stats)/up_stats["user_id"].nunique():.1f} products')
up_stats.head()
User-product interaction features: (9657625, 10)
Avg interactions per user: 64.4 products
| | user_id | product_id | times_ordered | times_reordered | avg_cart_position | last_order_num | max_order_num | orders_since_last | reorder_rate | reorder_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 196 | 10 | 9 | 1.4000 | 10 | 10 | 0 | 0.9000 | 9.0000 |
| 1 | 1 | 10258 | 9 | 8 | 3.3333 | 10 | 10 | 0 | 0.8889 | 8.0000 |
| 2 | 1 | 10326 | 1 | 0 | 5.0000 | 5 | 10 | 5 | 0.0000 | 0.0000 |
| 3 | 1 | 12427 | 10 | 9 | 3.3000 | 10 | 10 | 0 | 0.9000 | 9.0000 |
| 4 | 1 | 13032 | 3 | 2 | 6.3333 | 10 | 10 | 0 | 0.6667 | 2.0000 |
7. Association Rule Mining (Market Basket Analysis)
We apply the Apriori algorithm to a stratified sample of prior orders to discover which products are frequently purchased together.
Key metrics:
- Support – fraction of baskets containing the itemset; a measure of how common the pair is
- Confidence – P(B | A), the probability B is bought given A is already in the basket
- Lift – confidence ÷ P(B alone); values > 1 indicate a genuine positive association beyond chance
Sampling strategy:
1. Draw 50,000 users from the large prior set
2. Use only each user's most recent prior order (most predictive of future behaviour)
3. Restrict the product space to the top-5,000 items to keep the boolean matrix tractable
4. Set min_support = 0.01 (at least 1% of sample baskets) so item pairs are discoverable
5. Retain only rules with lift > 1.0, i.e. positive associations only
The resulting rule dictionaries (rule_lookup, rule_single) are used directly by Strategy 3, the Association Rules recommender.
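To make support, confidence, and lift concrete, here they are computed by hand on four toy baskets (illustrative data, not drawn from Instacart):

```python
baskets = [
    {'lemon', 'limes', 'banana'},
    {'lemon', 'limes'},
    {'banana'},
    {'lemon', 'banana'},
]
n = len(baskets)

# Support of {lemon, limes}: fraction of baskets containing both.
support = sum('lemon' in b and 'limes' in b for b in baskets) / n   # 2/4 = 0.5
# Confidence of lemon -> limes: P(limes | lemon).
conf = support / (sum('lemon' in b for b in baskets) / n)           # 0.5 / 0.75
# Lift: confidence / P(limes); > 1 means a positive association.
lift = conf / (sum('limes' in b for b in baskets) / n)              # (2/3) / 0.5

print(f'support={support:.2f}  confidence={conf:.2f}  lift={lift:.2f}')
# support=0.50  confidence=0.67  lift=1.33
```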
# Get a sample, increase for more comprehensive rules (takes longer)
rng = np.random.default_rng(100)
sample_uids = rng.choice(prior_op['user_id'].unique(), size = 50000, replace = False)
# Use the most recent prior order per sampled user (most predictive of future)
recent_prior_orders = (data['orders'][(data['orders']['user_id'].isin(sample_uids)) & (data['orders']['eval_set'] == 'prior')].sort_values(['user_id', 'order_number']).groupby('user_id').last().reset_index()[['user_id', 'order_id']])
sample_op = prior_op[prior_op['order_id'].isin(recent_prior_orders['order_id'])]
print(f'Sample baskets: {sample_op["order_id"].nunique():,}')
print(f'Avg basket size: {sample_op.groupby("order_id")["product"].count().mean():.1f} items')
print(f'Total unique products: {sample_op["product"].nunique():,}')
Sample baskets: 50,000
Avg basket size: 10.4 items
Total unique products: 30,430
# Reduce the product space by filtering the top products
top_products = (sample_op['product'].value_counts().head(5000).index.tolist())
top_product_set = set(top_products)
# Filter each basket to only include top-N products
filtered_baskets = (sample_op[sample_op['product'].isin(top_product_set)].groupby('order_id')['product'].apply(list).tolist())
# Drop baskets that now have fewer than 2 items (no pairs possible)
filtered_baskets = [b for b in filtered_baskets if len(b) >= 2]
print(f'Baskets after filtering: {len(filtered_baskets):,}')
print(f'Avg basket size (top-N only): {np.mean([len(b) for b in filtered_baskets]):.1f} items')
print(f'Product space: {len(top_products):,} items')
print(f'\nMin support threshold (1%): {int(0.01 * len(filtered_baskets)):,} baskets required')
Baskets after filtering: 45,405
Avg basket size (top-N only): 9.1 items
Product space: 5,000 items
Min support threshold (1%): 454 baskets required
# Perform apriori
te = TransactionEncoder()
te_array = te.fit_transform(filtered_baskets)
basket_df = pd.DataFrame(te_array, columns = te.columns_)
# Get frequencies
frequent_itemsets = apriori(basket_df, min_support=0.01, use_colnames=True, max_len=2)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
# Mine the rules
rules = association_rules(frequent_itemsets, metric = 'confidence', min_threshold = 0.05)
rules = rules[rules['lift'] > 1.0].sort_values('lift', ascending = False).reset_index(drop = True)
# Summarize results
print(f'Frequent itemsets: {len(frequent_itemsets):,}')
print(f' Singletons: {(frequent_itemsets["length"]==1).sum()}')
print(f' Pairs: {(frequent_itemsets["length"]==2).sum()}')
print(f'\nAssociation rules (lift > 1): {len(rules):,}')
print('\nTop 20 rules by lift:')
display(rules[['antecedents','consequents','support','confidence','lift']].head(20).assign(antecedents=lambda df: df['antecedents'].apply(lambda x: list(x)[0]), consequents = lambda df: df['consequents'].apply(lambda x: list(x)[0])))
Frequent itemsets: 134
Singletons: 115
Pairs: 19
Association rules (lift > 1): 38
Top 20 rules by lift:
| | antecedents | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 0 | large lemon | limes | 0.0115 | 0.1746 | 3.6959 |
| 1 | limes | large lemon | 0.0115 | 0.2429 | 3.6959 |
| 2 | organic raspberries | organic strawberries | 0.0121 | 0.2622 | 3.1958 |
| 3 | organic strawberries | organic raspberries | 0.0121 | 0.1474 | 3.1958 |
| 4 | bag of organic bananas | organic hass avocado | 0.0198 | 0.1536 | 2.5129 |
| 5 | organic hass avocado | bag of organic bananas | 0.0198 | 0.3242 | 2.5129 |
| 6 | organic fuji apple | banana | 0.0105 | 0.3856 | 2.4618 |
| 7 | banana | organic fuji apple | 0.0105 | 0.0671 | 2.4618 |
| 8 | organic raspberries | bag of organic bananas | 0.0143 | 0.3095 | 2.3986 |
| 9 | bag of organic bananas | organic raspberries | 0.0143 | 0.1106 | 2.3986 |
| 10 | organic hass avocado | organic strawberries | 0.0117 | 0.1913 | 2.3316 |
| 11 | organic strawberries | organic hass avocado | 0.0117 | 0.1426 | 2.3316 |
| 12 | banana | honeycrisp apple | 0.0102 | 0.0648 | 2.3303 |
| 13 | honeycrisp apple | banana | 0.0102 | 0.3650 | 2.3303 |
| 14 | organic baby spinach | organic avocado | 0.0117 | 0.1406 | 2.2902 |
| 15 | organic avocado | organic baby spinach | 0.0117 | 0.1905 | 2.2902 |
| 16 | organic strawberries | bag of organic bananas | 0.0231 | 0.2813 | 2.1807 |
| 17 | bag of organic bananas | organic strawberries | 0.0231 | 0.1789 | 2.1807 |
| 18 | organic hass avocado | organic baby spinach | 0.0105 | 0.1718 | 2.0662 |
| 19 | organic baby spinach | organic hass avocado | 0.0105 | 0.1263 | 2.0662 |
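The three rule metrics are algebraically linked, so the table above can be cross-checked by hand. A small sanity check on the top rule (the reported values are rounded, so expect tiny discrepancies):

```python
# Cross-check the top rule (large lemon -> limes) from the table above.
support_ab  = 0.0115    # support({large lemon, limes})
conf_a_to_b = 0.1746    # confidence(large lemon -> limes)
lift_ab     = 3.6959    # lift (symmetric in both directions)

support_a   = support_ab / conf_a_to_b   # confidence = support(A∪B) / support(A)
support_b   = conf_a_to_b / lift_ab      # lift = confidence / support(B)
conf_b_to_a = support_ab / support_b     # confidence of the reverse rule

print(f"support(large lemon)       ~ {support_a:.4f}")
print(f"support(limes)             ~ {support_b:.4f}")
print(f"conf(limes -> large lemon) ~ {conf_b_to_a:.4f}  (table: 0.2429)")
```

The recovered reverse-rule confidence (~0.2434) matches the tabulated 0.2429 up to rounding of the inputs.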
# Visualize association rules
fig, axes = plt.subplots(1, 2, figsize=(20, 6))
# Scatter: support × confidence, coloured by lift
sc = axes[0].scatter(rules['support'], rules['confidence'], c = rules['lift'], cmap = 'YlOrRd', alpha = 0.7, s = 40, edgecolors = 'gray', linewidth = 0.3)
plt.colorbar(sc, ax = axes[0], label = 'Lift')
axes[0].set_xlabel('Support'); axes[0].set_ylabel('Confidence')
axes[0].set_title('Support × Confidence (colour = Lift)', fontsize=12, fontweight='bold')
# Top 15 rules by lift β horizontal bar
top15 = rules.head(15).copy()
top15['rule_label'] = top15.apply(lambda r: f"{list(r['antecedents'])[0][:22]} → {list(r['consequents'])[0][:22]}", axis=1)
bar_colors = sns.color_palette('YlOrRd', 15)
axes[1].barh(top15['rule_label'][::-1], top15['lift'][::-1], color=bar_colors)
axes[1].axvline(1.0, color='gray', linestyle='--', alpha=0.6, label='Lift = 1 (independent)')
axes[1].set_xlabel('Lift'); axes[1].set_title('Top 15 Rules by Lift', fontsize=12, fontweight='bold')
axes[1].legend(); axes[1].tick_params(axis='y', labelsize=8)
plt.suptitle('Market Basket – Association Rules', fontsize=15, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
# Build rule lookup dicts (used by prediction model): ant_frozenset → [(confidence, lift, consequent_name)]
rule_lookup = defaultdict(list)
for _, row in rules.iterrows():
ant = frozenset(row['antecedents'])
con = list(row['consequents'])[0]
rule_lookup[ant].append((row['confidence'], row['lift'], con))
# single item → [(conf, lift, consequent, full_antecedent)]
rule_single = defaultdict(list)
for ant, recs in rule_lookup.items():
for item in ant:
for conf, lift, con in recs:
rule_single[item].append((conf, lift, con, ant))
# Helper lookup dicts
pid_to_name = dict(zip(products['product_id'], products['product']))
name_to_pid = dict(zip(products['product'], products['product_id']))
pid_to_dept = dict(zip(products['product_id'], products['department']))
print(f'Rule lookup entries: {len(rule_lookup)}')
print(f'Single-item lookup: {len(rule_single)} distinct items have rules')
Rule lookup entries: 12
Single-item lookup: 12 distinct items have rules
8. Next-Item Prediction Strategies
Five strategies recommend products as the user builds their cart, each progressively richer in how it uses purchase history and the sequence of items already added:
| # | Strategy | Cart-Order Signal | ML? | Cold-start fallback |
|---|---|---|---|---|
| S1 | Global Popularity | – | No | – (is the fallback) |
| S2 | Personal History | – | No | Falls back to S1 |
| S3 | Association Rules | Cart contents | No | Falls back to S2 |
| S4 | Sequential Markov Chain | Transition + position probs | No | Falls back to S2 |
| S5 | LightGBM (cart-aware) | Markov + position features | ✓ | Falls back to S1 |
All strategies share the same interface:
fn(user_id, cart_sequence, K=10) → [product_id, ...]
where cart_sequence is the ordered list of product_ids added to cart so far. This uniform signature makes it trivial to swap strategies in the evaluation harness and the final demo.
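A minimal sketch of what this shared interface looks like (the strategy body and its candidate pool here are hypothetical placeholders, not one of the five real strategies):

```python
# Hypothetical stub illustrating the shared strategy signature: any callable
# of this shape can be dropped into the evaluation harness unchanged.
def dummy_strategy(user_id, cart_sequence, K=10, **_):
    """Recommend K placeholder product_ids not already in the cart."""
    in_cart = set(cart_sequence)
    pool = range(1, 1000)                       # stand-in candidate pool
    return [p for p in pool if p not in in_cart][:K]

print(dummy_strategy(user_id=42, cart_sequence=[1, 2, 3], K=5))  # -> [4, 5, 6, 7, 8]
```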
# Develop the ground truth for model scoring
train_truth = (train_op.groupby('user_id')['product_id'].apply(set).reset_index().rename(columns = {'product_id': 'true_items'}))
# Simulate the "current shopping session" using the LAST prior order per user.
last_prior_oids = (orders_slim[orders_slim['eval_set'] == 'prior'].sort_values(['user_id', 'order_number']).groupby('user_id')['order_id'].last().reset_index().rename(columns = {'order_id': 'last_oid'}))
last_prior_seq = (prior_op[prior_op['order_id'].isin(last_prior_oids['last_oid'])].sort_values(['order_id', 'add_to_cart_order']).groupby('order_id')['product_id'].apply(list).reset_index().rename(columns = {'order_id': 'last_oid', 'product_id': 'cart_sequence'}))
eval_df = (train_truth.merge(last_prior_oids, on = 'user_id').merge(last_prior_seq, on = 'last_oid', how = 'inner'))
# Reproducible subsample for fast evaluation across all strategies
eval_sample = eval_df.sample(n = min(3000, len(eval_df)), random_state = 100).reset_index(drop = True)
print(f"Evaluation set : {len(eval_df):,} users")
print(f"Eval sample : {len(eval_sample):,} users")
print(f"Avg true basket : {eval_sample['true_items'].apply(len).mean():.1f} items")
print(f"Avg cart length : {eval_sample['cart_sequence'].apply(len).mean():.1f} items")
eval_sample[['user_id','cart_sequence','true_items']].head(3)
Evaluation set : 95,361 users
Eval sample : 3,000 users
Avg true basket : 10.5 items
Avg cart length : 10.6 items
| user_id | cart_sequence | true_items | |
|---|---|---|---|
| 0 | 15188 | [34915, 47766, 43183, 35406, 49054, 39993, 457... | {46656, 34915, 46724, 264, 41290, 35406, 5296,... |
| 1 | 23617 | [41177] | {42585, 18740} |
| 2 | 13611 | [38895, 24852, 9076, 16617, 33294, 45007, 2784... | {17794, 44359, 2380, 40174, 38895, 1359, 47601... |
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 1 – Global Popularity Baseline
# ─────────────────────────────────────────────────────────────────────────────
# Recommend the most-ordered products globally. No personalisation, no cart
# context – pure frequency signal. Serves as the floor baseline.
POPULARITY_POOL = (product_stats.sort_values('order_count', ascending=False)['product_id'].head(300).tolist())
def s1_popularity(user_id, cart_sequence, K = 10, **_):
"""Top-K globally popular products not already in the cart."""
in_cart = set(cart_sequence)
return [p for p in POPULARITY_POOL if p not in in_cart][:K]
print("✅ S1 (Global Popularity) ready")
print(f" Pool: top {len(POPULARITY_POOL)} products")
print(f" Sample recs: {[pid_to_name.get(p,'?')[:25] for p in s1_popularity(0, [], K=5)]}")
✅ S1 (Global Popularity) ready
Pool: top 300 products
Sample recs: ['banana', 'bag of organic bananas', 'organic strawberries', 'organic baby spinach', 'organic hass avocado']
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 2 – Personal History
# ─────────────────────────────────────────────────────────────────────────────
# Rank the user's previously purchased products by reorder_score
# (frequency Γ reorder-rate / recency gap), then return the top-K
# not yet in the current cart.
user_history = (up_stats.sort_values(['user_id', 'reorder_score'], ascending=[True, False]).groupby('user_id')['product_id'].apply(list).to_dict())
def s2_personal(user_id, cart_sequence, K = 10, **_):
"""Top-K products from user's personal purchase history by reorder_score."""
in_cart = set(cart_sequence)
return [p for p in user_history.get(user_id, []) if p not in in_cart][:K]
coverage = sum(1 for uid in eval_sample['user_id'] if uid in user_history)
print("✅ S2 (Personal History) ready")
print(f" History coverage : {coverage}/{len(eval_sample)} ({coverage/len(eval_sample):.1%})")
_u0 = eval_sample['user_id'].iloc[0]
_c0 = eval_sample['cart_sequence'].iloc[0]
print(f" Sample recs (user {_u0}): {[pid_to_name.get(p,'?')[:22] for p in s2_personal(_u0,_c0,K=3)]}")
✅ S2 (Personal History) ready
History coverage : 3000/3000 (100.0%)
Sample recs (user 15188): ['sweet red grape tomato', 'banana', 'whole wheat bread']
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 3 – Association Rules (cart-content-aware)
# ─────────────────────────────────────────────────────────────────────────────
# Every item currently in the cart fires its association rules.
# Consequents are scored by the SUM of (confidence × lift) across all
# triggering cart items, then ranked. Falls back to personal history
# when the rule set yields fewer than K candidates.
# Build a fast name-keyed lookup: item_name β {consequent_name: agg_score}
assoc_lookup = defaultdict(lambda: defaultdict(float))
for ant_name, recs in rule_single.items():
for conf, lift, con_name, _ in recs:
assoc_lookup[ant_name][con_name] += conf * lift
def s3_rules(user_id, cart_sequence, K=10, **_):
"""Score candidates via association rules fired by the current cart."""
in_cart_ids = set(cart_sequence)
in_cart_names = {pid_to_name.get(p, '') for p in cart_sequence}
scores = defaultdict(float)
for name in in_cart_names:
for con_name, score in assoc_lookup.get(name, {}).items():
pid = name_to_pid.get(con_name)
if pid and pid not in in_cart_ids:
scores[pid] += score
ranked = sorted(scores, key=scores.get, reverse=True)
if len(ranked) >= K:
return ranked[:K]
# Backfill with personal history when rules are sparse
extra = [p for p in s2_personal(user_id, cart_sequence, K * 3) if p not in set(ranked)]
return (ranked + extra)[:K]
print("✅ S3 (Association Rules) ready")
print(f" Assoc lookup items : {len(assoc_lookup)}")
print(f" Sample recs : {[pid_to_name.get(p,'?')[:22] for p in s3_rules(_u0,_c0,K = 3)]}")
✅ S3 (Association Rules) ready
Assoc lookup items : 12
Sample recs : ['banana', 'organic baby spinach', 'sweet red grape tomato']
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 4 – Sequential Markov Chain (O(N) single-pass, no loops)
# ─────────────────────────────────────────────────────────────────────────────
MAX_POS = 20
MAX_SUCCESSORS = 100
MAX_POS_ITEMS = 200
MAX_USER_HIST = 50
print("Building Markov transition matrix…")
t0 = time.time()
# ── Step 1: extract + sort ──────────────────────────────────────────────────
arr = prior_op[['order_id','product_id','add_to_cart_order']].to_numpy(dtype = 'int32')
arr = arr[np.lexsort((arr[:,2], arr[:,0]))]
oids = arr[:,0]; pids = arr[:,1]; pos0 = arr[:,2] - 1
del arr; gc.collect()
# ── Step 2: transition pairs ────────────────────────────────────────────────
same = oids[:-1] == oids[1:]
prev_p = pids[:-1][same].astype('int32')
next_p = pids[1: ][same].astype('int32')
del same; gc.collect()
# Sort pairs so identical (prev,next) rows are adjacent → count with diff
order = np.lexsort((next_p, prev_p))
prev_s = prev_p[order]; next_s = next_p[order]
del prev_p, next_p, order; gc.collect()
# Mark run starts once in O(N), then reuse the mask for both the run counts
# and the unique (prev,next) pairs – no per-node mask scan
keep = np.concatenate([[True], (prev_s[1:] != prev_s[:-1]) | (next_s[1:] != next_s[:-1])])
starts = np.flatnonzero(keep)
# Run lengths of identical (prev,next) pairs
pair_counts = np.diff(np.concatenate([starts, [len(prev_s)]]))
# Unique (prev,next) pairs
u_prev = prev_s[keep]; u_next = next_s[keep]
del prev_s, next_s, keep, starts; gc.collect()
# Group boundaries on the unique pairs
grp_bounds = np.flatnonzero(np.diff(u_prev)) + 1
grp_bounds = np.concatenate([[0], grp_bounds, [len(u_prev)]])
trans_prob = {}
for i in range(len(grp_bounds) - 1):
lo, hi = grp_bounds[i], grp_bounds[i+1]
node = int(u_prev[lo])
c = pair_counts[lo:hi]; t = u_next[lo:hi]
if len(c) > MAX_SUCCESSORS:
top = np.argpartition(c, -MAX_SUCCESSORS)[-MAX_SUCCESSORS:]
c, t = c[top], t[top]
total = c.sum()
trans_prob[node] = {int(ti): float(ci/total) for ti, ci in zip(t, c)}
del u_prev, u_next, pair_counts, grp_bounds; gc.collect()
print(f" trans_prob done ({len(trans_prob):,} nodes) {time.time()-t0:.1f}s")
# ── Step 3: position-popularity ─────────────────────────────────────────────
mask = pos0 < MAX_POS
p_pos = pos0[mask].astype('int32'); p_pid = pids[mask].astype('int32')
del oids, pids, pos0, mask; gc.collect()
order2 = np.lexsort((p_pid, p_pos))
pp_s = p_pos[order2]; pid_s = p_pid[order2]
del p_pos, p_pid, order2; gc.collect()
keep2 = np.concatenate([[True], (pp_s[1:] != pp_s[:-1]) | (pid_s[1:] != pid_s[:-1])])
pc = np.diff(np.concatenate([np.flatnonzero(keep2), [len(pp_s)]]))
u_pos = pp_s[keep2]; u_pid = pid_s[keep2]
del pp_s, pid_s, keep2; gc.collect()
grp2 = np.flatnonzero(np.diff(u_pos)) + 1
grp2 = np.concatenate([[0], grp2, [len(u_pos)]])
pos_prob = {}
for i in range(len(grp2) - 1):
lo, hi = grp2[i], grp2[i+1]
pos = int(u_pos[lo])
c = pc[lo:hi]; t = u_pid[lo:hi]
if len(c) > MAX_POS_ITEMS:
top = np.argpartition(c, -MAX_POS_ITEMS)[-MAX_POS_ITEMS:]
c, t = c[top], t[top]
total = c.sum()
pos_prob[pos] = {int(ti): float(ci/total) for ti, ci in zip(t, c)}
del u_pos, u_pid, pc, grp2; gc.collect()
# ── Step 4: per-user reorder-score list (capped) ────────────────────────────
user_rs_list = (up_stats
.sort_values('reorder_score', ascending=False)
.groupby('user_id').head(MAX_USER_HIST)
.groupby('user_id')
.apply(lambda g: list(zip(g['product_id'].tolist(),
g['reorder_score'].tolist())))
.to_dict())
print(f"✅ Markov chain built in {time.time()-t0:.1f}s")
print(f" Source nodes : {len(trans_prob):,} products")
print(f" Avg successors per node : {np.mean([len(v) for v in trans_prob.values()]):.1f}")
print(f" Positions tracked : 0 → {MAX_POS-1}")
Building Markov transition matrix…
trans_prob done (49,411 nodes) 24.8s
✅ Markov chain built in 67.9s
Source nodes : 49,411 products
Avg successors per node : 48.3
Positions tracked : 0 → 19
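The sort-then-diff counting trick used in Step 2 is easier to see on a toy example (`prev_p`/`next_p` below are made-up product ids, not Instacart data): sort the (prev, next) pairs, mark where each run of identical pairs starts, and take diffs of the start indices to get counts without any Python loop over nodes.

```python
import numpy as np

# Toy (prev, next) transition pairs: 10->11 occurs 3x, 10->12 once, 20->21 twice
prev_p = np.array([10, 10, 10, 20, 20, 10])
next_p = np.array([11, 12, 11, 21, 21, 11])

# Sort so identical pairs become adjacent (primary key prev, secondary next)
order = np.lexsort((next_p, prev_p))
prev_s, next_s = prev_p[order], next_p[order]

# True wherever a new run of identical (prev, next) pairs begins
starts = np.concatenate([[True], (prev_s[1:] != prev_s[:-1]) | (next_s[1:] != next_s[:-1])])
# Run lengths = gaps between consecutive start indices
counts = np.diff(np.concatenate([np.flatnonzero(starts), [len(prev_s)]]))

pairs = [(int(a), int(b), int(c)) for a, b, c in zip(prev_s[starts], next_s[starts], counts)]
print(pairs)   # -> [(10, 11, 3), (10, 12, 1), (20, 21, 2)]
```

Dividing each count by its group total then yields the per-node transition probabilities stored in `trans_prob`.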
# Strategy 4 – Sequential Markov Chain inference function
def s4_markov(user_id, cart_sequence, K = 10, n_gram = 2, alpha = 0.4, beta = 0.25, **_):
"""
Sequential Markov Chain recommender.
    Score(X) = Σ P(next=X | prev ∈ cart[-n_gram:])           [Markov]
             + α · P(X appears at cart position len(cart))   [position]
             + β · reorder_score(user, X)                    [personal]
Parameters
----------
cart_sequence : list of product_ids in add-to-cart order
n_gram : how many trailing items to use as context (default 2)
alpha : weight for the position-popularity signal
beta : weight for the user's personal reorder score
"""
in_cart = set(cart_sequence)
next_pos = len(cart_sequence)
context = cart_sequence[-n_gram:] if len(cart_sequence) >= n_gram else cart_sequence
scores = defaultdict(float)
# (A) Markov transition scores from the last n_gram cart items
for prev in context:
for nxt, prob in trans_prob.get(prev, {}).items():
if nxt not in in_cart:
scores[nxt] += prob
# (B) Position-specific popularity at the *next* cart slot
capped_pos = min(next_pos, MAX_POS - 1)
for pid, prob in pos_prob.get(capped_pos, {}).items():
if pid not in in_cart:
scores[pid] += alpha * prob
# (C) Blend personal purchase affinity
for pid, rs in user_rs_list.get(user_id, []):
if pid not in in_cart and rs > 0:
scores[pid] += beta * float(rs)
ranked = sorted(scores, key=scores.get, reverse=True)
if len(ranked) >= K:
return ranked[:K]
extra = [p for p in s2_personal(user_id, cart_sequence, K * 2) if p not in set(ranked)]
return (ranked + extra)[:K]
# Apply function
print("✅ S4 (Sequential Markov Chain) ready")
recs4 = s4_markov(_u0, _c0, K=3)
print(f" Sample recs : {[pid_to_name.get(p,'?')[:22] for p in recs4]}")
print()
print("Position signal demo – top 3 products per cart slot:")
print(f"{'Pos':>4} {'#1 product':<30} {'#2 product':<30} {'#3 product':<30}")
print("─" * 97)
for pos in range(6):
slot = pos_prob.get(pos, {})
top3 = sorted(slot, key=slot.get, reverse=True)[:3]
cols = [f"{pid_to_name.get(p,'?')[:26]} ({slot[p]:.2%})" for p in top3]
while len(cols) < 3:
cols.append("β")
print(f"{pos:>4} {cols[0]:<30} {cols[1]:<30} {cols[2]:<30}")
✅ S4 (Sequential Markov Chain) ready
Sample recs : ['sweet red grape tomato', 'banana', 'whole wheat bread']
Position signal demo – top 3 products per cart slot:
Pos #1 product #2 product #3 product
─────────────────────────────────────────────────────────────────────────────────────────────────
0 banana (9.46%) bag of organic bananas (6.78%) organic whole milk (2.59%)
1 banana (7.48%) bag of organic bananas (5.99%) organic strawberries (2.78%)
2 banana (6.07%) bag of organic bananas (5.05%) organic strawberries (2.89%)
3 banana (5.04%) bag of organic bananas (4.25%) organic strawberries (2.91%)
4 banana (4.33%) bag of organic bananas (3.71%) organic strawberries (2.87%)
5 banana (3.95%) bag of organic bananas (3.29%) organic strawberries (2.77%)
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 5 – LightGBM feature matrix (chunked, bounded candidate set)
# ─────────────────────────────────────────────────────────────────────────────
# Why previous versions OOM'd:
#   • Full candidate matrix (8000 users × 300+ candidates) = 2.4M rows
#     before a single feature column is added – too large to hold in RAM
#     alongside prior_op, trans_prob, etc.
#
# Fix – two constraints that bound memory absolutely:
#   1. CHUNK_SIZE users processed at a time; each chunk is del'd before next
#   2. Candidate pool capped: top MAX_PERSONAL history + top MAX_GLOBAL pool
#      → at most (MAX_PERSONAL + MAX_GLOBAL) rows per user, guaranteed
ML_N_USERS   = 3000   # reduce if still tight; 3k is plenty for LightGBM
CHUNK_SIZE   = 200    # users per chunk – peak RAM ≈ 200 × 150 × ~20 cols
MAX_PERSONAL = 80     # top items from user history by reorder_score
MAX_GLOBAL   = 80     # top globally popular items
# → max ~160 candidates per user, ~32k rows per chunk
ml_train_users = eval_df['user_id'].sample(n=ML_N_USERS, random_state=42).tolist()
ml_train_df = (eval_df[eval_df['user_id'].isin(ml_train_users)]
.reset_index(drop=True))
# Static lookups built once, outside the loop
prod_feat = product_stats[['product_id','order_count','reorder_rate',
                           'avg_cart_position','global_popularity']].copy()  # copy: avoid SettingWithCopyWarning
prod_feat['pop_rank'] = prod_feat['order_count'].rank(pct = True).astype('float32')
user_feat = user_stats[['user_id','total_orders','avg_cart_size',
                        'reorder_rate','avg_days_between','variety_ratio']]
up_feat = up_stats[['user_id','product_id','times_ordered','reorder_score',
                    'orders_since_last','avg_cart_position','reorder_rate']].copy()
up_feat.rename(columns={'avg_cart_position':'up_avg_cart_pos',
                        'reorder_rate':'up_reorder_rate'}, inplace=True)
# Cap global pool
GLOBAL_POOL = POPULARITY_POOL[:MAX_GLOBAL]
# Pre-convert assoc_lookup to pid-keyed dict once (avoids repeated name lookups)
assoc_by_pid = defaultdict(dict)
for ant_name, cons in assoc_lookup.items():
ant_pid = name_to_pid.get(ant_name)
if not ant_pid:
continue
for con_name, score in cons.items():
con_pid = name_to_pid.get(con_name)
if con_pid:
assoc_by_pid[ant_pid][con_pid] = (assoc_by_pid[ant_pid].get(con_pid, 0.0) + score)
print(f"Users: {ML_N_USERS} | chunk: {CHUNK_SIZE} | "
f"max candidates/user: {MAX_PERSONAL + MAX_GLOBAL}")
print(f"Expected rows: ~{ML_N_USERS * (MAX_PERSONAL + MAX_GLOBAL) // 2:,}")
t0 = time.time()
chunks = []
for chunk_start in range(0, len(ml_train_df), CHUNK_SIZE):
chunk = ml_train_df.iloc[chunk_start : chunk_start + CHUNK_SIZE]
rows = []
for _, row in chunk.iterrows():
uid = int(row['user_id'])
true_set = row['true_items']
cart_seq = row['cart_sequence']
in_cart = set(cart_seq)
cart_sz = len(cart_seq)
# Bounded candidate set
personal = [p for p in user_history.get(uid, [])[:MAX_PERSONAL]
if p not in in_cart]
globl = [p for p in GLOBAL_POOL if p not in in_cart and p not in set(personal)]
candidates = personal + globl
if not candidates:
continue
# Cart-context scores β computed once per user, not per candidate
markov_sc = defaultdict(float)
for prev in cart_seq:
for nxt, p in trans_prob.get(prev, {}).items():
markov_sc[nxt] += p
assoc_sc = defaultdict(float)
for ant_pid in cart_seq:
for con_pid, sc in assoc_by_pid.get(ant_pid, {}).items():
assoc_sc[con_pid] += sc
capped_pos = min(cart_sz, MAX_POS - 1)
pos_p = pos_prob.get(capped_pos, {})
for pid in candidates:
rows.append((uid, pid,
int(pid in true_set),
cart_sz,
markov_sc.get(pid, 0.0),
assoc_sc.get(pid, 0.0),
pos_p.get(pid, 0.0)))
if not rows:
continue
c_df = pd.DataFrame(rows, columns=['user_id','product_id','label',
'cart_size','markov_score',
'assoc_score','pos_prob_next'])
c_df = (c_df
.merge(prod_feat, on='product_id', how='left')
.merge(user_feat, on='user_id', how='left')
.merge(up_feat, on=['user_id','product_id'], how='left')
.fillna(0))
chunks.append(c_df)
del c_df, rows
gc.collect()
if (chunk_start // CHUNK_SIZE) % 5 == 0:
done = min(chunk_start + CHUNK_SIZE, len(ml_train_df))
        print(f" {done}/{len(ml_train_df)} users …", end="")
cand_df = pd.concat(chunks, ignore_index=True)
del chunks
gc.collect()
pos_rate = cand_df['label'].mean()
print(f"\nDone in {time.time()-t0:.1f}s | rows: {len(cand_df):,} "
f"| positive rate: {pos_rate:.3%}")
cand_df[['user_id','product_id','label','markov_score','assoc_score',
'pos_prob_next','cart_size','reorder_score']].head(5)
Users: 3000 | chunk: 200 | max candidates/user: 160
Expected rows: ~240,000
200/3000 users … 1200/3000 users … 2200/3000 users …
Done in 80.3s | rows: 333,300 | positive rate: 3.461%
| user_id | product_id | label | markov_score | assoc_score | pos_prob_next | cart_size | reorder_score | |
|---|---|---|---|---|---|---|---|---|
| 0 | 144 | 42342 | 0 | 0.0394 | 0.0000 | 0.0000 | 6 | 0.3333 |
| 1 | 144 | 13603 | 0 | 0.0000 | 0.0000 | 0.0000 | 6 | 0.1667 |
| 2 | 144 | 22963 | 0 | 0.0752 | 0.0000 | 0.0030 | 6 | 0.0000 |
| 3 | 144 | 29871 | 0 | 0.0233 | 0.0000 | 0.0000 | 6 | 0.0000 |
| 4 | 144 | 32887 | 0 | 0.0000 | 0.0000 | 0.0000 | 6 | 0.0000 |
# Define feature columns and train LightGBM with group-aware train/val split
FEATURE_COLS = [
    'cart_size', 'markov_score', 'assoc_score', 'pos_prob_next',
    'order_count', 'reorder_rate', 'avg_cart_position', 'global_popularity',
    'pop_rank', 'total_orders', 'avg_cart_size', 'reorder_rate_user',
    'avg_days_between', 'variety_ratio', 'times_ordered', 'reorder_score',
    'orders_since_last', 'up_avg_cart_pos', 'up_reorder_rate',
]
# Some column names duplicated (reorder_rate from product + user) – rename
# defensively. Note pandas merge already suffixes clashing columns with _x/_y,
# in which case this loop is a no-op and the existence filter below applies.
cols = list(cand_df.columns)
seen = {}; new_cols = []
for c in cols:
if c in seen:
seen[c] += 1
new_cols.append(f"{c}_user" if c == "reorder_rate" else f"{c}_{seen[c]}")
else:
seen[c] = 0; new_cols.append(c)
cand_df.columns = new_cols
FEATURE_COLS = [c for c in FEATURE_COLS if c in cand_df.columns]
X = cand_df[FEATURE_COLS].values
y = cand_df['label'].values
groups = cand_df['user_id'].values
# Group-aware train/val split (no user leaks between folds)
gss = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
tr_idx, va_idx = next(gss.split(X, y, groups))
X_tr, y_tr = X[tr_idx], y[tr_idx]
X_va, y_va = X[va_idx], y[va_idx]
# Weight positives so the effective class ratio is ~4 negatives : 1 positive
pos_weight = max(1, int((y_tr == 0).sum() / max((y_tr == 1).sum(), 1) / 4))
lgb_model = lgb.LGBMClassifier(
n_estimators = 400,
learning_rate = 0.05,
num_leaves = 63,
max_depth = -1,
min_child_samples = 20,
subsample = 0.8,
colsample_bytree= 0.8,
scale_pos_weight= pos_weight,
n_jobs = -1,
random_state = 42,
verbose = -1,
)
print("Training LightGBM …")
t0 = time.time()
lgb_model.fit(X_tr, y_tr,
eval_set = [(X_va, y_va)],
callbacks = [lgb.early_stopping(30, verbose=False),
lgb.log_evaluation(50)])
print(f"Done in {time.time()-t0:.1f}s | best iteration: {lgb_model.best_iteration_}")
va_pred = lgb_model.predict_proba(X_va)[:, 1]
print(f"\nVal ROC-AUC : {roc_auc_score(y_va, va_pred):.4f}")
print(f"Val Avg-Prec : {average_precision_score(y_va, va_pred):.4f}")
# Feature importance
fi = pd.Series(lgb_model.feature_importances_, index=FEATURE_COLS).sort_values(ascending=False)
print("\nTop-10 feature importances:")
print(fi.head(10).to_string())
Training LightGBM …
Done in 1.9s | best iteration: 4
Val ROC-AUC : 0.8648
Val Avg-Prec : 0.2269
Top-10 feature importances:
orders_since_last 32
avg_cart_size 32
markov_score 26
variety_ratio 24
reorder_score 22
order_count 22
total_orders 21
cart_size 19
avg_cart_position 17
avg_days_between 12
# Strategy 5 – LightGBM inference function (mirrors training feature pipeline)
def s5_lgbm(user_id, cart_sequence, K=10, **_):
    """LightGBM recommender – mirrors the training feature pipeline exactly."""
in_cart = set(cart_sequence)
cart_sz = len(cart_sequence)
# Bounded candidates β same caps used at training time
personal = [p for p in user_history.get(user_id, [])[:MAX_PERSONAL] if p not in in_cart]
globl = [p for p in GLOBAL_POOL if p not in in_cart and p not in set(personal)]
candidates = personal + globl
if not candidates:
return s1_popularity(user_id, cart_sequence, K)
# Cart-context scores
markov_sc = defaultdict(float)
for prev in cart_sequence:
for nxt, p in trans_prob.get(prev, {}).items():
markov_sc[nxt] += p
assoc_sc = defaultdict(float) # use pid-keyed dict (matches training)
for ant_pid in cart_sequence:
for con_pid, sc in assoc_by_pid.get(ant_pid, {}).items():
assoc_sc[con_pid] += sc
pos_p = pos_prob.get(min(cart_sz, MAX_POS - 1), {})
rows = [{"user_id": user_id, "product_id": int(pid),
"cart_size": cart_sz,
"markov_score": float(markov_sc.get(pid, 0.0)),
"assoc_score": float(assoc_sc.get(pid, 0.0)),
"pos_prob_next": float(pos_p.get(pid, 0.0))}
for pid in candidates]
inf_df = (pd.DataFrame(rows)
.merge(prod_feat, on="product_id", how="left")
.merge(user_feat, on="user_id", how="left")
.merge(up_feat, on=["user_id","product_id"], how="left")
.fillna(0))
# Apply the same column rename as training
cols = list(inf_df.columns); seen = {}; new_cols = []
for c in cols:
if c in seen:
seen[c] += 1
new_cols.append(f"{c}_user" if c == "reorder_rate" else f"{c}_{seen[c]}")
else:
seen[c] = 0; new_cols.append(c)
inf_df.columns = new_cols
for fc in FEATURE_COLS: # guarantee column alignment
if fc not in inf_df.columns:
inf_df[fc] = 0.0
scores = lgb_model.predict_proba(inf_df[FEATURE_COLS].values)[:, 1]
inf_df["score"] = scores
return inf_df.nlargest(K, "score")["product_id"].tolist()
print("✅ S5 (LightGBM) ready")
recs5 = s5_lgbm(_u0, _c0, K=3)
print(f" Sample recs : {[pid_to_name.get(p,'?')[:25] for p in recs5]}")
✅ S5 (LightGBM) ready
Sample recs : ['banana', 'sweet red grape tomatoes', 'whole wheat bread']
9. Model Evaluation
Each strategy is scored on the same held-out eval_sample using five retrieval metrics at K = 5 and K = 10. The evaluation simulates a real shopping session: each user's last prior order serves as the current cart, and we check whether the strategy's top-K recommendations intersect with the user's actual train basket.
| Metric | Formula | What it captures |
|---|---|---|
| Precision\@K | \|recs ∩ true\| / K | Of K recommendations, what fraction are relevant |
| Recall\@K | \|recs ∩ true\| / \|true\| | What fraction of the true basket was surfaced |
| F1\@K | 2·P·R / (P+R) | Harmonic mean – balanced view of P and R |
| NDCG\@K | DCG / IDCG | Ranking quality – rewards relevant items appearing earlier in the list |
| Hit Rate\@K | 1 if any hit else 0 | Fraction of users who received ≥ 1 relevant recommendation |
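As a concrete example of the NDCG formula (computed exactly as the scorer defined in the next cell does it):

```python
import numpy as np

# Worked NDCG@3 example: recs = [A, B, C], true basket = {A, C}.
# Hits at ranks 1 and 3  -> DCG  = 1/log2(2) + 1/log2(4) = 1.5
# Ideal puts hits first  -> IDCG = 1/log2(2) + 1/log2(3)
recs, true_set, k = ["A", "B", "C"], {"A", "C"}, 3
dcg  = sum(1 / np.log2(i + 2) for i, p in enumerate(recs[:k]) if p in true_set)
idcg = sum(1 / np.log2(i + 2) for i in range(min(len(true_set), k)))
print(f"DCG={dcg:.4f}  IDCG={idcg:.4f}  NDCG={dcg/idcg:.4f}")  # NDCG ~ 0.9197
```

The irrelevant item at rank 2 costs nothing directly, but pushing the second hit down to rank 3 discounts it from 1/log2(3) to 1/log2(4), which is the entire gap to a perfect score of 1.0.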
# Scoring functions
def precision_at_k(recs, true_set, k):
return len(set(recs[:k]) & true_set) / k if k else 0.0
def recall_at_k(recs, true_set, k):
return len(set(recs[:k]) & true_set) / len(true_set) if true_set else 0.0
def f1_at_k(recs, true_set, k):
p = precision_at_k(recs, true_set, k)
r = recall_at_k(recs, true_set, k)
return 2*p*r/(p+r) if (p+r) > 0 else 0.0
def ndcg_at_k(recs, true_set, k):
dcg = sum(1/np.log2(i+2) for i, p in enumerate(recs[:k]) if p in true_set)
ideal = sum(1/np.log2(i+2) for i in range(min(len(true_set), k)))
return dcg/ideal if ideal else 0.0
def score_recs(recs, true_set, K):
return {
"precision": precision_at_k(recs, true_set, K),
"recall": recall_at_k(recs, true_set, K),
"f1": f1_at_k(recs, true_set, K),
"ndcg": ndcg_at_k(recs, true_set, K),
"hit_rate": 1 if any(p in true_set for p in recs[:K]) else 0,
}
def evaluate_strategy(fn, data, K=10, n=None, seed = 100):
    """Standard per-user evaluator – used for S1–S4."""
rows = data if n is None else data.sample(n=n, random_state=seed)
m = {"precision":[],"recall":[],"f1":[],"ndcg":[],"hit_rate":[]}
for _, row in rows.iterrows():
s = score_recs(fn(row["user_id"], row["cart_sequence"], K=K), row["true_items"], K)
for key in m: m[key].append(s[key])
return {k: float(np.mean(v)) for k, v in m.items()}
def evaluate_lgbm_batched(eval_data, K = 10, n = None, seed = 100, chunk_size=200):
"""
    Batched evaluator for S5 – builds the full feature matrix for
chunk_size users at once, calls predict_proba once per chunk,
then scores. Avoids 3000 individual DataFrame merges.
"""
rows_df = eval_data if n is None else eval_data.sample(n=n, random_state=seed)
rows_df = rows_df.reset_index(drop=True)
m = {"precision":[],"recall":[],"f1":[],"ndcg":[],"hit_rate":[]}
for chunk_start in range(0, len(rows_df), chunk_size):
chunk = rows_df.iloc[chunk_start:chunk_start + chunk_size]
cand_rows = []
for idx, row in chunk.iterrows():
uid = int(row["user_id"])
cart_seq = row["cart_sequence"]
in_cart = set(cart_seq)
cart_sz = len(cart_seq)
personal = [p for p in user_history.get(uid, [])[:MAX_PERSONAL] if p not in in_cart]
globl = [p for p in GLOBAL_POOL if p not in in_cart and p not in set(personal)]
candidates = personal + globl
if not candidates:
candidates = s1_popularity(uid, cart_seq, K)
markov_sc = defaultdict(float)
for prev in cart_seq:
for nxt, p in trans_prob.get(prev, {}).items():
markov_sc[nxt] += p
assoc_sc = defaultdict(float)
for ant_pid in cart_seq:
for con_pid, sc in assoc_by_pid.get(ant_pid, {}).items():
assoc_sc[con_pid] += sc
pos_p = pos_prob.get(min(cart_sz, MAX_POS - 1), {})
for pid in candidates:
cand_rows.append((idx, uid, int(pid), cart_sz,
float(markov_sc.get(pid, 0.0)),
float(assoc_sc.get(pid, 0.0)),
float(pos_p.get(pid, 0.0))))
if not cand_rows:
continue
c_df = pd.DataFrame(cand_rows, columns=[
"idx","user_id","product_id","cart_size",
"markov_score","assoc_score","pos_prob_next"])
c_df = (c_df
.merge(prod_feat, on="product_id", how="left")
.merge(user_feat, on="user_id", how="left")
.merge(up_feat, on=["user_id","product_id"], how="left")
.fillna(0))
# Same column rename as training
cols = list(c_df.columns); seen = {}; new_cols = []
for col in cols:
if col in seen:
seen[col] += 1
new_cols.append(f"{col}_user" if col == "reorder_rate" else f"{col}_{seen[col]}")
else:
seen[col] = 0; new_cols.append(col)
c_df.columns = new_cols
for fc in FEATURE_COLS:
if fc not in c_df.columns: c_df[fc] = 0.0
# One predict_proba call for the whole chunk
c_df["score"] = lgb_model.predict_proba(c_df[FEATURE_COLS].values)[:, 1]
# Score each user from the chunk results
for idx, row in chunk.iterrows():
user_cands = c_df[c_df["idx"] == idx]
if user_cands.empty:
recs = []
else:
recs = user_cands.nlargest(K, "score")["product_id"].tolist()
s = score_recs(recs, row["true_items"], K)
for key in m: m[key].append(s[key])
del c_df, cand_rows; gc.collect()
return {k: float(np.mean(v)) for k, v in m.items()}
# ── Run all five strategies at K=5 and K=10 ─────────────────────────────────
STRATEGY_MAP = {
"S1: Popularity" : s1_popularity,
"S2: Personal History" : s2_personal,
"S3: Assoc Rules" : s3_rules,
"S4: Markov Chain" : s4_markov,
"S5: LightGBM" : s5_lgbm, # inference fn still available for demo use
}
N_EVAL = 1500
all_results = {}
print(f"Evaluating {len(STRATEGY_MAP)} strategies on {N_EVAL} users …\n")
header = f" {'Strategy':<22} {'P@5':>6} {'R@5':>6} {'F1@5':>6} {'NDCG@5':>7} {'P@10':>6} {'R@10':>6} {'F1@10':>6} {'NDCG@10':>8} {'Hit@10':>7} {'s':>4}"
print(header); print("─" * len(header))
for name, fn in STRATEGY_MAP.items():
row_parts = [f" {name:<22}"]
for k_val in [5, 10]:
t0 = time.time()
if name == "S5: LightGBM":
r = evaluate_lgbm_batched(eval_sample, K=k_val, n=N_EVAL)
else:
r = evaluate_strategy(fn, eval_sample, K=k_val, n=N_EVAL)
dt = time.time() - t0
all_results[(name, k_val)] = r
row_parts.append(
f" {r['precision']:6.4f} {r['recall']:6.4f} {r['f1']:6.4f} {r['ndcg']:7.4f}")
row_parts.append(f" {r['hit_rate']:7.4f} {dt:4.1f}s")
print("".join(row_parts))
Evaluating 5 strategies on 1500 users …
Strategy P@5 R@5 F1@5 NDCG@5 P@10 R@10 F1@10 NDCG@10 Hit@10 s
───────────────────────────────────────────────────────────────────────────────────────────────────
S1: Popularity 0.0544 0.0258 0.0313 0.0603 0.0449 0.0400 0.0380 0.0559 0.3200 0.2s
S2: Personal History 0.2148 0.1245 0.1355 0.2518 0.1627 0.1744 0.1469 0.2306 0.6927 0.2s
S3: Assoc Rules 0.1588 0.0996 0.1051 0.1822 0.1444 0.1619 0.1329 0.1891 0.6760 0.2s
S4: Markov Chain 0.2061 0.1146 0.1275 0.2414 0.1516 0.1588 0.1352 0.2164 0.6593 0.5s
S5: LightGBM 0.2241 0.1313 0.1428 0.2588 0.1720 0.1871 0.1554 0.2402 0.7233 49.1s
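The benchmark above relies on a per-user `score_recs` helper defined in an earlier cell. Its definition is not shown in this section, so here is a minimal, self-contained sketch of how such a function is commonly implemented for the five reported metrics (Precision@K, Recall@K, F1@K, NDCG@K with binary relevance, and Hit Rate@K); the function name and exact rounding are assumptions, not the notebook's verbatim code:

```python
import math

def score_recs_sketch(recs, true_items, K):
    """Score one user's top-K recommendation list against their true next basket."""
    true_set = set(true_items)
    topk = recs[:K]
    hits = [1 if p in true_set else 0 for p in topk]
    n_hit = sum(hits)
    precision = n_hit / K
    recall = n_hit / len(true_set) if true_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    # NDCG@K with binary relevance: discounted gain over the ideal ranking
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(true_set), K)))
    ndcg = dcg / ideal if ideal > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "ndcg": ndcg, "hit_rate": 1.0 if n_hit else 0.0}
```

Averaging these per-user dicts over the 1,500 sampled users yields exactly the shape of table printed above.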
# ── Results table ─────────────────────────────────────────────────────────────
result_rows = []
for (name, k), m in all_results.items():
    result_rows.append({"Strategy": name, "K": k, **m})
results_df = pd.DataFrame(result_rows)
print("=== Full Results Table ===")
display(results_df.sort_values(["K","ndcg"], ascending=[True,False]).reset_index(drop=True))
# ── Grouped bar charts ────────────────────────────────────────────────────────
strategy_names = list(STRATEGY_MAP.keys())
palette = sns.color_palette("husl", len(strategy_names))
metrics_info = [("precision","Precision@K"), ("recall","Recall@K"),
                ("f1","F1@K"), ("ndcg","NDCG@K"), ("hit_rate","Hit Rate@K")]
fig, axes = plt.subplots(2, 3, figsize=(22, 10))
for ax_idx, (metric, label) in enumerate(metrics_info):
    ax = axes[ax_idx//3][ax_idx%3]
    x = np.arange(len(strategy_names)); w = 0.35
    for ki, k_val in enumerate([5, 10]):
        vals = [all_results[(s, k_val)][metric] for s in strategy_names]
        bars = ax.bar(x + ki*w, vals, w, label=f"K={k_val}",
                      color=[palette[i] for i in range(len(strategy_names))],
                      alpha=0.7+ki*0.2, edgecolor="white")
        for bar, v in zip(bars, vals):
            ax.text(bar.get_x()+bar.get_width()/2, v+0.001,
                    f"{v:.3f}", ha="center", va="bottom", fontsize=7, rotation=45)
    ax.set_xticks(x+w/2)
    ax.set_xticklabels([s.split(":")[0] for s in strategy_names], fontsize=9)
    ax.set_title(label, fontsize=12, fontweight="bold")
    ax.legend(fontsize=8); ax.set_ylim(0, ax.get_ylim()[1]*1.2)

# Feature importance panel
ax = axes[1][2]
fi_top = fi.head(10)
bars = ax.barh(fi_top.index[::-1], fi_top.values[::-1],
               color=sns.color_palette("viridis", 10))
ax.set_title("LightGBM Feature Importance (top 10)", fontsize=12, fontweight="bold")
ax.set_xlabel("Importance")
cart_feats = {"markov_score","assoc_score","pos_prob_next","cart_size"}
for bar, feat in zip(bars[::-1], fi_top.index):
    if feat in cart_feats:
        ax.text(bar.get_width()*0.03, bar.get_y()+bar.get_height()/2,
                "🛒 cart-ctx", va="center", fontsize=7, color="navy")
plt.suptitle("Strategy Benchmark – All Metrics @ K=5 and K=10",
             fontsize=15, fontweight="bold", y=1.01)
plt.tight_layout(); plt.show()
# ── Radar chart ───────────────────────────────────────────────────────────────
radar_metrics = ["precision","recall","f1","ndcg","hit_rate"]
radar_labels = ["P@10","R@10","F1@10","NDCG@10","Hit@10"]
N = len(radar_metrics)
angles = np.linspace(0, 2*np.pi, N, endpoint=False).tolist(); angles += angles[:1]
fig_r, ax_r = plt.subplots(figsize=(8,8), subplot_kw=dict(polar=True))
colors = sns.color_palette("husl", len(strategy_names))
for i, sname in enumerate(strategy_names):
    vals = [all_results[(sname, 10)][m] for m in radar_metrics]; vals += vals[:1]
    ax_r.plot(angles, vals, linewidth=2, label=sname, color=colors[i])
    ax_r.fill(angles, vals, alpha=0.08, color=colors[i])
ax_r.set_xticks(angles[:-1]); ax_r.set_xticklabels(radar_labels, fontsize=11)
ax_r.set_title("Strategy Profiles @ K=10", fontsize=14, fontweight="bold", pad=20)
ax_r.legend(loc="upper right", bbox_to_anchor=(1.35, 1.15), fontsize=9)
plt.tight_layout(); plt.show()
=== Full Results Table ===

|   | Strategy | K | precision | recall | f1 | ndcg | hit_rate |
|---|---|---|---|---|---|---|---|
| 0 | S5: LightGBM | 5 | 0.2241 | 0.1313 | 0.1428 | 0.2588 | 0.6273 |
| 1 | S2: Personal History | 5 | 0.2148 | 0.1245 | 0.1355 | 0.2518 | 0.5980 |
| 2 | S4: Markov Chain | 5 | 0.2061 | 0.1146 | 0.1275 | 0.2414 | 0.5780 |
| 3 | S3: Assoc Rules | 5 | 0.1588 | 0.0996 | 0.1051 | 0.1822 | 0.4920 |
| 4 | S1: Popularity | 5 | 0.0544 | 0.0258 | 0.0313 | 0.0603 | 0.2333 |
| 5 | S5: LightGBM | 10 | 0.1720 | 0.1871 | 0.1554 | 0.2402 | 0.7233 |
| 6 | S2: Personal History | 10 | 0.1627 | 0.1744 | 0.1469 | 0.2306 | 0.6927 |
| 7 | S4: Markov Chain | 10 | 0.1516 | 0.1588 | 0.1352 | 0.2164 | 0.6593 |
| 8 | S3: Assoc Rules | 10 | 0.1444 | 0.1619 | 0.1329 | 0.1891 | 0.6760 |
| 9 | S1: Popularity | 10 | 0.0449 | 0.0400 | 0.0380 | 0.0559 | 0.3200 |
10. Interactive Prediction Demo & Leaderboard
A unified recommend() call wraps all five strategies behind a single interface. Three progressive demos illustrate the system in action:
- Demo A – LightGBM recommendations updating in real time as items are added one by one to the cart
- Demo B – Side-by-side strategy comparison on a single user's full cart, showing which strategy scores the most hits
- Demo C – Custom cart specified by product name strings (e.g., "banana", "organic milk"), with output from Markov, Rules, and LightGBM
The section closes with a composite leaderboard ranking all five strategies by an equal-weighted average of all reported metrics.
# Unified recommend() wrapper – delegates to whichever strategy is selected
def recommend(user_id, cart_sequence, K=10, strategy="lgbm"):
    """
    strategy: "popularity" | "personal" | "rules" | "markov" | "lgbm"
    Returns a DataFrame: rank, product, department, aisle
    """
    fn_map = {"popularity": s1_popularity,
              "personal":   s2_personal,
              "rules":      s3_rules,
              "markov":     s4_markov,
              "lgbm":       s5_lgbm}
    pids = fn_map.get(strategy, s5_lgbm)(user_id, cart_sequence, K=K)
    out = pd.DataFrame({"rank": range(1, len(pids)+1), "product_id": pids})
    return (out.merge(products[["product_id","product","department","aisle"]],
                      on="product_id", how="left")
               [["rank","product","department","aisle"]])
# Demo A – recommendations update as the cart grows
demo_user = eval_sample["user_id"].iloc[5]
demo_cart = eval_sample["cart_sequence"].iloc[5]
demo_true = eval_sample["true_items"].iloc[5]
print(f"User {demo_user} | true next basket ({len(demo_true)} items):")
print([pid_to_name.get(p,"?")[:25] for p in list(demo_true)[:6]], "...\n")
print("="*72)
print(f" {'Cart':>5} {'Last item added':<30} Top-5 recs (LightGBM) [hits ✅]")
print("="*72)
for reveal in [1, 3, 5, len(demo_cart)]:
    partial = demo_cart[:reveal]
    last = pid_to_name.get(partial[-1],"?")[:28]
    recs = s5_lgbm(demo_user, partial, K=5)
    names = [pid_to_name.get(p,"?")[:16] + ("✅" if p in demo_true else "")
             for p in recs]
    print(f" {reveal:>5} {last:<30} {names}")
User 12578 | true next basket (2 items):
['hazelnut spread with coco', 'lobster bisque'] ...

========================================================================
  Cart Last item added                Top-5 recs (LightGBM) [hits ✅]
========================================================================
     1 organic simply naked pita ch   ['non fat greek yo', 'april fresh liqu', 'organic salted b', 'boneless skinles', 'mango peach sals']
     3 mango peach salsa              ['non fat greek yo', 'april fresh liqu', 'organic salted b', 'boneless skinles', 'cereal']
     5 organic baby spinach           ['non fat greek yo', 'april fresh liqu', 'organic salted b', 'boneless skinles', 'bag of organic b']
     6 bag of organic bananas         ['non fat greek yo', 'april fresh liqu', 'organic salted b', 'boneless skinles', 'strawberries']
# Demo B – side-by-side strategy comparison on a single user
K_demo = 10
print(f"User {demo_user} | Cart size: {len(demo_cart)} | True basket: {len(demo_true)} items")
print(f"Cart: {[pid_to_name.get(p,'?')[:18] for p in demo_cart[:4]]}…\n")
comparison = {}
for sname, fn in STRATEGY_MAP.items():
    recs = fn(demo_user, demo_cart, K=K_demo)
    hits = [p for p in recs if p in demo_true]
    comparison[sname] = {"recs": recs, "hits": len(hits),
                         "P": len(hits)/K_demo,
                         "R": len(hits)/len(demo_true) if demo_true else 0}
comp_df = pd.DataFrame({
    sname: [pid_to_name.get(p,"?")[:28] + (" ✅" if p in demo_true else "")
            for p in d["recs"]]
    for sname, d in comparison.items()
})
comp_df.index = [f"Rec {i+1}" for i in range(K_demo)]
display(comp_df)
print("\nP@10 / R@10 per strategy:")
for sname, d in comparison.items():
    print(f" {sname:<22} hits={d['hits']} P@10={d['P']:.2f} R@10={d['R']:.2f}")
User 12578 | Cart size: 6 | True basket: 2 items
Cart: ['organic simply nak', 'wheat thins origin', 'mango peach salsa', 'cereal']…

|  | S1: Popularity | S2: Personal History | S3: Assoc Rules | S4: Markov Chain | S5: LightGBM |
|---|---|---|---|---|---|
| Rec 1 | banana | non fat greek yogurt | organic hass avocado | non fat greek yogurt | non fat greek yogurt |
| Rec 2 | organic strawberries | organic salted butter | organic strawberries | boneless skinless chicken br | april fresh liquid fabric so |
| Rec 3 | organic hass avocado | boneless skinless chicken br | organic avocado | organic salted butter | organic salted butter |
| Rec 4 | organic avocado | april fresh liquid fabric so | banana | organic hass avocado | boneless skinless chicken br |
| Rec 5 | large lemon | brioche hamburger buns | organic raspberries | organic strawberries | strawberries |
| Rec 6 | strawberries | cheez it cheddar cracker | large lemon | april fresh liquid fabric so | organic tortilla chips |
| Rec 7 | limes | naked green machine boosted | non fat greek yogurt | organic avocado | cheez it cheddar cracker |
| Rec 8 | organic raspberries | chicken pot pie | organic salted butter | organic raspberries | brioche hamburger buns |
| Rec 9 | organic whole milk | organic spaghetti | boneless skinless chicken br | banana | naked green machine boosted |
| Rec 10 | organic yellow onion | strawberries | april fresh liquid fabric so | naked green machine boosted | raspberries |
P@10 / R@10 per strategy:
S1: Popularity hits=0 P@10=0.00 R@10=0.00
S2: Personal History hits=0 P@10=0.00 R@10=0.00
S3: Assoc Rules hits=0 P@10=0.00 R@10=0.00
S4: Markov Chain hits=0 P@10=0.00 R@10=0.00
S5: LightGBM hits=0 P@10=0.00 R@10=0.00
# Demo C – custom cart specified by product name strings
print("\n=== Custom cart: banana, strawberry, organic milk ===")
demo_u2 = eval_sample["user_id"].iloc[10]
cart_names = ["banana","strawberry","organic milk"]
cart_ids = []
for name in cart_names:
    match = [pid for pname, pid in name_to_pid.items() if name.lower() in pname.lower()]
    if match:
        cart_ids.append(match[0])
print(f"Resolved: {[pid_to_name.get(p,'?')[:25] for p in cart_ids]}\n")
for strat in ["markov","rules","lgbm"]:
    print(f"── {strat.upper()} ──")
    display(recommend(demo_u2, cart_ids, K=5, strategy=strat))
=== Custom cart: banana, strawberry, organic milk ===
Resolved: ['banana sweet potato organ', 'light strawberry blueberr', 'organic milk']

── MARKOV ──
|  | rank | product | department | aisle |
|---|---|---|---|---|
| 0 | 1 | happy baby spinachmangoand pear baby food | babies | baby food formula |
| 1 | 2 | blueberry oats and quinoa whole grain snack | babies | baby food formula |
| 2 | 3 | organic banana blueberry baby food puree | babies | baby food formula |
| 3 | 4 | alpine spring water | beverages | water seltzer sparkling water |
| 4 | 5 | berry barley organic baby food | babies | baby food formula |

── RULES ──
|  | rank | product | department | aisle |
|---|---|---|---|---|
| 0 | 1 | happy baby spinachmangoand pear baby food | babies | baby food formula |
| 1 | 2 | blueberry oats and quinoa whole grain snack | babies | baby food formula |
| 2 | 3 | organic banana blueberry baby food puree | babies | baby food formula |
| 3 | 4 | alpine spring water | beverages | water seltzer sparkling water |
| 4 | 5 | berry barley organic baby food | babies | baby food formula |

── LGBM ──
|  | rank | product | department | aisle |
|---|---|---|---|---|
| 0 | 1 | banana | produce | fresh fruits |
| 1 | 2 | organic pearraspberry purple carrot stage pouch | babies | baby food formula |
| 2 | 3 | organic stage broccoli pears peas baby food | babies | baby food formula |
| 3 | 4 | stage spinachapple kale | babies | baby food formula |
| 4 | 5 | peter rabbit organics kale broccoli and mango ... | babies | baby food formula |
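Worth noting: the naive substring match in Demo C resolved "banana" to a baby-food product ("banana sweet potato organ…"), which is why every strategy leans toward the babies department. A stricter resolver would prefer an exact name match and otherwise fall back to the shortest containing name, which tends to be the most generic product. The helper below is a hypothetical sketch (not part of the notebook's code); it assumes a `name_to_pid` dict mapping product name to product_id, as used above:

```python
def resolve_product(query, name_to_pid):
    """Resolve a free-text product query to a product_id.

    Preference order: exact (case-insensitive) name match, then the
    shortest product name containing the query; None if nothing matches."""
    q = query.lower()
    matches = [(pname, pid) for pname, pid in name_to_pid.items()
               if q in pname.lower()]
    if not matches:
        return None
    exact = [pid for pname, pid in matches if pname.lower() == q]
    if exact:
        return exact[0]
    # shortest containing name is usually the base product, not a variant
    return min(matches, key=lambda t: len(t[0]))[1]
```

Swapping this in for the list comprehension in Demo C would make the custom-cart demo resolve "banana" to the plain banana whenever it exists in the catalog.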
# Construct final composite leaderboard across all strategies
lb_rows = []
for sname in strategy_names:
    r5  = all_results[(sname, 5)]
    r10 = all_results[(sname, 10)]
    composite = np.mean([r5["precision"], r5["recall"], r5["f1"], r5["ndcg"],
                         r10["precision"], r10["recall"], r10["f1"], r10["ndcg"],
                         r10["hit_rate"]])
    lb_rows.append({"Strategy": sname,
                    "P@5":  f"{r5['precision']:.4f}",  "R@5":  f"{r5['recall']:.4f}",
                    "F1@5": f"{r5['f1']:.4f}",         "NDCG@5": f"{r5['ndcg']:.4f}",
                    "P@10": f"{r10['precision']:.4f}", "R@10": f"{r10['recall']:.4f}",
                    "F1@10": f"{r10['f1']:.4f}",       "NDCG@10": f"{r10['ndcg']:.4f}",
                    "Hit@10": f"{r10['hit_rate']:.4f}",
                    "Composite ↑": round(composite, 4)})
lb = (pd.DataFrame(lb_rows)
        .sort_values("Composite ↑", ascending=False)
        .reset_index(drop=True))
lb.index += 1
print("\n🏆 Strategy Leaderboard")
display(lb)
print(f"\n🥇 Winner: {lb.iloc[0]['Strategy']}")
print("\nKey takeaways:")
print("  • Cart-sequence signals (Markov + position) lift recs beyond popularity baselines.")
print("  • LightGBM combines all signal types; typically best NDCG (rewards correct ranking).")
print("  • Association Rules add diversity; Markov is fastest at inference.")
print("  • Cold-start users (no history): fall back to S1 + S3.")
🏆 Strategy Leaderboard

|  | Strategy | P@5 | R@5 | F1@5 | NDCG@5 | P@10 | R@10 | F1@10 | NDCG@10 | Hit@10 | Composite ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S5: LightGBM | 0.2241 | 0.1313 | 0.1428 | 0.2588 | 0.1720 | 0.1871 | 0.1554 | 0.2402 | 0.7233 | 0.2483 |
| 2 | S2: Personal History | 0.2148 | 0.1245 | 0.1355 | 0.2518 | 0.1627 | 0.1744 | 0.1469 | 0.2306 | 0.6927 | 0.2371 |
| 3 | S4: Markov Chain | 0.2061 | 0.1146 | 0.1275 | 0.2414 | 0.1516 | 0.1588 | 0.1352 | 0.2164 | 0.6593 | 0.2235 |
| 4 | S3: Assoc Rules | 0.1588 | 0.0996 | 0.1051 | 0.1822 | 0.1444 | 0.1619 | 0.1329 | 0.1891 | 0.6760 | 0.2056 |
| 5 | S1: Popularity | 0.0544 | 0.0258 | 0.0313 | 0.0603 | 0.0449 | 0.0400 | 0.0380 | 0.0559 | 0.3200 | 0.0745 |

🥇 Winner: S5: LightGBM

Key takeaways:
  • Cart-sequence signals (Markov + position) lift recs beyond popularity baselines.
  • LightGBM combines all signal types; typically best NDCG (rewards correct ranking).
  • Association Rules add diversity; Markov is fastest at inference.
  • Cold-start users (no history): fall back to S1 + S3.
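The last takeaway suggests routing cold-start users to S1 + S3 but leaves the routing unimplemented. One simple way to wire it up is a wrapper that interleaves the rules-based and popularity lists for users with no prior orders. This is a sketch under stated assumptions: the strategy callables are passed in explicitly (rather than using the notebook's globals), and the popularity/rules interleave is my choice of blend, not the notebook's:

```python
def recommend_with_fallback(user_id, cart_sequence, K, has_history,
                            popular_fn, rules_fn, model_fn):
    """Cold-start routing sketch.

    Users with purchase history go to the full model (e.g. S5: LightGBM);
    users without history get an interleave of association rules (cart-aware,
    S3) and global popularity (S1). All callables share the signature
    fn(user_id, cart_sequence, K) -> list of product_ids."""
    if has_history:
        return model_fn(user_id, cart_sequence, K)
    # Alternate rules and popularity picks, dropping duplicates, until K items
    recs, seen = [], set()
    for a, b in zip(rules_fn(user_id, cart_sequence, K),
                    popular_fn(user_id, cart_sequence, K)):
        for p in (a, b):
            if p not in seen:
                seen.add(p)
                recs.append(p)
    return recs[:K]
```

In the notebook's context this would be called with `has_history = user_id in user_order_counts` (or whichever user-history lookup earlier cells built) and the existing `s1_popularity`, `s3_rules`, `s5_lgbm` functions.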