Instacart User Data: A Market Basket Analysis & Recommendation Engine
Published:
This post walks through a detailed analysis of Instacart user data, using it to study shopping trends and to build a machine-learning recommendation engine that suggests products to users as they fill their carts.
Instacart Market Basket Analysis – Predictive Deep Dive
This notebook provides an in-depth analysis of the Instacart dataset with a predictive focus.
Data sources:
orders.csv · order_products__prior.csv · order_products__train.csv · products.csv · aisles.csv · departments.csv
Prediction target: Given a user's full purchase history (prior orders) and optionally their current in-session cart, predict which products they will buy next, evaluated against the held-out train orders.
Table of Contents
| # | Section | Description |
|---|---|---|
| 1 | Setup & Data Loading | Import libraries, mount Google Drive, load all 6 CSV files from zip |
| 2 | Dataset Overview | Schema inspection, null checks, row counts, eval-set distribution |
| 3 | EDA – Orders & Timing | When do customers shop? Day-of-week, hour-of-day, order cadence |
| 4 | EDA – Products & Categories | Top products, department breakdown, aisle-level reorder rates |
| 5 | EDA – User Behavior | Cart-size distribution, reorder tendencies, user loyalty patterns |
| 6 | Feature Engineering | Product-, user-, and user-product interaction features for ML |
| 7 | Association Rule Mining | Market basket analysis with Apriori; support, confidence, and lift |
| 8 | Next-Item Prediction Strategies | 5 strategies, from a simple popularity baseline to LightGBM |
| 9 | Model Evaluation | Precision@K, Recall@K, F1@K, NDCG@K, Hit Rate@K benchmarks |
| 10 | Interactive Demo & Leaderboard | Live recommendation demos and final strategy leaderboard |
1. Setup & Data Loading
All six Instacart CSV files are read directly from a zip archive on Google Drive. Casting columns to compact int8/int16/int32 dtypes immediately after loading keeps RAM manageable for the 30 M+ row dataset.
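Before the setup cell, a quick illustration of why the downcasting matters (toy column, not the real data): int64 to int8 is an 8x reduction in per-column memory, which adds up quickly across a 32M-row table.

```python
import numpy as np
import pandas as pd

# Downcasting int64 -> int8 cuts per-column memory 8x; across the
# 32M-row order_products__prior table the savings are substantial.
col = pd.Series(np.zeros(1_000_000, dtype='int64'))
print(col.memory_usage(index=False, deep=True))                 # 8000000 bytes
print(col.astype('int8').memory_usage(index=False, deep=True))  # 1000000 bytes
```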
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import re
import zipfile
import os
from collections import defaultdict, Counter
from itertools import combinations
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
import warnings
from google.colab import drive
import gc
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (roc_auc_score, average_precision_score, classification_report)
from sklearn.model_selection import GroupShuffleSplit
import time
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.4f}'.format)
%matplotlib inline
print('✅ Libraries loaded')
✅ Libraries loaded
# Mount drive to access data
drive.mount('/content/drive/')
Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
data = {}
# Read files from zip
with zipfile.ZipFile('/content/drive/MyDrive/instacart/instacart.zip', 'r') as z:
    file_list = [f for f in z.namelist() if f.endswith('.csv')]
    print(f'Files in archive: {file_list}\n')
    for fn in file_list:
        key = fn.replace('.csv', '')
        # Load each CSV into the data dict, keyed by filename stem
        data[key] = pd.read_csv(z.open(fn))
        print(f'  ✓ {fn}: {data[key].shape[0]:,} rows × {data[key].shape[1]} cols')
print('\n✅ All datasets loaded')
Files in archive: ['aisles.csv', 'departments.csv', 'order_products__prior.csv', 'order_products__train.csv', 'orders.csv', 'products.csv']
  ✓ aisles.csv: 134 rows × 2 cols
  ✓ departments.csv: 21 rows × 2 cols
  ✓ order_products__prior.csv: 32,434,489 rows × 4 cols
  ✓ order_products__train.csv: 1,384,617 rows × 4 cols
  ✓ orders.csv: 3,421,083 rows × 7 cols
  ✓ products.csv: 49,688 rows × 4 cols
✅ All datasets loaded
2. Dataset Overview
The dataset ships as six files that join on shared keys:
| File | Key columns | Purpose |
|---|---|---|
| orders | order_id, user_id | One row per order; includes timing and eval_set label |
| order_products__prior | order_id, product_id | Every item in every prior order |
| order_products__train | order_id, product_id | Items in the single held-out train order per user |
| products | product_id, aisle_id, department_id | Product names and taxonomy |
| aisles | aisle_id | Aisle labels |
| departments | department_id | Department labels |
The eval_set column in orders partitions every user's orders into three splits:
- prior – all historical orders, used to build user profiles and train models
- train – one held-out order per user that acts as the ground truth for model evaluation
- test – the competition hold-out (no order-product rows provided)
# Inspect schema
print('=' * 65)
print('DATASET SCHEMA & NULL REPORT')
print('=' * 65)
for name, df in data.items():
    nulls = df.isnull().sum()
    null_info = {c: n for c, n in nulls.items() if n > 0}
    print(f'\n{name}')
    print(f'  Rows: {df.shape[0]:,} | Cols: {df.shape[1]}')
    print(f'  Columns: {list(df.columns)}')
    if null_info:
        print(f'  ⚠️ Nulls: {null_info}')
    else:
        print(f'  ✓ No nulls')
    display(df.head(3))
=================================================================
DATASET SCHEMA & NULL REPORT
=================================================================
aisles
  Rows: 134 | Cols: 2
  Columns: ['aisle_id', 'aisle']
  ✓ No nulls

| | aisle_id | aisle |
|---|---|---|
| 0 | 1 | prepared soups salads |
| 1 | 2 | specialty cheeses |
| 2 | 3 | energy granola bars |

departments
  Rows: 21 | Cols: 2
  Columns: ['department_id', 'department']
  ✓ No nulls

| | department_id | department |
|---|---|---|
| 0 | 1 | frozen |
| 1 | 2 | other |
| 2 | 3 | bakery |

order_products__prior
  Rows: 32,434,489 | Cols: 4
  Columns: ['order_id', 'product_id', 'add_to_cart_order', 'reordered']
  ✓ No nulls

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |

order_products__train
  Rows: 1,384,617 | Cols: 4
  Columns: ['order_id', 'product_id', 'add_to_cart_order', 'reordered']
  ✓ No nulls

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 1 | 49302 | 1 | 1 |
| 1 | 1 | 11109 | 2 | 1 |
| 2 | 1 | 10246 | 3 | 0 |

orders
  Rows: 3,421,083 | Cols: 7
  Columns: ['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow', 'order_hour_of_day', 'days_since_prior_order']
  ⚠️ Nulls: {'days_since_prior_order': 206209}

| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0000 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0000 |

products
  Rows: 49,688 | Cols: 4
  Columns: ['product_id', 'product_name', 'aisle_id', 'department_id']
  ✓ No nulls

| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
# Inspect the eval data
eval_counts = data['orders']['eval_set'].value_counts()
print('Orders by eval_set:')
for k, v in eval_counts.items():
    print(f'  {k:8s}: {v:>10,} orders ({v/len(data["orders"])*100:.1f}%)')
print(f'\nTotal unique users: {data["orders"]["user_id"].nunique():,}')
print(f'Total unique products: {data["products"].shape[0]:,}')
print(f'Total aisles: {data["aisles"].shape[0]} | Departments: {data["departments"].shape[0]}')
Orders by eval_set:
prior : 3,214,874 orders (94.0%)
train : 131,209 orders (3.8%)
test : 75,000 orders (2.2%)
Total unique users: 206,209
Total unique products: 49,688
Total aisles: 134 | Departments: 21
Note on the data split: Each user's most recent order is held out and labelled either train (131,209 users) or test (75,000 users); all remaining orders are labelled prior. Models are trained on prior interactions and evaluated by predicting the contents of the train order. The test split has no product labels and is not used in this notebook.
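A quick sanity check on this structure, sketched here with a toy frame that mirrors the Instacart schema (on the real data you would run the same groupby against data['orders']):

```python
import pandas as pd

# Toy orders frame: each user's final order is held out as 'train' or
# 'test'; all earlier orders are 'prior'.
orders = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'order_number': [1, 2, 3, 1, 2, 3],
    'eval_set': ['prior', 'prior', 'train', 'prior', 'prior', 'test'],
})

# Every user should have exactly one held-out (train or test) order.
held_out = (orders[orders['eval_set'].isin(['train', 'test'])]
            .groupby('user_id')
            .size())
print(held_out.eq(1).all())  # True
```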
3. EDA – Orders & Timing
Understanding when customers shop reveals weekly and intra-day demand rhythms that carry useful signals for models; e.g., a user who always shops on Sunday mornings is likely to continue that pattern. This section also examines how often users return and how their inter-order gap is distributed.
# Normalize product names: lowercase, strip punctuation and digits, collapse whitespace
data['products']['product_name'] = (
    data['products']['product_name']
    .str.lower()
    .str.replace(r'[^\w\s]', ' ', regex=True)
    .str.replace(r'\d+', ' ', regex=True)
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)
# Merge the aisle/department codes with their labels
products = (data['products']
            .merge(data['aisles'], on='aisle_id', how='left')
            .merge(data['departments'], on='department_id', how='left')
            .drop(['aisle_id', 'department_id'], axis=1)
            .rename(columns={'product_name': 'product'}))
print(f'Products enriched: {products.shape}')
products.head(3)
Products enriched: (49688, 4)
| product_id | product | aisle | department | |
|---|---|---|---|---|
| 0 | 1 | chocolate sandwich cookies | cookies cakes | snacks |
| 1 | 2 | all seasons salt | spices seasonings | pantry |
| 2 | 3 | robust golden unsweetened oolong tea | tea | beverages |
# Visualize order volumes
fig, axes = plt.subplots(1, 2, figsize=(20, 5))
dow_labels = {0: 'Sun', 1: 'Mon', 2: 'Tue', 3: 'Wed', 4: 'Thu', 5: 'Fri', 6: 'Sat'}
dow_counts = data['orders']['order_dow'].map(dow_labels).value_counts().reindex(['Sun','Mon','Tue','Wed','Thu','Fri','Sat'])
colors_dow = sns.color_palette('husl', 7)
bars = axes[0].bar(dow_counts.index, dow_counts.values, color = colors_dow, edgecolor = 'white', linewidth = 0.8)
axes[0].set_title('Orders by Day of Week', fontsize = 14, fontweight = 'bold')
axes[0].set_xlabel('Day of Week'); axes[0].set_ylabel('Number of Orders')
# Annotate each bar with its order count
for bar, v in zip(bars, dow_counts.values):
    axes[0].text(bar.get_x() + bar.get_width()/2, v + 1500, f'{v:,}', ha='center', va='bottom', fontsize=8)
hour_counts = data['orders']['order_hour_of_day'].value_counts().sort_index()
axes[1].plot(hour_counts.index, hour_counts.values, marker = 'o', linewidth = 2.5, color = '#2196F3', markersize = 5)
axes[1].fill_between(hour_counts.index, hour_counts.values, alpha=0.15, color='#2196F3')
axes[1].axvspan(8, 11, alpha=0.1, color='orange', label='Peak morning (8-11)')
axes[1].axvspan(14, 17, alpha=0.1, color='green', label='Peak afternoon (14-17)')
axes[1].set_title('Orders by Hour of Day', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Hour of Day'); axes[1].set_ylabel('Number of Orders')
axes[1].set_xticks(range(0, 24)); axes[1].legend()
plt.suptitle('When Do Customers Shop?', fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
print(f'Peak ordering day: {dow_counts.idxmax()} ({dow_counts.max():,} orders)')
print(f'Peak ordering hour: {hour_counts.idxmax()}:00 ({hour_counts.max():,} orders)')
Peak ordering day: Sun (600,905 orders)
Peak ordering hour: 10:00 (288,418 orders)
# Visualize user data
fig, axes = plt.subplots(1, 2, figsize = (20, 5))
# Days since prior order
dspo = data['orders']['days_since_prior_order'].dropna()
axes[0].hist(dspo, bins = 31, color = '#4CAF50', edgecolor = 'white', alpha = 0.85)
axes[0].axvline(dspo.mean(), color = 'red', linestyle = '--', lw = 2, label = f'Mean: {dspo.mean():.1f}d')
axes[0].axvline(dspo.median(), color = 'orange', linestyle = '--', lw = 2, label = f'Median: {dspo.median():.0f}d')
axes[0].axvline(7, color = 'gray', linestyle = ':', alpha = 0.7, label = 'Weekly')
axes[0].axvline(30, color = 'gray', linestyle = ':', alpha = 0.7, label = 'Monthly')
axes[0].set_title('Days Since Prior Order Distribution', fontsize = 14, fontweight = 'bold')
axes[0].set_xlabel('Days'); axes[0].set_ylabel('Frequency')
axes[0].legend()
# Orders per user
orders_per_user = data['orders'].groupby('user_id')['order_id'].count()
axes[1].hist(orders_per_user, bins=50, color='#9C27B0', edgecolor='white', alpha=0.85)
axes[1].axvline(orders_per_user.mean(), color='red', linestyle='--', lw=2, label=f'Mean: {orders_per_user.mean():.1f}')
axes[1].set_title('Orders per User (all-time)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Number of Orders'); axes[1].set_ylabel('Number of Users')
axes[1].legend()
plt.tight_layout()
plt.show()
print(f'Average days between orders: {dspo.mean():.1f}')
print(f'Orders placed ≤7d after the previous: {(dspo <= 7).mean()*100:.1f}%')
print(f'Orders placed ~monthly (25–35d gap): {(dspo.between(25,35)).mean()*100:.1f}%')
print(f'Users with 10+ orders: {(orders_per_user >= 10).sum():,} ')
print(f'({(orders_per_user >= 10).mean()*100:.1f}%)')
Average days between orders: 11.1
Orders placed ≤7d after the previous: 50.4%
Orders placed ~monthly (25–35d gap): 14.8%
Users with 10+ orders: 110,728
(53.7%)
# Truncate the data and track memory
all_user_ids = data['orders']['user_id'].unique()
sampled_uids = set(np.random.default_rng(100).choice(all_user_ids, size = min(150000, len(all_user_ids)), replace = False))
print(f'Working with {len(sampled_uids):,} sampled users')
# Filter the orders index to only sampled users (still cheap)
orders_slim = data['orders'][data['orders']['user_id'].isin(sampled_uids)][['order_id','user_id','eval_set','order_number', 'order_dow','order_hour_of_day','days_since_prior_order']]
# Cast to smaller dtypes immediately
orders_slim = orders_slim.astype({'order_id': 'int32',
'user_id': 'int32',
'order_number': 'int16',
'order_dow': 'int8',
'order_hour_of_day': 'int8',})
print(f'orders_slim: {orders_slim.shape} | {orders_slim.memory_usage(deep = True).sum()/1e6:.1f} MB')
# Get only the order_ids belonging to sampled users
sampled_order_ids = set(orders_slim['order_id'].values)
# Filter prior and train BEFORE any merge
prior_raw = data['order_products__prior'][data['order_products__prior']['order_id'].isin(sampled_order_ids)].astype({'order_id':'int32','product_id':'int32','add_to_cart_order':'int16','reordered':'int8'})
train_raw = data['order_products__train'][data['order_products__train']['order_id'].isin(sampled_order_ids)].astype({'order_id':'int32','product_id':'int32','add_to_cart_order':'int16','reordered':'int8'})
print(f'prior_raw: {prior_raw.shape[0]:,} rows | {prior_raw.memory_usage(deep = True).sum()/1e6:.1f} MB')
print(f'train_raw: {train_raw.shape[0]:,} rows | {train_raw.memory_usage(deep = True).sum()/1e6:.1f} MB')
# Downcast products lookup (tiny, but keeps dtypes consistent)
products_slim = products.copy()
products_slim['product_id'] = products_slim['product_id'].astype('int32')
# Enrich prior and train separately: small frames now, safe to merge
def enrich(raw_df, eval_label):
    return (raw_df.merge(products_slim, on='product_id', how='left')
                  .merge(orders_slim.drop(columns='eval_set'), on='order_id', how='left')
                  .assign(eval_set=eval_label))
prior_op = enrich(prior_raw, 'prior')
del prior_raw; gc.collect()
train_op = enrich(train_raw, 'train')
del train_raw; gc.collect()
orders_master = pd.concat([prior_op, train_op], ignore_index = True)
print(f'\norders_master: {orders_master.shape}')
print(f' prior rows: {prior_op.shape[0]:,}')
print(f' train rows: {train_op.shape[0]:,}')
print(f' total RAM: {orders_master.memory_usage(deep=True).sum()/1e6:.1f} MB')
Working with 150,000 sampled users
orders_slim: (2485904, 7) | 203.8 MB
prior_raw: 23,525,045 rows | 447.0 MB
train_raw: 1,007,429 rows | 19.1 MB
orders_master: (24532474, 13)
prior rows: 23,525,045
train rows: 1,007,429
total RAM: 6748.3 MB
4. EDA – Products & Categories
This section examines the product catalogue: which items are ordered most, how order volume breaks down by department, and, crucially, which categories command the highest reorder rates. High reorder rate is the strongest single predictor used in later models.
# Visualize top products
fig, axes = plt.subplots(1, 3, figsize = (20, 6))
# Top 20 products
top_prod = prior_op['product'].value_counts().head(20)
axes[0].barh(top_prod.index[::-1], top_prod.values[::-1], color = sns.color_palette('viridis', 20))
axes[0].set_title('Top 20 Products (Prior Orders)', fontsize = 12, fontweight = 'bold')
axes[0].set_xlabel('Order Count')
for i, v in enumerate(top_prod.values[::-1]):
    axes[0].text(v + 1000, i, f'{v:,}', va='center', fontsize=7)
# Department breakdown
dept_counts = prior_op['department'].value_counts().head(12)
axes[1].bar(range(len(dept_counts)), dept_counts.values, color = sns.color_palette('rocket', len(dept_counts)))
axes[1].set_xticks(range(len(dept_counts)))
axes[1].set_xticklabels(dept_counts.index, rotation=40, ha='right', fontsize=9)
axes[1].set_title('Orders by Department', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Order Count')
# Reorder rate by department
dept_reorder = prior_op.groupby('department')['reordered'].mean().sort_values(ascending=False).head(12)
bars = axes[2].barh(dept_reorder.index[::-1], dept_reorder.values[::-1], color = sns.color_palette('mako', len(dept_reorder)))
axes[2].axvline(0.5, color='red', linestyle='--', alpha=0.6, label='50% line')
axes[2].set_title('Reorder Rate by Department', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Reorder Rate')
axes[2].legend()
for bar in bars:
    w = bar.get_width()
    axes[2].text(w + 0.005, bar.get_y() + bar.get_height()/2, f'{w:.1%}', va='center', fontsize=8)
plt.suptitle('Product & Category Landscape', fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
print(f'Overall reorder rate: {prior_op["reordered"].mean()*100:.1f}%')
print(f'Highest reorder dept: {dept_reorder.idxmax()} ({dept_reorder.max():.1%})')
print(f'Lowest reorder dept: {dept_reorder.idxmin()} ({dept_reorder.min():.1%})')
Overall reorder rate: 58.9%
Highest reorder dept: dairy eggs (67.0%)
Lowest reorder dept: breakfast (56.1%)
# Visualize frequency of product categories
aisle_summary = (prior_op
                 .groupby('aisle')
                 .agg(order_count=('order_id', 'count'),
                      reorder_rate=('reordered', 'mean'))
                 .reset_index()
                 .sort_values('order_count', ascending=False)
                 .head(30))
fig, ax = plt.subplots(figsize=(20, 8))
scatter = ax.scatter(aisle_summary['order_count'],
aisle_summary['reorder_rate'],
c = aisle_summary['reorder_rate'],
cmap = 'RdYlGn',
s = 100,
alpha = 0.8,
edgecolors = 'gray',
linewidth = 0.5)
for _, row in aisle_summary.iterrows():
    ax.annotate(row['aisle'], (row['order_count'], row['reorder_rate']), fontsize=7, ha='left', va='bottom', alpha=0.85)
plt.colorbar(scatter, ax=ax, label='Reorder Rate')
ax.set_xlabel('Total Orders (Prior)', fontsize=12)
ax.set_ylabel('Reorder Rate', fontsize=12)
ax.set_title('Top 30 Aisles: Volume vs Reorder Rate\n(high-right = popular & habitual)', fontsize = 13, fontweight='bold')
ax.axhline(prior_op['reordered'].mean(), color='gray', linestyle='--', alpha=0.5, label='Avg reorder rate')
ax.legend()
plt.tight_layout()
plt.show()
5. EDA – User Behavior
Having explored the product side, we now profile the users: how large are their baskets, how habitual are they (do they keep buying the same items?), and how does reorder tendency vary across the user base? These patterns motivate several of the features engineered in Section 6.
# Visualize cart size and user behavior patterns
fig, axes = plt.subplots(1, 3, figsize=(20, 5))
# Cart size distribution
cart_sizes = prior_op.groupby('order_id')['product_id'].count()
axes[0].hist(cart_sizes.clip(upper=50), bins=50, color='#FF5722', edgecolor='white', alpha=0.85)
axes[0].axvline(cart_sizes.mean(), color='black', linestyle='--', lw = 2, label = f'Mean: {cart_sizes.mean():.1f}')
axes[0].set_title('Cart Size Distribution', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Items per Order (capped at 50)'); axes[0].set_ylabel('Orders')
axes[0].legend()
# Reorder rate by cart position
atco = prior_op.groupby('add_to_cart_order')['reordered'].mean()
atco = atco[atco.index <= 30]
axes[1].plot(atco.index, atco.values, marker='o', linewidth=2.5, color='#FF9800')
axes[1].fill_between(atco.index, atco.values, alpha=0.15, color='#FF9800')
axes[1].set_title('Reorder Rate by Cart Position', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Add-to-Cart Position'); axes[1].set_ylabel('Reorder Rate')
axes[1].set_ylim(0, 1)
axes[1].annotate('First items\nare most habitual', xy=(1, atco.iloc[0]), xytext=(5, 0.85), arrowprops=dict(arrowstyle='->', color='gray'), fontsize=9)
# User reorder rate distribution
user_reorder = prior_op.groupby('user_id')['reordered'].mean()
axes[2].hist(user_reorder, bins=40, color='#3F51B5', edgecolor='white', alpha=0.85)
axes[2].axvline(user_reorder.mean(), color='red', linestyle='--', lw=2,
label=f'Mean: {user_reorder.mean():.2f}')
axes[2].set_title('User Reorder Rate Distribution', fontsize=13, fontweight='bold')
axes[2].set_xlabel('Reorder Rate'); axes[2].set_ylabel('Number of Users')
axes[2].legend()
plt.suptitle('User Behavior Patterns', fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
print(f'Average cart size: {cart_sizes.mean():.1f} items')
print(f'Median cart size: {cart_sizes.median():.0f} items')
print(f'% users with reorder rate > 50%: {(user_reorder > 0.5).mean()*100:.1f}%')
print(f'First-position reorder rate: {atco.iloc[0]:.1%}')
Average cart size: 10.1 items
Median cart size: 8 items
% users with reorder rate > 50%: 38.4%
First-position reorder rate: 67.7%
6. Feature Engineering
We engineer three tiers of features that capture different aspects of purchasing behaviour:
| Tier | Features | Captures |
|---|---|---|
| Product | Order count, reorder rate, avg/std cart position, global popularity rank | How popular and habitual a product is across all users |
| User | Total orders, avg cart size, reorder rate, avg/std days between orders, variety ratio | A user's overall shopping style and cadence |
| User × Product | Times ordered, reorder rate, avg cart position, recency gap, reorder score | How strongly this user is attached to this product |
The reorder_score = (times_ordered × reorder_rate) / (orders_since_last + 1) acts as a recency-weighted loyalty signal and is used in several strategies.
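A concrete illustration of the formula with toy numbers (not taken from the dataset): frequency and habit push the score up, while a long gap since the last purchase pulls it down.

```python
def reorder_score(times_ordered, reorder_rate, orders_since_last):
    """Recency-weighted loyalty: frequent, habitual, recently-bought
    products score highest."""
    return (times_ordered * reorder_rate) / (orders_since_last + 1)

# A product bought 10 times, reordered 90% of the time, present in the
# user's most recent order (gap = 0):
print(reorder_score(10, 0.9, 0))   # 9.0
# The same product last seen 5 orders ago scores far lower:
print(reorder_score(10, 0.9, 5))   # 1.5
```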
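The code cell that builds the product-feature table below did not survive the export. Here is a minimal sketch of the aggregation it likely performs, with column names read off the output; the real cell reports 11 columns in total, while this sketch covers only the displayed ones. The toy frame and the helper name build_product_features are illustrative assumptions.

```python
import pandas as pd

def build_product_features(prior_op: pd.DataFrame) -> pd.DataFrame:
    """Aggregate product-level features from enriched prior order lines.

    Expects one row per (order, product) with columns: order_id,
    product_id, product, department, user_id, reordered,
    add_to_cart_order.
    """
    return (prior_op
            .groupby(['product_id', 'product', 'department'])
            .agg(order_count=('order_id', 'count'),
                 reorder_rate=('reordered', 'mean'),
                 unique_users=('user_id', 'nunique'),
                 avg_cart_position=('add_to_cart_order', 'mean'))
            .reset_index()
            .sort_values('order_count', ascending=False)
            .reset_index(drop=True))

# Toy demonstration (illustrative rows, not real Instacart data):
toy = pd.DataFrame({
    'order_id': [1, 1, 2, 2, 3],
    'product_id': [10, 20, 10, 20, 10],
    'product': ['banana', 'limes', 'banana', 'limes', 'banana'],
    'department': ['produce'] * 5,
    'user_id': [1, 1, 2, 2, 1],
    'reordered': [1, 0, 0, 1, 1],
    'add_to_cart_order': [1, 2, 1, 2, 1],
})
feats = build_product_features(toy)
print(feats)
```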
Product features shape: (49549, 11)
Top 10 most ordered products:
| | product | department | order_count | reorder_rate | unique_users | avg_cart_position |
|---|---|---|---|---|---|---|
| 0 | banana | produce | 342433 | 0.8432 | 53698 | 4.8967 |
| 1 | bag of organic bananas | produce | 274530 | 0.8316 | 46219 | 5.0874 |
| 2 | organic strawberries | produce | 191724 | 0.7767 | 42804 | 7.2159 |
| 3 | organic baby spinach | produce | 175904 | 0.7720 | 40100 | 7.3903 |
| 4 | organic hass avocado | produce | 155268 | 0.7971 | 31499 | 6.7534 |
| 5 | organic avocado | produce | 128942 | 0.7588 | 31102 | 6.4500 |
| 6 | large lemon | produce | 110732 | 0.6961 | 33648 | 7.9196 |
| 7 | strawberries | produce | 103918 | 0.6991 | 31274 | 7.1101 |
| 8 | limes | produce | 100932 | 0.6787 | 32429 | 8.5485 |
| 9 | organic raspberries | produce | 99192 | 0.7678 | 23036 | 7.2162 |
# Derive user features
user_order_enriched = prior_op.merge(data['orders'][['order_id','order_number','days_since_prior_order']], on = 'order_id', how = 'left', suffixes = ('','_o'))
user_stats = user_order_enriched.groupby('user_id').agg(total_orders = ('order_id', 'nunique'),
total_items_purchased = ('product_id', 'count'),
unique_products = ('product_id', 'nunique'),
avg_cart_size = ('order_id', lambda x: x.count() / x.nunique()),
reorder_rate = ('reordered', 'mean'),
avg_days_between = ('days_since_prior_order', 'mean'),
std_days_between = ('days_since_prior_order', 'std'),).reset_index()
user_stats['variety_ratio'] = (user_stats['unique_products'] / user_stats['total_items_purchased'])
print(f'User features shape: {user_stats.shape}')
print('\nUser statistics summary:')
display(user_stats[['total_orders','avg_cart_size','reorder_rate', 'avg_days_between','variety_ratio']].describe())
User features shape: (150000, 9)
User statistics summary:
| | total_orders | avg_cart_size | reorder_rate | avg_days_between | variety_ratio |
|---|---|---|---|---|---|
| count | 150000.0000 | 150000.0000 | 150000.0000 | 150000.0000 | 150000.0000 |
| mean | 15.5727 | 9.9431 | 0.4320 | 15.4821 | 0.5680 |
| std | 16.6764 | 5.8516 | 0.2121 | 7.2099 | 0.2121 |
| min | 3.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0105 |
| 25% | 5.0000 | 5.7500 | 0.2676 | 9.5582 | 0.4046 |
| 50% | 9.0000 | 8.9333 | 0.4286 | 14.7143 | 0.5714 |
| 75% | 19.0000 | 13.0000 | 0.5954 | 20.7500 | 0.7324 |
| max | 99.0000 | 70.2500 | 0.9895 | 30.0000 | 1.0000 |
# Find user/product interactions
up_stats = user_order_enriched.groupby(['user_id','product_id']).agg(times_ordered = ('order_id', 'count'),
times_reordered = ('reordered', 'sum'),
avg_cart_position = ('add_to_cart_order', 'mean'),
last_order_num = ('order_number', 'max')).reset_index()
user_max_order = (data['orders'][data['orders']['eval_set'] == 'prior']
                  .groupby('user_id')['order_number']
                  .max()
                  .reset_index()
                  .rename(columns={'order_number': 'max_order_num'}))
up_stats = up_stats.merge(user_max_order, on='user_id', how='left')
up_stats['orders_since_last'] = up_stats['max_order_num'] - up_stats['last_order_num']
up_stats['reorder_rate'] = up_stats['times_reordered'] / up_stats['times_ordered'].clip(lower=1)
# Recency-weighted score: high orders + high reorder rate + low gap
up_stats['reorder_score'] = (up_stats['times_ordered'] * up_stats['reorder_rate']) / (up_stats['orders_since_last'] + 1)
print(f'User-product interaction features: {up_stats.shape}')
print(f'Avg interactions per user: {len(up_stats)/up_stats["user_id"].nunique():.1f} products')
up_stats.head()
User-product interaction features: (9657625, 10)
Avg interactions per user: 64.4 products
| | user_id | product_id | times_ordered | times_reordered | avg_cart_position | last_order_num | max_order_num | orders_since_last | reorder_rate | reorder_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 196 | 10 | 9 | 1.4000 | 10 | 10 | 0 | 0.9000 | 9.0000 |
| 1 | 1 | 10258 | 9 | 8 | 3.3333 | 10 | 10 | 0 | 0.8889 | 8.0000 |
| 2 | 1 | 10326 | 1 | 0 | 5.0000 | 5 | 10 | 5 | 0.0000 | 0.0000 |
| 3 | 1 | 12427 | 10 | 9 | 3.3000 | 10 | 10 | 0 | 0.9000 | 9.0000 |
| 4 | 1 | 13032 | 3 | 2 | 6.3333 | 10 | 10 | 0 | 0.6667 | 2.0000 |
7. Association Rule Mining (Market Basket Analysis)
We apply the Apriori algorithm to a stratified sample of prior orders to discover which products are frequently purchased together.
Key metrics:
- Support – fraction of baskets containing the itemset; a measure of how common the pair is
- Confidence – P(B | A), the probability B is bought given A is already in the basket
- Lift – confidence ÷ P(B alone); values > 1 indicate a genuine positive association beyond chance
Sampling strategy:
1. Draw 50,000 users from the large prior set
2. Use only each user's most recent prior order (most predictive of future behaviour)
3. Restrict the product space to the top-5,000 items to keep the boolean matrix tractable
4. Set min_support = 0.01 (at least 1% of sample baskets) so item pairs are discoverable
5. Retain only rules with lift > 1.0, i.e. positive associations only
The resulting rule dictionaries (rule_lookup, rule_single) are used directly by Strategy 3, the Association Rules recommender.
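To make support, confidence, and lift concrete, here they are computed by hand on four toy baskets (illustrative data, not drawn from Instacart):

```python
baskets = [
    {'lemon', 'limes', 'banana'},
    {'lemon', 'limes'},
    {'banana'},
    {'lemon', 'banana'},
]
n = len(baskets)

# Support of {lemon, limes}: fraction of baskets containing both.
support = sum('lemon' in b and 'limes' in b for b in baskets) / n   # 2/4 = 0.5
# Confidence of lemon -> limes: P(limes | lemon).
conf = support / (sum('lemon' in b for b in baskets) / n)           # 0.5 / 0.75
# Lift: confidence / P(limes); > 1 means a positive association.
lift = conf / (sum('limes' in b for b in baskets) / n)              # (2/3) / 0.5

print(f'support={support:.2f}  confidence={conf:.2f}  lift={lift:.2f}')
# support=0.50  confidence=0.67  lift=1.33
```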
# Get a sample, increase for more comprehensive rules (takes longer)
rng = np.random.default_rng(100)
sample_uids = rng.choice(prior_op['user_id'].unique(), size = 50000, replace = False)
# Use the most recent prior order per sampled user (most predictive of future)
recent_prior_orders = (data['orders'][(data['orders']['user_id'].isin(sample_uids)) & (data['orders']['eval_set'] == 'prior')].sort_values(['user_id', 'order_number']).groupby('user_id').last().reset_index()[['user_id', 'order_id']])
sample_op = prior_op[prior_op['order_id'].isin(recent_prior_orders['order_id'])]
print(f'Sample baskets: {sample_op["order_id"].nunique():,}')
print(f'Avg basket size: {sample_op.groupby("order_id")["product"].count().mean():.1f} items')
print(f'Total unique products: {sample_op["product"].nunique():,}')
Sample baskets: 50,000
Avg basket size: 10.4 items
Total unique products: 30,430
# Reduce the product space by filtering the top products
top_products = (sample_op['product'].value_counts().head(5000).index.tolist())
top_product_set = set(top_products)
# Filter each basket to only include top-N products
filtered_baskets = (sample_op[sample_op['product'].isin(top_product_set)].groupby('order_id')['product'].apply(list).tolist())
# Drop baskets that now have fewer than 2 items (no pairs possible)
filtered_baskets = [b for b in filtered_baskets if len(b) >= 2]
print(f'Baskets after filtering: {len(filtered_baskets):,}')
print(f'Avg basket size (top-N only): {np.mean([len(b) for b in filtered_baskets]):.1f} items')
print(f'Product space: {len(top_products):,} items')
print(f'\nMin support threshold (1%): {int(0.01 * len(filtered_baskets)):,} baskets required')
Baskets after filtering: 45,405
Avg basket size (top-N only): 9.1 items
Product space: 5,000 items
Min support threshold (1%): 454 baskets required
# Perform apriori
te = TransactionEncoder()
te_array = te.fit_transform(filtered_baskets)
basket_df = pd.DataFrame(te_array, columns = te.columns_)
# Get frequencies
frequent_itemsets = apriori(basket_df, min_support=0.01, use_colnames=True, max_len=2)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
# Mine the rules
rules = association_rules(frequent_itemsets, metric = 'confidence', min_threshold = 0.05)
rules = rules[rules['lift'] > 1.0].sort_values('lift', ascending = False).reset_index(drop = True)
# Summarize results
print(f'Frequent itemsets: {len(frequent_itemsets):,}')
print(f' Singletons: {(frequent_itemsets["length"]==1).sum()}')
print(f' Pairs: {(frequent_itemsets["length"]==2).sum()}')
print(f'\nAssociation rules (lift > 1): {len(rules):,}')
print('\nTop 20 rules by lift:')
display(rules[['antecedents','consequents','support','confidence','lift']].head(20).assign(antecedents=lambda df: df['antecedents'].apply(lambda x: list(x)[0]), consequents = lambda df: df['consequents'].apply(lambda x: list(x)[0])))
Frequent itemsets: 134
Singletons: 115
Pairs: 19
Association rules (lift > 1): 38
Top 20 rules by lift:
| | antecedents | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 0 | large lemon | limes | 0.0115 | 0.1746 | 3.6959 |
| 1 | limes | large lemon | 0.0115 | 0.2429 | 3.6959 |
| 2 | organic raspberries | organic strawberries | 0.0121 | 0.2622 | 3.1958 |
| 3 | organic strawberries | organic raspberries | 0.0121 | 0.1474 | 3.1958 |
| 4 | bag of organic bananas | organic hass avocado | 0.0198 | 0.1536 | 2.5129 |
| 5 | organic hass avocado | bag of organic bananas | 0.0198 | 0.3242 | 2.5129 |
| 6 | organic fuji apple | banana | 0.0105 | 0.3856 | 2.4618 |
| 7 | banana | organic fuji apple | 0.0105 | 0.0671 | 2.4618 |
| 8 | organic raspberries | bag of organic bananas | 0.0143 | 0.3095 | 2.3986 |
| 9 | bag of organic bananas | organic raspberries | 0.0143 | 0.1106 | 2.3986 |
| 10 | organic hass avocado | organic strawberries | 0.0117 | 0.1913 | 2.3316 |
| 11 | organic strawberries | organic hass avocado | 0.0117 | 0.1426 | 2.3316 |
| 12 | banana | honeycrisp apple | 0.0102 | 0.0648 | 2.3303 |
| 13 | honeycrisp apple | banana | 0.0102 | 0.3650 | 2.3303 |
| 14 | organic baby spinach | organic avocado | 0.0117 | 0.1406 | 2.2902 |
| 15 | organic avocado | organic baby spinach | 0.0117 | 0.1905 | 2.2902 |
| 16 | organic strawberries | bag of organic bananas | 0.0231 | 0.2813 | 2.1807 |
| 17 | bag of organic bananas | organic strawberries | 0.0231 | 0.1789 | 2.1807 |
| 18 | organic hass avocado | organic baby spinach | 0.0105 | 0.1718 | 2.0662 |
| 19 | organic baby spinach | organic hass avocado | 0.0105 | 0.1263 | 2.0662 |
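The three rule metrics are algebraically linked, so the table above can be cross-checked by hand. A small sanity check on the top rule (the reported values are rounded, so expect tiny discrepancies):

```python
# Cross-check the top rule (large lemon -> limes) from the table above.
support_ab  = 0.0115    # support({large lemon, limes})
conf_a_to_b = 0.1746    # confidence(large lemon -> limes)
lift_ab     = 3.6959    # lift (symmetric in both directions)

support_a   = support_ab / conf_a_to_b   # confidence = support(A∪B) / support(A)
support_b   = conf_a_to_b / lift_ab      # lift = confidence / support(B)
conf_b_to_a = support_ab / support_b     # confidence of the reverse rule

print(f"support(large lemon)       ~ {support_a:.4f}")
print(f"support(limes)             ~ {support_b:.4f}")
print(f"conf(limes -> large lemon) ~ {conf_b_to_a:.4f}  (table: 0.2429)")
```

The recovered reverse-rule confidence (~0.2434) matches the tabulated 0.2429 up to rounding of the inputs.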
# Visualize association rules
fig, axes = plt.subplots(1, 2, figsize=(20, 6))
# Scatter: support × confidence, coloured by lift
sc = axes[0].scatter(rules['support'], rules['confidence'], c = rules['lift'], cmap = 'YlOrRd', alpha = 0.7, s = 40, edgecolors = 'gray', linewidth = 0.3)
plt.colorbar(sc, ax = axes[0], label = 'Lift')
axes[0].set_xlabel('Support'); axes[0].set_ylabel('Confidence')
axes[0].set_title('Support × Confidence (colour = Lift)', fontsize=12, fontweight='bold')
# Top 15 rules by lift β horizontal bar
top15 = rules.head(15).copy()
top15['rule_label'] = top15.apply(lambda r: f"{list(r['antecedents'])[0][:22]} → {list(r['consequents'])[0][:22]}", axis=1)
bar_colors = sns.color_palette('YlOrRd', 15)
axes[1].barh(top15['rule_label'][::-1], top15['lift'][::-1], color=bar_colors)
axes[1].axvline(1.0, color='gray', linestyle='--', alpha=0.6, label='Lift = 1 (independent)')
axes[1].set_xlabel('Lift'); axes[1].set_title('Top 15 Rules by Lift', fontsize=12, fontweight='bold')
axes[1].legend(); axes[1].tick_params(axis='y', labelsize=8)
plt.suptitle('Market Basket – Association Rules', fontsize=15, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
# Build rule lookup dicts (used by prediction model): ant_frozenset → [(confidence, lift, consequent_name)]
rule_lookup = defaultdict(list)
for _, row in rules.iterrows():
ant = frozenset(row['antecedents'])
con = list(row['consequents'])[0]
rule_lookup[ant].append((row['confidence'], row['lift'], con))
# single item → [(conf, lift, consequent, full_antecedent)]
rule_single = defaultdict(list)
for ant, recs in rule_lookup.items():
for item in ant:
for conf, lift, con in recs:
rule_single[item].append((conf, lift, con, ant))
# Helper lookup dicts
pid_to_name = dict(zip(products['product_id'], products['product']))
name_to_pid = dict(zip(products['product'], products['product_id']))
pid_to_dept = dict(zip(products['product_id'], products['department']))
print(f'Rule lookup entries: {len(rule_lookup)}')
print(f'Single-item lookup: {len(rule_single)} distinct items have rules')
Rule lookup entries: 12
Single-item lookup: 12 distinct items have rules
8. Next-Item Prediction Strategies
Five strategies recommend products as the user builds their cart, each progressively richer in how it uses purchase history and the sequence of items already added:
| # | Strategy | Cart-Order Signal | ML? | Cold-start fallback |
|---|---|---|---|---|
| S1 | Global Popularity | – | No | – (is the fallback) |
| S2 | Personal History | – | No | Falls back to S1 |
| S3 | Association Rules | Cart contents | No | Falls back to S2 |
| S4 | Sequential Markov Chain | Transition + position probs | No | Falls back to S2 |
| S5 | LightGBM (cart-aware) | Markov + position features | ✓ | Falls back to S1 |
All strategies share the same interface:
fn(user_id, cart_sequence, K=10) → [product_id, ...]
where cart_sequence is the ordered list of product_ids added to cart so far. This uniform signature makes it trivial to swap strategies in the evaluation harness and the final demo.
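A minimal sketch of what this shared interface looks like (the strategy body and its candidate pool here are hypothetical placeholders, not one of the five real strategies):

```python
# Hypothetical stub illustrating the shared strategy signature: any callable
# of this shape can be dropped into the evaluation harness unchanged.
def dummy_strategy(user_id, cart_sequence, K=10, **_):
    """Recommend K placeholder product_ids not already in the cart."""
    in_cart = set(cart_sequence)
    pool = range(1, 1000)                       # stand-in candidate pool
    return [p for p in pool if p not in in_cart][:K]

print(dummy_strategy(user_id=42, cart_sequence=[1, 2, 3], K=5))  # -> [4, 5, 6, 7, 8]
```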
# Develop the ground truth for model scoring
train_truth = (train_op.groupby('user_id')['product_id'].apply(set).reset_index().rename(columns = {'product_id': 'true_items'}))
# Simulate the "current shopping session" using the LAST prior order per user.
last_prior_oids = (orders_slim[orders_slim['eval_set'] == 'prior'].sort_values(['user_id', 'order_number']).groupby('user_id')['order_id'].last().reset_index().rename(columns = {'order_id': 'last_oid'}))
last_prior_seq = (prior_op[prior_op['order_id'].isin(last_prior_oids['last_oid'])].sort_values(['order_id', 'add_to_cart_order']).groupby('order_id')['product_id'].apply(list).reset_index().rename(columns = {'order_id': 'last_oid', 'product_id': 'cart_sequence'}))
eval_df = (train_truth.merge(last_prior_oids, on = 'user_id').merge(last_prior_seq, on = 'last_oid', how = 'inner'))
# Reproducible subsample for fast evaluation across all strategies
eval_sample = eval_df.sample(n = min(3000, len(eval_df)), random_state = 100).reset_index(drop = True)
print(f"Evaluation set : {len(eval_df):,} users")
print(f"Eval sample : {len(eval_sample):,} users")
print(f"Avg true basket : {eval_sample['true_items'].apply(len).mean():.1f} items")
print(f"Avg cart length : {eval_sample['cart_sequence'].apply(len).mean():.1f} items")
eval_sample[['user_id','cart_sequence','true_items']].head(3)
Evaluation set : 95,361 users
Eval sample : 3,000 users
Avg true basket : 10.5 items
Avg cart length : 10.6 items
| user_id | cart_sequence | true_items | |
|---|---|---|---|
| 0 | 15188 | [34915, 47766, 43183, 35406, 49054, 39993, 457... | {46656, 34915, 46724, 264, 41290, 35406, 5296,... |
| 1 | 23617 | [41177] | {42585, 18740} |
| 2 | 13611 | [38895, 24852, 9076, 16617, 33294, 45007, 2784... | {17794, 44359, 2380, 40174, 38895, 1359, 47601... |
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 1 – Global Popularity Baseline
# ─────────────────────────────────────────────────────────────────────────────
# Recommend the most-ordered products globally. No personalisation, no cart
# context – pure frequency signal. Serves as the floor baseline.
POPULARITY_POOL = (product_stats.sort_values('order_count', ascending=False)['product_id'].head(300).tolist())
def s1_popularity(user_id, cart_sequence, K = 10, **_):
"""Top-K globally popular products not already in the cart."""
in_cart = set(cart_sequence)
return [p for p in POPULARITY_POOL if p not in in_cart][:K]
print("✅ S1 (Global Popularity) ready")
print(f" Pool: top {len(POPULARITY_POOL)} products")
print(f" Sample recs: {[pid_to_name.get(p,'?')[:25] for p in s1_popularity(0, [], K=5)]}")
✅ S1 (Global Popularity) ready
Pool: top 300 products
Sample recs: ['banana', 'bag of organic bananas', 'organic strawberries', 'organic baby spinach', 'organic hass avocado']
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 2 – Personal History
# ─────────────────────────────────────────────────────────────────────────────
# Rank the user's previously purchased products by reorder_score
# (frequency Γ reorder-rate / recency gap), then return the top-K
# not yet in the current cart.
user_history = (up_stats.sort_values(['user_id', 'reorder_score'], ascending=[True, False]).groupby('user_id')['product_id'].apply(list).to_dict())
def s2_personal(user_id, cart_sequence, K = 10, **_):
"""Top-K products from user's personal purchase history by reorder_score."""
in_cart = set(cart_sequence)
return [p for p in user_history.get(user_id, []) if p not in in_cart][:K]
coverage = sum(1 for uid in eval_sample['user_id'] if uid in user_history)
print("✅ S2 (Personal History) ready")
print(f" History coverage : {coverage}/{len(eval_sample)} ({coverage/len(eval_sample):.1%})")
_u0 = eval_sample['user_id'].iloc[0]
_c0 = eval_sample['cart_sequence'].iloc[0]
print(f" Sample recs (user {_u0}): {[pid_to_name.get(p,'?')[:22] for p in s2_personal(_u0,_c0,K=3)]}")
✅ S2 (Personal History) ready
History coverage : 3000/3000 (100.0%)
Sample recs (user 15188): ['sweet red grape tomato', 'banana', 'whole wheat bread']
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 3 – Association Rules (cart-content-aware)
# ─────────────────────────────────────────────────────────────────────────────
# Every item currently in the cart fires its association rules.
# Consequents are scored by the SUM of (confidence × lift) across all
# triggering cart items, then ranked. Falls back to personal history
# when the rule set yields fewer than K candidates.
# Build a fast name-keyed lookup: item_name β {consequent_name: agg_score}
assoc_lookup = defaultdict(lambda: defaultdict(float))
for ant_name, recs in rule_single.items():
for conf, lift, con_name, _ in recs:
assoc_lookup[ant_name][con_name] += conf * lift
def s3_rules(user_id, cart_sequence, K=10, **_):
"""Score candidates via association rules fired by the current cart."""
in_cart_ids = set(cart_sequence)
in_cart_names = {pid_to_name.get(p, '') for p in cart_sequence}
scores = defaultdict(float)
for name in in_cart_names:
for con_name, score in assoc_lookup.get(name, {}).items():
pid = name_to_pid.get(con_name)
if pid and pid not in in_cart_ids:
scores[pid] += score
ranked = sorted(scores, key=scores.get, reverse=True)
if len(ranked) >= K:
return ranked[:K]
# Backfill with personal history when rules are sparse
extra = [p for p in s2_personal(user_id, cart_sequence, K * 3) if p not in set(ranked)]
return (ranked + extra)[:K]
print("✅ S3 (Association Rules) ready")
print(f" Assoc lookup items : {len(assoc_lookup)}")
print(f" Sample recs : {[pid_to_name.get(p,'?')[:22] for p in s3_rules(_u0,_c0,K = 3)]}")
✅ S3 (Association Rules) ready
Assoc lookup items : 12
Sample recs : ['banana', 'organic baby spinach', 'sweet red grape tomato']
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 4 – Sequential Markov Chain (O(N) single-pass, no loops)
# ─────────────────────────────────────────────────────────────────────────────
MAX_POS = 20
MAX_SUCCESSORS = 100
MAX_POS_ITEMS = 200
MAX_USER_HIST = 50
print("Building Markov transition matrix…")
t0 = time.time()
# ── Step 1: extract + sort ──────────────────────────────────────────────────
arr = prior_op[['order_id','product_id','add_to_cart_order']].to_numpy(dtype = 'int32')
arr = arr[np.lexsort((arr[:,2], arr[:,0]))]
oids = arr[:,0]; pids = arr[:,1]; pos0 = arr[:,2] - 1
del arr; gc.collect()
# ── Step 2: transition pairs ────────────────────────────────────────────────
same = oids[:-1] == oids[1:]
prev_p = pids[:-1][same].astype('int32')
next_p = pids[1: ][same].astype('int32')
del same; gc.collect()
# Sort pairs so identical (prev,next) rows are adjacent → count with diff
order = np.lexsort((next_p, prev_p))
prev_s = prev_p[order]; next_s = next_p[order]
del prev_p, next_p, order; gc.collect()
# Mark run starts once in O(N), then reuse the mask for both the run counts
# and the unique (prev,next) pairs – no per-node mask scan
keep = np.concatenate([[True], (prev_s[1:] != prev_s[:-1]) | (next_s[1:] != next_s[:-1])])
starts = np.flatnonzero(keep)
# Run lengths of identical (prev,next) pairs
pair_counts = np.diff(np.concatenate([starts, [len(prev_s)]]))
# Unique (prev,next) pairs
u_prev = prev_s[keep]; u_next = next_s[keep]
del prev_s, next_s, keep, starts; gc.collect()
# Group boundaries on the unique pairs
grp_bounds = np.flatnonzero(np.diff(u_prev)) + 1
grp_bounds = np.concatenate([[0], grp_bounds, [len(u_prev)]])
trans_prob = {}
for i in range(len(grp_bounds) - 1):
lo, hi = grp_bounds[i], grp_bounds[i+1]
node = int(u_prev[lo])
c = pair_counts[lo:hi]; t = u_next[lo:hi]
if len(c) > MAX_SUCCESSORS:
top = np.argpartition(c, -MAX_SUCCESSORS)[-MAX_SUCCESSORS:]
c, t = c[top], t[top]
total = c.sum()
trans_prob[node] = {int(ti): float(ci/total) for ti, ci in zip(t, c)}
del u_prev, u_next, pair_counts, grp_bounds; gc.collect()
print(f" trans_prob done ({len(trans_prob):,} nodes) {time.time()-t0:.1f}s")
# ── Step 3: position-popularity ─────────────────────────────────────────────
mask = pos0 < MAX_POS
p_pos = pos0[mask].astype('int32'); p_pid = pids[mask].astype('int32')
del oids, pids, pos0, mask; gc.collect()
order2 = np.lexsort((p_pid, p_pos))
pp_s = p_pos[order2]; pid_s = p_pid[order2]
del p_pos, p_pid, order2; gc.collect()
keep2 = np.concatenate([[True], (pp_s[1:] != pp_s[:-1]) | (pid_s[1:] != pid_s[:-1])])
pc = np.diff(np.concatenate([np.flatnonzero(keep2), [len(pp_s)]]))
u_pos = pp_s[keep2]; u_pid = pid_s[keep2]
del pp_s, pid_s, keep2; gc.collect()
grp2 = np.flatnonzero(np.diff(u_pos)) + 1
grp2 = np.concatenate([[0], grp2, [len(u_pos)]])
pos_prob = {}
for i in range(len(grp2) - 1):
lo, hi = grp2[i], grp2[i+1]
pos = int(u_pos[lo])
c = pc[lo:hi]; t = u_pid[lo:hi]
if len(c) > MAX_POS_ITEMS:
top = np.argpartition(c, -MAX_POS_ITEMS)[-MAX_POS_ITEMS:]
c, t = c[top], t[top]
total = c.sum()
pos_prob[pos] = {int(ti): float(ci/total) for ti, ci in zip(t, c)}
del u_pos, u_pid, pc, grp2; gc.collect()
# ── Step 4: per-user reorder-score list (capped) ────────────────────────────
user_rs_list = (up_stats
.sort_values('reorder_score', ascending=False)
.groupby('user_id').head(MAX_USER_HIST)
.groupby('user_id')
.apply(lambda g: list(zip(g['product_id'].tolist(),
g['reorder_score'].tolist())))
.to_dict())
print(f"✅ Markov chain built in {time.time()-t0:.1f}s")
print(f" Source nodes : {len(trans_prob):,} products")
print(f" Avg successors per node : {np.mean([len(v) for v in trans_prob.values()]):.1f}")
print(f" Positions tracked : 0 → {MAX_POS-1}")
Building Markov transition matrix…
trans_prob done (49,411 nodes) 24.8s
✅ Markov chain built in 67.9s
Source nodes : 49,411 products
Avg successors per node : 48.3
Positions tracked : 0 → 19
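The sort-then-diff counting trick used in Step 2 is easier to see on a toy example (`prev_p`/`next_p` below are made-up product ids, not Instacart data): sort the (prev, next) pairs, mark where each run of identical pairs starts, and take diffs of the start indices to get counts without any Python loop over nodes.

```python
import numpy as np

# Toy (prev, next) transition pairs: 10->11 occurs 3x, 10->12 once, 20->21 twice
prev_p = np.array([10, 10, 10, 20, 20, 10])
next_p = np.array([11, 12, 11, 21, 21, 11])

# Sort so identical pairs become adjacent (primary key prev, secondary next)
order = np.lexsort((next_p, prev_p))
prev_s, next_s = prev_p[order], next_p[order]

# True wherever a new run of identical (prev, next) pairs begins
starts = np.concatenate([[True], (prev_s[1:] != prev_s[:-1]) | (next_s[1:] != next_s[:-1])])
# Run lengths = gaps between consecutive start indices
counts = np.diff(np.concatenate([np.flatnonzero(starts), [len(prev_s)]]))

pairs = [(int(a), int(b), int(c)) for a, b, c in zip(prev_s[starts], next_s[starts], counts)]
print(pairs)   # -> [(10, 11, 3), (10, 12, 1), (20, 21, 2)]
```

Dividing each count by its group total then yields the per-node transition probabilities stored in `trans_prob`.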
# Strategy 4 – Sequential Markov Chain inference function
def s4_markov(user_id, cart_sequence, K = 10, n_gram = 2, alpha = 0.4, beta = 0.25, **_):
"""
Sequential Markov Chain recommender.
    Score(X) = Σ P(next=X | prev ∈ cart[-n_gram:])           [Markov]
             + α · P(X appears at cart position len(cart))   [position]
             + β · reorder_score(user, X)                    [personal]
Parameters
----------
cart_sequence : list of product_ids in add-to-cart order
n_gram : how many trailing items to use as context (default 2)
alpha : weight for the position-popularity signal
beta : weight for the user's personal reorder score
"""
in_cart = set(cart_sequence)
next_pos = len(cart_sequence)
context = cart_sequence[-n_gram:] if len(cart_sequence) >= n_gram else cart_sequence
scores = defaultdict(float)
# (A) Markov transition scores from the last n_gram cart items
for prev in context:
for nxt, prob in trans_prob.get(prev, {}).items():
if nxt not in in_cart:
scores[nxt] += prob
# (B) Position-specific popularity at the *next* cart slot
capped_pos = min(next_pos, MAX_POS - 1)
for pid, prob in pos_prob.get(capped_pos, {}).items():
if pid not in in_cart:
scores[pid] += alpha * prob
# (C) Blend personal purchase affinity
for pid, rs in user_rs_list.get(user_id, []):
if pid not in in_cart and rs > 0:
scores[pid] += beta * float(rs)
ranked = sorted(scores, key=scores.get, reverse=True)
if len(ranked) >= K:
return ranked[:K]
extra = [p for p in s2_personal(user_id, cart_sequence, K * 2) if p not in set(ranked)]
return (ranked + extra)[:K]
# Apply function
print("✅ S4 (Sequential Markov Chain) ready")
recs4 = s4_markov(_u0, _c0, K=3)
print(f" Sample recs : {[pid_to_name.get(p,'?')[:22] for p in recs4]}")
print()
print("Position signal demo – top 3 products per cart slot:")
print(f"{'Pos':>4} {'#1 product':<30} {'#2 product':<30} {'#3 product':<30}")
print("─" * 97)
for pos in range(6):
slot = pos_prob.get(pos, {})
top3 = sorted(slot, key=slot.get, reverse=True)[:3]
cols = [f"{pid_to_name.get(p,'?')[:26]} ({slot[p]:.2%})" for p in top3]
while len(cols) < 3:
cols.append("β")
print(f"{pos:>4} {cols[0]:<30} {cols[1]:<30} {cols[2]:<30}")
✅ S4 (Sequential Markov Chain) ready
Sample recs : ['sweet red grape tomato', 'banana', 'whole wheat bread']
Position signal demo – top 3 products per cart slot:
Pos #1 product #2 product #3 product
─────────────────────────────────────────────────────────────────────────────────────────────────
0 banana (9.46%) bag of organic bananas (6.78%) organic whole milk (2.59%)
1 banana (7.48%) bag of organic bananas (5.99%) organic strawberries (2.78%)
2 banana (6.07%) bag of organic bananas (5.05%) organic strawberries (2.89%)
3 banana (5.04%) bag of organic bananas (4.25%) organic strawberries (2.91%)
4 banana (4.33%) bag of organic bananas (3.71%) organic strawberries (2.87%)
5 banana (3.95%) bag of organic bananas (3.29%) organic strawberries (2.77%)
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 5 – LightGBM feature matrix (chunked, bounded candidate set)
# ─────────────────────────────────────────────────────────────────────────────
# Why previous versions OOM'd:
#   • Full candidate matrix (8000 users × 300+ candidates) = 2.4M rows
#     before a single feature column is added – too large to hold in RAM
#     alongside prior_op, trans_prob, etc.
#
# Fix – two constraints that bound memory absolutely:
#   1. CHUNK_SIZE users processed at a time; each chunk is del'd before next
#   2. Candidate pool capped: top MAX_PERSONAL history + top MAX_GLOBAL pool
#      → at most (MAX_PERSONAL + MAX_GLOBAL) rows per user, guaranteed
ML_N_USERS   = 3000   # reduce if still tight; 3k is plenty for LightGBM
CHUNK_SIZE   = 200    # users per chunk – peak RAM ≈ 200 × 150 × ~20 cols
MAX_PERSONAL = 80     # top items from user history by reorder_score
MAX_GLOBAL   = 80     # top globally popular items
# → max ~160 candidates per user, ~32k rows per chunk
ml_train_users = eval_df['user_id'].sample(n=ML_N_USERS, random_state=42).tolist()
ml_train_df = (eval_df[eval_df['user_id'].isin(ml_train_users)]
.reset_index(drop=True))
# Static lookups built once, outside the loop
prod_feat = product_stats[['product_id','order_count','reorder_rate',
                           'avg_cart_position','global_popularity']].copy()  # copy: avoid SettingWithCopyWarning
prod_feat['pop_rank'] = prod_feat['order_count'].rank(pct = True).astype('float32')
user_feat = user_stats[['user_id','total_orders','avg_cart_size',
                        'reorder_rate','avg_days_between','variety_ratio']]
up_feat = up_stats[['user_id','product_id','times_ordered','reorder_score',
                    'orders_since_last','avg_cart_position','reorder_rate']].copy()
up_feat.rename(columns={'avg_cart_position':'up_avg_cart_pos',
                        'reorder_rate':'up_reorder_rate'}, inplace=True)
# Cap global pool
GLOBAL_POOL = POPULARITY_POOL[:MAX_GLOBAL]
# Pre-convert assoc_lookup to pid-keyed dict once (avoids repeated name lookups)
assoc_by_pid = defaultdict(dict)
for ant_name, cons in assoc_lookup.items():
ant_pid = name_to_pid.get(ant_name)
if not ant_pid:
continue
for con_name, score in cons.items():
con_pid = name_to_pid.get(con_name)
if con_pid:
assoc_by_pid[ant_pid][con_pid] = (assoc_by_pid[ant_pid].get(con_pid, 0.0) + score)
print(f"Users: {ML_N_USERS} | chunk: {CHUNK_SIZE} | "
f"max candidates/user: {MAX_PERSONAL + MAX_GLOBAL}")
print(f"Expected rows: ~{ML_N_USERS * (MAX_PERSONAL + MAX_GLOBAL) // 2:,}")
t0 = time.time()
chunks = []
for chunk_start in range(0, len(ml_train_df), CHUNK_SIZE):
chunk = ml_train_df.iloc[chunk_start : chunk_start + CHUNK_SIZE]
rows = []
for _, row in chunk.iterrows():
uid = int(row['user_id'])
true_set = row['true_items']
cart_seq = row['cart_sequence']
in_cart = set(cart_seq)
cart_sz = len(cart_seq)
# Bounded candidate set
personal = [p for p in user_history.get(uid, [])[:MAX_PERSONAL]
if p not in in_cart]
globl = [p for p in GLOBAL_POOL if p not in in_cart and p not in set(personal)]
candidates = personal + globl
if not candidates:
continue
# Cart-context scores β computed once per user, not per candidate
markov_sc = defaultdict(float)
for prev in cart_seq:
for nxt, p in trans_prob.get(prev, {}).items():
markov_sc[nxt] += p
assoc_sc = defaultdict(float)
for ant_pid in cart_seq:
for con_pid, sc in assoc_by_pid.get(ant_pid, {}).items():
assoc_sc[con_pid] += sc
capped_pos = min(cart_sz, MAX_POS - 1)
pos_p = pos_prob.get(capped_pos, {})
for pid in candidates:
rows.append((uid, pid,
int(pid in true_set),
cart_sz,
markov_sc.get(pid, 0.0),
assoc_sc.get(pid, 0.0),
pos_p.get(pid, 0.0)))
if not rows:
continue
c_df = pd.DataFrame(rows, columns=['user_id','product_id','label',
'cart_size','markov_score',
'assoc_score','pos_prob_next'])
c_df = (c_df
.merge(prod_feat, on='product_id', how='left')
.merge(user_feat, on='user_id', how='left')
.merge(up_feat, on=['user_id','product_id'], how='left')
.fillna(0))
chunks.append(c_df)
del c_df, rows
gc.collect()
if (chunk_start // CHUNK_SIZE) % 5 == 0:
done = min(chunk_start + CHUNK_SIZE, len(ml_train_df))
        print(f" {done}/{len(ml_train_df)} users …", end="")
cand_df = pd.concat(chunks, ignore_index=True)
del chunks
gc.collect()
pos_rate = cand_df['label'].mean()
print(f"\nDone in {time.time()-t0:.1f}s | rows: {len(cand_df):,} "
f"| positive rate: {pos_rate:.3%}")
cand_df[['user_id','product_id','label','markov_score','assoc_score',
'pos_prob_next','cart_size','reorder_score']].head(5)
Users: 3000 | chunk: 200 | max candidates/user: 160
Expected rows: ~240,000
200/3000 users … 1200/3000 users … 2200/3000 users …
Done in 80.3s | rows: 333,300 | positive rate: 3.461%
| user_id | product_id | label | markov_score | assoc_score | pos_prob_next | cart_size | reorder_score | |
|---|---|---|---|---|---|---|---|---|
| 0 | 144 | 42342 | 0 | 0.0394 | 0.0000 | 0.0000 | 6 | 0.3333 |
| 1 | 144 | 13603 | 0 | 0.0000 | 0.0000 | 0.0000 | 6 | 0.1667 |
| 2 | 144 | 22963 | 0 | 0.0752 | 0.0000 | 0.0030 | 6 | 0.0000 |
| 3 | 144 | 29871 | 0 | 0.0233 | 0.0000 | 0.0000 | 6 | 0.0000 |
| 4 | 144 | 32887 | 0 | 0.0000 | 0.0000 | 0.0000 | 6 | 0.0000 |
# Define feature columns and train LightGBM with group-aware train/val split
FEATURE_COLS = [
    'cart_size', 'markov_score', 'assoc_score', 'pos_prob_next',
    'order_count', 'reorder_rate', 'avg_cart_position', 'global_popularity',
    'pop_rank', 'total_orders', 'avg_cart_size', 'reorder_rate_user',
    'avg_days_between', 'variety_ratio', 'times_ordered', 'reorder_score',
    'orders_since_last', 'up_avg_cart_pos', 'up_reorder_rate',
]
# Some column names duplicated (reorder_rate from product + user) – rename
# defensively. Note pandas merge already suffixes clashing columns with _x/_y,
# in which case this loop is a no-op and the existence filter below applies.
cols = list(cand_df.columns)
seen = {}; new_cols = []
for c in cols:
if c in seen:
seen[c] += 1
new_cols.append(f"{c}_user" if c == "reorder_rate" else f"{c}_{seen[c]}")
else:
seen[c] = 0; new_cols.append(c)
cand_df.columns = new_cols
FEATURE_COLS = [c for c in FEATURE_COLS if c in cand_df.columns]
X = cand_df[FEATURE_COLS].values
y = cand_df['label'].values
groups = cand_df['user_id'].values
# Group-aware train/val split (no user leaks between folds)
gss = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
tr_idx, va_idx = next(gss.split(X, y, groups))
X_tr, y_tr = X[tr_idx], y[tr_idx]
X_va, y_va = X[va_idx], y[va_idx]
# Weight positives so the effective class ratio is ~4 negatives : 1 positive
pos_weight = max(1, int((y_tr == 0).sum() / max((y_tr == 1).sum(), 1) / 4))
lgb_model = lgb.LGBMClassifier(
n_estimators = 400,
learning_rate = 0.05,
num_leaves = 63,
max_depth = -1,
min_child_samples = 20,
subsample = 0.8,
colsample_bytree= 0.8,
scale_pos_weight= pos_weight,
n_jobs = -1,
random_state = 42,
verbose = -1,
)
print("Training LightGBM …")
t0 = time.time()
lgb_model.fit(X_tr, y_tr,
eval_set = [(X_va, y_va)],
callbacks = [lgb.early_stopping(30, verbose=False),
lgb.log_evaluation(50)])
print(f"Done in {time.time()-t0:.1f}s | best iteration: {lgb_model.best_iteration_}")
va_pred = lgb_model.predict_proba(X_va)[:, 1]
print(f"\nVal ROC-AUC : {roc_auc_score(y_va, va_pred):.4f}")
print(f"Val Avg-Prec : {average_precision_score(y_va, va_pred):.4f}")
# Feature importance
fi = pd.Series(lgb_model.feature_importances_, index=FEATURE_COLS).sort_values(ascending=False)
print("\nTop-10 feature importances:")
print(fi.head(10).to_string())
Training LightGBM …
Done in 1.9s | best iteration: 4
Val ROC-AUC : 0.8648
Val Avg-Prec : 0.2269
Top-10 feature importances:
orders_since_last 32
avg_cart_size 32
markov_score 26
variety_ratio 24
reorder_score 22
order_count 22
total_orders 21
cart_size 19
avg_cart_position 17
avg_days_between 12
# Strategy 5 – LightGBM inference function (mirrors training feature pipeline)
def s5_lgbm(user_id, cart_sequence, K=10, **_):
    """LightGBM recommender – mirrors the training feature pipeline exactly."""
in_cart = set(cart_sequence)
cart_sz = len(cart_sequence)
# Bounded candidates β same caps used at training time
personal = [p for p in user_history.get(user_id, [])[:MAX_PERSONAL] if p not in in_cart]
globl = [p for p in GLOBAL_POOL if p not in in_cart and p not in set(personal)]
candidates = personal + globl
if not candidates:
return s1_popularity(user_id, cart_sequence, K)
# Cart-context scores
markov_sc = defaultdict(float)
for prev in cart_sequence:
for nxt, p in trans_prob.get(prev, {}).items():
markov_sc[nxt] += p
assoc_sc = defaultdict(float) # use pid-keyed dict (matches training)
for ant_pid in cart_sequence:
for con_pid, sc in assoc_by_pid.get(ant_pid, {}).items():
assoc_sc[con_pid] += sc
pos_p = pos_prob.get(min(cart_sz, MAX_POS - 1), {})
rows = [{"user_id": user_id, "product_id": int(pid),
"cart_size": cart_sz,
"markov_score": float(markov_sc.get(pid, 0.0)),
"assoc_score": float(assoc_sc.get(pid, 0.0)),
"pos_prob_next": float(pos_p.get(pid, 0.0))}
for pid in candidates]
inf_df = (pd.DataFrame(rows)
.merge(prod_feat, on="product_id", how="left")
.merge(user_feat, on="user_id", how="left")
.merge(up_feat, on=["user_id","product_id"], how="left")
.fillna(0))
# Apply the same column rename as training
cols = list(inf_df.columns); seen = {}; new_cols = []
for c in cols:
if c in seen:
seen[c] += 1
new_cols.append(f"{c}_user" if c == "reorder_rate" else f"{c}_{seen[c]}")
else:
seen[c] = 0; new_cols.append(c)
inf_df.columns = new_cols
for fc in FEATURE_COLS: # guarantee column alignment
if fc not in inf_df.columns:
inf_df[fc] = 0.0
scores = lgb_model.predict_proba(inf_df[FEATURE_COLS].values)[:, 1]
inf_df["score"] = scores
return inf_df.nlargest(K, "score")["product_id"].tolist()
print("✅ S5 (LightGBM) ready")
recs5 = s5_lgbm(_u0, _c0, K=3)
print(f" Sample recs : {[pid_to_name.get(p,'?')[:25] for p in recs5]}")
✅ S5 (LightGBM) ready
Sample recs : ['banana', 'sweet red grape tomatoes', 'whole wheat bread']
9. Model Evaluation
Each strategy is scored on the same held-out eval_sample using five retrieval metrics at K = 5 and K = 10. The evaluation simulates a real shopping session: each user's last prior order serves as the current cart, and we check whether the strategy's top-K recommendations intersect with the user's actual train basket.
| Metric | Formula | What it captures |
|---|---|---|
| Precision\@K | \|recs ∩ true\| / K | Of K recommendations, what fraction are relevant |
| Recall\@K | \|recs ∩ true\| / \|true\| | What fraction of the true basket was surfaced |
| F1\@K | 2·P·R / (P+R) | Harmonic mean – balanced view of P and R |
| NDCG\@K | DCG / IDCG | Ranking quality – rewards relevant items appearing earlier in the list |
| Hit Rate\@K | 1 if any hit else 0 | Fraction of users who received ≥ 1 relevant recommendation |
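As a concrete example of the NDCG formula (computed exactly as the scorer defined in the next cell does it):

```python
import numpy as np

# Worked NDCG@3 example: recs = [A, B, C], true basket = {A, C}.
# Hits at ranks 1 and 3  -> DCG  = 1/log2(2) + 1/log2(4) = 1.5
# Ideal puts hits first  -> IDCG = 1/log2(2) + 1/log2(3)
recs, true_set, k = ["A", "B", "C"], {"A", "C"}, 3
dcg  = sum(1 / np.log2(i + 2) for i, p in enumerate(recs[:k]) if p in true_set)
idcg = sum(1 / np.log2(i + 2) for i in range(min(len(true_set), k)))
print(f"DCG={dcg:.4f}  IDCG={idcg:.4f}  NDCG={dcg/idcg:.4f}")  # NDCG ~ 0.9197
```

The irrelevant item at rank 2 costs nothing directly, but pushing the second hit down to rank 3 discounts it from 1/log2(3) to 1/log2(4), which is the entire gap to a perfect score of 1.0.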
# Scoring functions
def precision_at_k(recs, true_set, k):
return len(set(recs[:k]) & true_set) / k if k else 0.0
def recall_at_k(recs, true_set, k):
return len(set(recs[:k]) & true_set) / len(true_set) if true_set else 0.0
def f1_at_k(recs, true_set, k):
p = precision_at_k(recs, true_set, k)
r = recall_at_k(recs, true_set, k)
return 2*p*r/(p+r) if (p+r) > 0 else 0.0
def ndcg_at_k(recs, true_set, k):
dcg = sum(1/np.log2(i+2) for i, p in enumerate(recs[:k]) if p in true_set)
ideal = sum(1/np.log2(i+2) for i in range(min(len(true_set), k)))
return dcg/ideal if ideal else 0.0
def score_recs(recs, true_set, K):
return {
"precision": precision_at_k(recs, true_set, K),
"recall": recall_at_k(recs, true_set, K),
"f1": f1_at_k(recs, true_set, K),
"ndcg": ndcg_at_k(recs, true_set, K),
"hit_rate": 1 if any(p in true_set for p in recs[:K]) else 0,
}
def evaluate_strategy(fn, data, K=10, n=None, seed = 100):
    """Standard per-user evaluator – used for S1–S4."""
rows = data if n is None else data.sample(n=n, random_state=seed)
m = {"precision":[],"recall":[],"f1":[],"ndcg":[],"hit_rate":[]}
for _, row in rows.iterrows():
s = score_recs(fn(row["user_id"], row["cart_sequence"], K=K), row["true_items"], K)
for key in m: m[key].append(s[key])
return {k: float(np.mean(v)) for k, v in m.items()}
def evaluate_lgbm_batched(eval_data, K = 10, n = None, seed = 100, chunk_size=200):
"""
    Batched evaluator for S5 – builds the full feature matrix for
chunk_size users at once, calls predict_proba once per chunk,
then scores. Avoids 3000 individual DataFrame merges.
"""
rows_df = eval_data if n is None else eval_data.sample(n=n, random_state=seed)
rows_df = rows_df.reset_index(drop=True)
m = {"precision":[],"recall":[],"f1":[],"ndcg":[],"hit_rate":[]}
for chunk_start in range(0, len(rows_df), chunk_size):
chunk = rows_df.iloc[chunk_start:chunk_start + chunk_size]
cand_rows = []
for idx, row in chunk.iterrows():
uid = int(row["user_id"])
cart_seq = row["cart_sequence"]
in_cart = set(cart_seq)
cart_sz = len(cart_seq)
personal = [p for p in user_history.get(uid, [])[:MAX_PERSONAL] if p not in in_cart]
globl = [p for p in GLOBAL_POOL if p not in in_cart and p not in set(personal)]
candidates = personal + globl
if not candidates:
candidates = s1_popularity(uid, cart_seq, K)
markov_sc = defaultdict(float)
for prev in cart_seq:
for nxt, p in trans_prob.get(prev, {}).items():
markov_sc[nxt] += p
assoc_sc = defaultdict(float)
for ant_pid in cart_seq:
for con_pid, sc in assoc_by_pid.get(ant_pid, {}).items():
assoc_sc[con_pid] += sc
pos_p = pos_prob.get(min(cart_sz, MAX_POS - 1), {})
for pid in candidates:
cand_rows.append((idx, uid, int(pid), cart_sz,
float(markov_sc.get(pid, 0.0)),
float(assoc_sc.get(pid, 0.0)),
float(pos_p.get(pid, 0.0))))
if not cand_rows:
continue
c_df = pd.DataFrame(cand_rows, columns=[
"idx","user_id","product_id","cart_size",
"markov_score","assoc_score","pos_prob_next"])
c_df = (c_df
.merge(prod_feat, on="product_id", how="left")
.merge(user_feat, on="user_id", how="left")
.merge(up_feat, on=["user_id","product_id"], how="left")
.fillna(0))
# Same column rename as training
cols = list(c_df.columns); seen = {}; new_cols = []
for col in cols:
if col in seen:
seen[col] += 1
new_cols.append(f"{col}_user" if col == "reorder_rate" else f"{col}_{seen[col]}")
else:
seen[col] = 0; new_cols.append(col)
c_df.columns = new_cols
for fc in FEATURE_COLS:
if fc not in c_df.columns: c_df[fc] = 0.0
# One predict_proba call for the whole chunk
c_df["score"] = lgb_model.predict_proba(c_df[FEATURE_COLS].values)[:, 1]
# Score each user from the chunk results
for idx, row in chunk.iterrows():
user_cands = c_df[c_df["idx"] == idx]
if user_cands.empty:
recs = []
else:
recs = user_cands.nlargest(K, "score")["product_id"].tolist()
s = score_recs(recs, row["true_items"], K)
for key in m: m[key].append(s[key])
del c_df, cand_rows; gc.collect()
return {k: float(np.mean(v)) for k, v in m.items()}
# ── Run all five strategies at K=5 and K=10 ─────────────────────────────────
STRATEGY_MAP = {
"S1: Popularity" : s1_popularity,
"S2: Personal History" : s2_personal,
"S3: Assoc Rules" : s3_rules,
"S4: Markov Chain" : s4_markov,
"S5: LightGBM" : s5_lgbm, # inference fn still available for demo use
}
N_EVAL = 1500
all_results = {}
print(f"Evaluating {len(STRATEGY_MAP)} strategies on {N_EVAL} users …\n")
header = f" {'Strategy':<22} {'P@5':>6} {'R@5':>6} {'F1@5':>6} {'NDCG@5':>7} {'P@10':>6} {'R@10':>6} {'F1@10':>6} {'NDCG@10':>8} {'Hit@10':>7} {'s':>4}"
print(header); print("─" * len(header))
for name, fn in STRATEGY_MAP.items():
row_parts = [f" {name:<22}"]
for k_val in [5, 10]:
t0 = time.time()
if name == "S5: LightGBM":
r = evaluate_lgbm_batched(eval_sample, K=k_val, n=N_EVAL)
else:
r = evaluate_strategy(fn, eval_sample, K=k_val, n=N_EVAL)
dt = time.time() - t0
all_results[(name, k_val)] = r
row_parts.append(
f" {r['precision']:6.4f} {r['recall']:6.4f} {r['f1']:6.4f} {r['ndcg']:7.4f}")
row_parts.append(f" {r['hit_rate']:7.4f} {dt:4.1f}s")
print("".join(row_parts))
Evaluating 5 strategies on 1500 users …
Strategy P@5 R@5 F1@5 NDCG@5 P@10 R@10 F1@10 NDCG@10 Hit@10 s
───────────────────────────────────────────────────────────────────────────────────────────────────
S1: Popularity 0.0544 0.0258 0.0313 0.0603 0.0449 0.0400 0.0380 0.0559 0.3200 0.2s
S2: Personal History 0.2148 0.1245 0.1355 0.2518 0.1627 0.1744 0.1469 0.2306 0.6927 0.2s
S3: Assoc Rules 0.1588 0.0996 0.1051 0.1822 0.1444 0.1619 0.1329 0.1891 0.6760 0.2s
S4: Markov Chain 0.2061 0.1146 0.1275 0.2414 0.1516 0.1588 0.1352 0.2164 0.6593 0.5s
S5: LightGBM 0.2241 0.1313 0.1428 0.2588 0.1720 0.1871 0.1554 0.2402 0.7233 49.1s
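The benchmark above relies on a per-user `score_recs` helper defined in an earlier cell. Its definition is not shown in this section, so here is a minimal, self-contained sketch of how such a function is commonly implemented for the five reported metrics (Precision@K, Recall@K, F1@K, NDCG@K with binary relevance, and Hit Rate@K); the function name and exact rounding are assumptions, not the notebook's verbatim code:

```python
import math

def score_recs_sketch(recs, true_items, K):
    """Score one user's top-K recommendation list against their true next basket."""
    true_set = set(true_items)
    topk = recs[:K]
    hits = [1 if p in true_set else 0 for p in topk]
    n_hit = sum(hits)
    precision = n_hit / K
    recall = n_hit / len(true_set) if true_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    # NDCG@K with binary relevance: discounted gain over the ideal ranking
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(true_set), K)))
    ndcg = dcg / ideal if ideal > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "ndcg": ndcg, "hit_rate": 1.0 if n_hit else 0.0}
```

Averaging these per-user dicts over the 1,500 sampled users yields exactly the shape of table printed above.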
# ── Results table ─────────────────────────────────────────────────────────────
result_rows = []
for (name, k), m in all_results.items():
    result_rows.append({"Strategy": name, "K": k, **m})
results_df = pd.DataFrame(result_rows)
print("=== Full Results Table ===")
display(results_df.sort_values(["K","ndcg"], ascending=[True,False]).reset_index(drop=True))
# ── Grouped bar charts ────────────────────────────────────────────────────────
strategy_names = list(STRATEGY_MAP.keys())
palette = sns.color_palette("husl", len(strategy_names))
metrics_info = [("precision","Precision@K"), ("recall","Recall@K"),
                ("f1","F1@K"), ("ndcg","NDCG@K"), ("hit_rate","Hit Rate@K")]
fig, axes = plt.subplots(2, 3, figsize=(22, 10))
for ax_idx, (metric, label) in enumerate(metrics_info):
    ax = axes[ax_idx//3][ax_idx%3]
    x = np.arange(len(strategy_names)); w = 0.35
    for ki, k_val in enumerate([5, 10]):
        vals = [all_results[(s, k_val)][metric] for s in strategy_names]
        bars = ax.bar(x + ki*w, vals, w, label=f"K={k_val}",
                      color=[palette[i] for i in range(len(strategy_names))],
                      alpha=0.7+ki*0.2, edgecolor="white")
        for bar, v in zip(bars, vals):
            ax.text(bar.get_x()+bar.get_width()/2, v+0.001,
                    f"{v:.3f}", ha="center", va="bottom", fontsize=7, rotation=45)
    ax.set_xticks(x+w/2)
    ax.set_xticklabels([s.split(":")[0] for s in strategy_names], fontsize=9)
    ax.set_title(label, fontsize=12, fontweight="bold")
    ax.legend(fontsize=8); ax.set_ylim(0, ax.get_ylim()[1]*1.2)

# Feature importance panel
ax = axes[1][2]
fi_top = fi.head(10)
bars = ax.barh(fi_top.index[::-1], fi_top.values[::-1],
               color=sns.color_palette("viridis", 10))
ax.set_title("LightGBM Feature Importance (top 10)", fontsize=12, fontweight="bold")
ax.set_xlabel("Importance")
cart_feats = {"markov_score","assoc_score","pos_prob_next","cart_size"}
for bar, feat in zip(bars[::-1], fi_top.index):
    if feat in cart_feats:
        ax.text(bar.get_width()*0.03, bar.get_y()+bar.get_height()/2,
                "🛒 cart-ctx", va="center", fontsize=7, color="navy")
plt.suptitle("Strategy Benchmark – All Metrics @ K=5 and K=10",
             fontsize=15, fontweight="bold", y=1.01)
plt.tight_layout(); plt.show()
# ── Radar chart ───────────────────────────────────────────────────────────────
radar_metrics = ["precision","recall","f1","ndcg","hit_rate"]
radar_labels = ["P@10","R@10","F1@10","NDCG@10","Hit@10"]
N = len(radar_metrics)
angles = np.linspace(0, 2*np.pi, N, endpoint=False).tolist(); angles += angles[:1]
fig_r, ax_r = plt.subplots(figsize=(8,8), subplot_kw=dict(polar=True))
colors = sns.color_palette("husl", len(strategy_names))
for i, sname in enumerate(strategy_names):
    vals = [all_results[(sname, 10)][m] for m in radar_metrics]; vals += vals[:1]
    ax_r.plot(angles, vals, linewidth=2, label=sname, color=colors[i])
    ax_r.fill(angles, vals, alpha=0.08, color=colors[i])
ax_r.set_xticks(angles[:-1]); ax_r.set_xticklabels(radar_labels, fontsize=11)
ax_r.set_title("Strategy Profiles @ K=10", fontsize=14, fontweight="bold", pad=20)
ax_r.legend(loc="upper right", bbox_to_anchor=(1.35, 1.15), fontsize=9)
plt.tight_layout(); plt.show()
=== Full Results Table ===

|   | Strategy | K | precision | recall | f1 | ndcg | hit_rate |
|---|---|---|---|---|---|---|---|
| 0 | S5: LightGBM | 5 | 0.2241 | 0.1313 | 0.1428 | 0.2588 | 0.6273 |
| 1 | S2: Personal History | 5 | 0.2148 | 0.1245 | 0.1355 | 0.2518 | 0.5980 |
| 2 | S4: Markov Chain | 5 | 0.2061 | 0.1146 | 0.1275 | 0.2414 | 0.5780 |
| 3 | S3: Assoc Rules | 5 | 0.1588 | 0.0996 | 0.1051 | 0.1822 | 0.4920 |
| 4 | S1: Popularity | 5 | 0.0544 | 0.0258 | 0.0313 | 0.0603 | 0.2333 |
| 5 | S5: LightGBM | 10 | 0.1720 | 0.1871 | 0.1554 | 0.2402 | 0.7233 |
| 6 | S2: Personal History | 10 | 0.1627 | 0.1744 | 0.1469 | 0.2306 | 0.6927 |
| 7 | S4: Markov Chain | 10 | 0.1516 | 0.1588 | 0.1352 | 0.2164 | 0.6593 |
| 8 | S3: Assoc Rules | 10 | 0.1444 | 0.1619 | 0.1329 | 0.1891 | 0.6760 |
| 9 | S1: Popularity | 10 | 0.0449 | 0.0400 | 0.0380 | 0.0559 | 0.3200 |
10. Interactive Prediction Demo & Leaderboard
A unified recommend() call wraps all five strategies behind a single interface. Three progressive demos illustrate the system in action:
- Demo A – LightGBM recommendations updating in real time as items are added one by one to the cart
- Demo B – Side-by-side strategy comparison on a single user's full cart, showing which strategy scores the most hits
- Demo C – Custom cart specified by product name strings (e.g., "banana", "organic milk"), with output from Markov, Rules, and LightGBM
The section closes with a composite leaderboard ranking all five strategies by an equal-weighted average of all reported metrics.
# Unified recommend() wrapper – delegates to whichever strategy is selected
def recommend(user_id, cart_sequence, K=10, strategy="lgbm"):
    """
    strategy: "popularity" | "personal" | "rules" | "markov" | "lgbm"
    Returns a DataFrame: rank, product, department, aisle
    """
    fn_map = {"popularity": s1_popularity,
              "personal":   s2_personal,
              "rules":      s3_rules,
              "markov":     s4_markov,
              "lgbm":       s5_lgbm}
    pids = fn_map.get(strategy, s5_lgbm)(user_id, cart_sequence, K=K)
    out = pd.DataFrame({"rank": range(1, len(pids)+1), "product_id": pids})
    return (out.merge(products[["product_id","product","department","aisle"]],
                      on="product_id", how="left")
               [["rank","product","department","aisle"]])
# Demo A – recommendations update as the cart grows
demo_user = eval_sample["user_id"].iloc[5]
demo_cart = eval_sample["cart_sequence"].iloc[5]
demo_true = eval_sample["true_items"].iloc[5]
print(f"User {demo_user} | true next basket ({len(demo_true)} items):")
print([pid_to_name.get(p,"?")[:25] for p in list(demo_true)[:6]], "...\n")
print("="*72)
print(f" {'Cart':>5} {'Last item added':<30} Top-5 recs (LightGBM) [hits ✅]")
print("="*72)
for reveal in [1, 3, 5, len(demo_cart)]:
    partial = demo_cart[:reveal]
    last = pid_to_name.get(partial[-1],"?")[:28]
    recs = s5_lgbm(demo_user, partial, K=5)
    names = [pid_to_name.get(p,"?")[:16] + ("✅" if p in demo_true else "")
             for p in recs]
    print(f" {reveal:>5} {last:<30} {names}")
User 12578 | true next basket (2 items):
['hazelnut spread with coco', 'lobster bisque'] ...

========================================================================
  Cart Last item added                Top-5 recs (LightGBM) [hits ✅]
========================================================================
     1 organic simply naked pita ch   ['non fat greek yo', 'april fresh liqu', 'organic salted b', 'boneless skinles', 'mango peach sals']
     3 mango peach salsa              ['non fat greek yo', 'april fresh liqu', 'organic salted b', 'boneless skinles', 'cereal']
     5 organic baby spinach           ['non fat greek yo', 'april fresh liqu', 'organic salted b', 'boneless skinles', 'bag of organic b']
     6 bag of organic bananas         ['non fat greek yo', 'april fresh liqu', 'organic salted b', 'boneless skinles', 'strawberries']
# Demo B – side-by-side strategy comparison on a single user
K_demo = 10
print(f"User {demo_user} | Cart size: {len(demo_cart)} | True basket: {len(demo_true)} items")
print(f"Cart: {[pid_to_name.get(p,'?')[:18] for p in demo_cart[:4]]}…\n")
comparison = {}
for sname, fn in STRATEGY_MAP.items():
    recs = fn(demo_user, demo_cart, K=K_demo)
    hits = [p for p in recs if p in demo_true]
    comparison[sname] = {"recs": recs, "hits": len(hits),
                         "P": len(hits)/K_demo,
                         "R": len(hits)/len(demo_true) if demo_true else 0}
comp_df = pd.DataFrame({
    sname: [pid_to_name.get(p,"?")[:28] + (" ✅" if p in demo_true else "")
            for p in d["recs"]]
    for sname, d in comparison.items()
})
comp_df.index = [f"Rec {i+1}" for i in range(K_demo)]
display(comp_df)
print("\nP@10 / R@10 per strategy:")
for sname, d in comparison.items():
    print(f" {sname:<22} hits={d['hits']} P@10={d['P']:.2f} R@10={d['R']:.2f}")
User 12578 | Cart size: 6 | True basket: 2 items
Cart: ['organic simply nak', 'wheat thins origin', 'mango peach salsa', 'cereal']…

|  | S1: Popularity | S2: Personal History | S3: Assoc Rules | S4: Markov Chain | S5: LightGBM |
|---|---|---|---|---|---|
| Rec 1 | banana | non fat greek yogurt | organic hass avocado | non fat greek yogurt | non fat greek yogurt |
| Rec 2 | organic strawberries | organic salted butter | organic strawberries | boneless skinless chicken br | april fresh liquid fabric so |
| Rec 3 | organic hass avocado | boneless skinless chicken br | organic avocado | organic salted butter | organic salted butter |
| Rec 4 | organic avocado | april fresh liquid fabric so | banana | organic hass avocado | boneless skinless chicken br |
| Rec 5 | large lemon | brioche hamburger buns | organic raspberries | organic strawberries | strawberries |
| Rec 6 | strawberries | cheez it cheddar cracker | large lemon | april fresh liquid fabric so | organic tortilla chips |
| Rec 7 | limes | naked green machine boosted | non fat greek yogurt | organic avocado | cheez it cheddar cracker |
| Rec 8 | organic raspberries | chicken pot pie | organic salted butter | organic raspberries | brioche hamburger buns |
| Rec 9 | organic whole milk | organic spaghetti | boneless skinless chicken br | banana | naked green machine boosted |
| Rec 10 | organic yellow onion | strawberries | april fresh liquid fabric so | naked green machine boosted | raspberries |
P@10 / R@10 per strategy:
S1: Popularity hits=0 P@10=0.00 R@10=0.00
S2: Personal History hits=0 P@10=0.00 R@10=0.00
S3: Assoc Rules hits=0 P@10=0.00 R@10=0.00
S4: Markov Chain hits=0 P@10=0.00 R@10=0.00
S5: LightGBM hits=0 P@10=0.00 R@10=0.00
# Demo C – custom cart specified by product name strings
print("\n=== Custom cart: banana, strawberry, organic milk ===")
demo_u2 = eval_sample["user_id"].iloc[10]
cart_names = ["banana","strawberry","organic milk"]
cart_ids = []
for name in cart_names:
    match = [pid for pname, pid in name_to_pid.items() if name.lower() in pname.lower()]
    if match:
        cart_ids.append(match[0])
print(f"Resolved: {[pid_to_name.get(p,'?')[:25] for p in cart_ids]}\n")
for strat in ["markov","rules","lgbm"]:
    print(f"── {strat.upper()} ──")
    display(recommend(demo_u2, cart_ids, K=5, strategy=strat))
=== Custom cart: banana, strawberry, organic milk ===
Resolved: ['banana sweet potato organ', 'light strawberry blueberr', 'organic milk']

── MARKOV ──
|  | rank | product | department | aisle |
|---|---|---|---|---|
| 0 | 1 | happy baby spinachmangoand pear baby food | babies | baby food formula |
| 1 | 2 | blueberry oats and quinoa whole grain snack | babies | baby food formula |
| 2 | 3 | organic banana blueberry baby food puree | babies | baby food formula |
| 3 | 4 | alpine spring water | beverages | water seltzer sparkling water |
| 4 | 5 | berry barley organic baby food | babies | baby food formula |

── RULES ──
|  | rank | product | department | aisle |
|---|---|---|---|---|
| 0 | 1 | happy baby spinachmangoand pear baby food | babies | baby food formula |
| 1 | 2 | blueberry oats and quinoa whole grain snack | babies | baby food formula |
| 2 | 3 | organic banana blueberry baby food puree | babies | baby food formula |
| 3 | 4 | alpine spring water | beverages | water seltzer sparkling water |
| 4 | 5 | berry barley organic baby food | babies | baby food formula |

── LGBM ──
|  | rank | product | department | aisle |
|---|---|---|---|---|
| 0 | 1 | banana | produce | fresh fruits |
| 1 | 2 | organic pearraspberry purple carrot stage pouch | babies | baby food formula |
| 2 | 3 | organic stage broccoli pears peas baby food | babies | baby food formula |
| 3 | 4 | stage spinachapple kale | babies | baby food formula |
| 4 | 5 | peter rabbit organics kale broccoli and mango ... | babies | baby food formula |
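Worth noting: the naive substring match in Demo C resolved "banana" to a baby-food product ("banana sweet potato organ…"), which is why every strategy leans toward the babies department. A stricter resolver would prefer an exact name match and otherwise fall back to the shortest containing name, which tends to be the most generic product. The helper below is a hypothetical sketch (not part of the notebook's code); it assumes a `name_to_pid` dict mapping product name to product_id, as used above:

```python
def resolve_product(query, name_to_pid):
    """Resolve a free-text product query to a product_id.

    Preference order: exact (case-insensitive) name match, then the
    shortest product name containing the query; None if nothing matches."""
    q = query.lower()
    matches = [(pname, pid) for pname, pid in name_to_pid.items()
               if q in pname.lower()]
    if not matches:
        return None
    exact = [pid for pname, pid in matches if pname.lower() == q]
    if exact:
        return exact[0]
    # shortest containing name is usually the base product, not a variant
    return min(matches, key=lambda t: len(t[0]))[1]
```

Swapping this in for the list comprehension in Demo C would make the custom-cart demo resolve "banana" to the plain banana whenever it exists in the catalog.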
# Construct final composite leaderboard across all strategies
lb_rows = []
for sname in strategy_names:
    r5  = all_results[(sname, 5)]
    r10 = all_results[(sname, 10)]
    composite = np.mean([r5["precision"], r5["recall"], r5["f1"], r5["ndcg"],
                         r10["precision"], r10["recall"], r10["f1"], r10["ndcg"],
                         r10["hit_rate"]])
    lb_rows.append({"Strategy": sname,
                    "P@5":  f"{r5['precision']:.4f}",  "R@5":  f"{r5['recall']:.4f}",
                    "F1@5": f"{r5['f1']:.4f}",         "NDCG@5": f"{r5['ndcg']:.4f}",
                    "P@10": f"{r10['precision']:.4f}", "R@10": f"{r10['recall']:.4f}",
                    "F1@10": f"{r10['f1']:.4f}",       "NDCG@10": f"{r10['ndcg']:.4f}",
                    "Hit@10": f"{r10['hit_rate']:.4f}",
                    "Composite ↑": round(composite, 4)})
lb = (pd.DataFrame(lb_rows)
        .sort_values("Composite ↑", ascending=False)
        .reset_index(drop=True))
lb.index += 1
print("\n🏆 Strategy Leaderboard")
display(lb)
print(f"\n🥇 Winner: {lb.iloc[0]['Strategy']}")
print("\nKey takeaways:")
print("  • Cart-sequence signals (Markov + position) lift recs beyond popularity baselines.")
print("  • LightGBM combines all signal types; typically best NDCG (rewards correct ranking).")
print("  • Association Rules add diversity; Markov is fastest at inference.")
print("  • Cold-start users (no history): fall back to S1 + S3.")
🏆 Strategy Leaderboard

|  | Strategy | P@5 | R@5 | F1@5 | NDCG@5 | P@10 | R@10 | F1@10 | NDCG@10 | Hit@10 | Composite ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S5: LightGBM | 0.2241 | 0.1313 | 0.1428 | 0.2588 | 0.1720 | 0.1871 | 0.1554 | 0.2402 | 0.7233 | 0.2483 |
| 2 | S2: Personal History | 0.2148 | 0.1245 | 0.1355 | 0.2518 | 0.1627 | 0.1744 | 0.1469 | 0.2306 | 0.6927 | 0.2371 |
| 3 | S4: Markov Chain | 0.2061 | 0.1146 | 0.1275 | 0.2414 | 0.1516 | 0.1588 | 0.1352 | 0.2164 | 0.6593 | 0.2235 |
| 4 | S3: Assoc Rules | 0.1588 | 0.0996 | 0.1051 | 0.1822 | 0.1444 | 0.1619 | 0.1329 | 0.1891 | 0.6760 | 0.2056 |
| 5 | S1: Popularity | 0.0544 | 0.0258 | 0.0313 | 0.0603 | 0.0449 | 0.0400 | 0.0380 | 0.0559 | 0.3200 | 0.0745 |

🥇 Winner: S5: LightGBM

Key takeaways:
  • Cart-sequence signals (Markov + position) lift recs beyond popularity baselines.
  • LightGBM combines all signal types; typically best NDCG (rewards correct ranking).
  • Association Rules add diversity; Markov is fastest at inference.
  • Cold-start users (no history): fall back to S1 + S3.
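The last takeaway suggests routing cold-start users to S1 + S3 but leaves the routing unimplemented. One simple way to wire it up is a wrapper that interleaves the rules-based and popularity lists for users with no prior orders. This is a sketch under stated assumptions: the strategy callables are passed in explicitly (rather than using the notebook's globals), and the popularity/rules interleave is my choice of blend, not the notebook's:

```python
def recommend_with_fallback(user_id, cart_sequence, K, has_history,
                            popular_fn, rules_fn, model_fn):
    """Cold-start routing sketch.

    Users with purchase history go to the full model (e.g. S5: LightGBM);
    users without history get an interleave of association rules (cart-aware,
    S3) and global popularity (S1). All callables share the signature
    fn(user_id, cart_sequence, K) -> list of product_ids."""
    if has_history:
        return model_fn(user_id, cart_sequence, K)
    # Alternate rules and popularity picks, dropping duplicates, until K items
    recs, seen = [], set()
    for a, b in zip(rules_fn(user_id, cart_sequence, K),
                    popular_fn(user_id, cart_sequence, K)):
        for p in (a, b):
            if p not in seen:
                seen.add(p)
                recs.append(p)
    return recs[:K]
```

In the notebook's context this would be called with `has_history = user_id in user_order_counts` (or whichever user-history lookup earlier cells built) and the existing `s1_popularity`, `s3_rules`, `s5_lgbm` functions.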