InstaCart User Data: A Market Basket Analysis & Recommendation Engine

55 minute read

Published:

This post walks through a detailed analysis of Instacart user data, studying shopping trends and building a machine-learning recommendation engine that suggests products to users as they fill their carts.

πŸ›’ Instacart Market Basket Analysis β€” Predictive Deep Dive

This notebook provides an in-depth analysis of the Instacart dataset with a predictive focus.

Data sources: orders.csv Β· order_products__prior.csv Β· order_products__train.csv Β· products.csv Β· aisles.csv Β· departments.csv

Prediction target: Given a user’s full purchase history (prior orders) and optionally their current in-session cart, predict which products they will buy next β€” evaluated against the held-out train orders.


πŸ“‹ Table of Contents

| # | Section | Description |
|---|---------|-------------|
| 1 | Setup & Data Loading | Import libraries, mount Google Drive, load all 6 CSV files from zip |
| 2 | Dataset Overview | Schema inspection, null checks, row counts, eval-set distribution |
| 3 | EDA β€” Orders & Timing | When do customers shop? Day-of-week, hour-of-day, order cadence |
| 4 | EDA β€” Products & Categories | Top products, department breakdown, aisle-level reorder rates |
| 5 | EDA β€” User Behavior | Cart-size distribution, reorder tendencies, user loyalty patterns |
| 6 | Feature Engineering | Product-, user-, and user-product interaction features for ML |
| 7 | Association Rule Mining | Market basket analysis with Apriori; support, confidence, and lift |
| 8 | Next-Item Prediction Strategies | 5 strategies β€” from a simple popularity baseline to LightGBM |
| 9 | Model Evaluation | Precision@K, Recall@K, F1@K, NDCG@K, Hit Rate@K benchmarks |
| 10 | Interactive Demo & Leaderboard | Live recommendation demos and final strategy leaderboard |

1. Setup & Data Loading

All six Instacart CSV files are read directly from a zip archive on Google Drive. Casting columns to compact int8/int16/int32 dtypes immediately after loading keeps RAM manageable for the 30 M+ row dataset.
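The savings from downcasting are easy to verify. A minimal sketch on a toy frame (illustrative values, not the real CSVs) with the same columns and casts used in this notebook:

```python
import pandas as pd
import numpy as np

# Toy frame mimicking order_products__prior; pandas defaults every column to int64
df = pd.DataFrame({
    'order_id':          np.arange(1000, dtype='int64'),
    'product_id':        np.arange(1000, dtype='int64'),
    'add_to_cart_order': np.ones(1000, dtype='int64'),
    'reordered':         np.zeros(1000, dtype='int64'),
})

before = df.memory_usage(index=False, deep=True).sum()   # 4 cols x 8 bytes x 1000 rows
df = df.astype({'order_id': 'int32', 'product_id': 'int32',
                'add_to_cart_order': 'int16', 'reordered': 'int8'})
after = df.memory_usage(index=False, deep=True).sum()    # (4 + 4 + 2 + 1) bytes per row

print(f'{before:,} B -> {after:,} B  ({before / after:.1f}x smaller)')
```

Per row, 32 bytes shrink to 11, which is roughly a 3x reduction before loading the 32 M-row prior table.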

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import re
import zipfile
import os
from collections import defaultdict, Counter
from itertools import combinations
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
import warnings
from google.colab import drive
import gc
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (roc_auc_score, average_precision_score, classification_report)
from sklearn.model_selection import GroupShuffleSplit
import time

warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.4f}'.format)
%matplotlib inline
print('βœ… Libraries loaded')
βœ… Libraries loaded
# Mount drive to access data
drive.mount('/content/drive/')
Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
data = {}

# Read files from zip
with zipfile.ZipFile('/content/drive/MyDrive/instacart/instacart.zip', 'r') as z:

    file_list = [f for f in z.namelist() if f.endswith('.csv')]

    print(f'Files in archive: {file_list}\n')

    for fn in file_list:

        key = fn.replace('.csv', '')

        # Update
        data[key] = pd.read_csv(z.open(fn))
        print(f'  βœ“ {fn}: {data[key].shape[0]:,} rows Γ— {data[key].shape[1]} cols')

print('\nβœ… All datasets loaded')
Files in archive: ['aisles.csv', 'departments.csv', 'order_products__prior.csv', 'order_products__train.csv', 'orders.csv', 'products.csv']

  βœ“ aisles.csv: 134 rows Γ— 2 cols
  βœ“ departments.csv: 21 rows Γ— 2 cols
  βœ“ order_products__prior.csv: 32,434,489 rows Γ— 4 cols
  βœ“ order_products__train.csv: 1,384,617 rows Γ— 4 cols
  βœ“ orders.csv: 3,421,083 rows Γ— 7 cols
  βœ“ products.csv: 49,688 rows Γ— 4 cols

βœ… All datasets loaded

Back to top


2. Dataset Overview

The dataset ships as six files that join on shared keys:

| File | Key columns | Purpose |
|------|-------------|---------|
| orders | order_id, user_id | One row per order; includes timing and eval_set label |
| order_products__prior | order_id, product_id | Every item in every prior order |
| order_products__train | order_id, product_id | Items in the single held-out train order per user |
| products | product_id, aisle_id, department_id | Product names and taxonomy |
| aisles | aisle_id | Aisle labels |
| departments | department_id | Department labels |

The eval_set column in orders partitions every user’s orders into three splits:

  • prior β€” all historical orders used to build user profiles and train models
  • train β€” one order per user that acts as the ground truth for model evaluation
  • test β€” the competition hold-out (no order-product rows provided)
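The three-way split drives every experiment below. A minimal sketch on toy frames (hypothetical order/product IDs): prior rows feed the user profile, while the user's single train order supplies the ground-truth basket:

```python
import pandas as pd

# Toy stand-ins for orders.csv and the order_products files (IDs are made up)
orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'user_id':  [7, 7, 7],
    'eval_set': ['prior', 'prior', 'train'],
})
order_products = pd.DataFrame({
    'order_id':   [1, 1, 2, 3],
    'product_id': [10, 20, 10, 10],
})

# prior rows -> features; train rows -> evaluation labels
history = order_products.merge(orders[orders['eval_set'] == 'prior'], on='order_id')
truth   = order_products.merge(orders[orders['eval_set'] == 'train'], on='order_id')

print(sorted(history['product_id']))  # user 7's purchase history
print(sorted(truth['product_id']))    # contents of the held-out train order
```

Here user 7's history contains products 10 (twice) and 20, and the train order to be predicted contains product 10.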
# Inspect schema
print('=' * 65)
print('DATASET SCHEMA & NULL REPORT')
print('=' * 65)
for name, df in data.items():

    nulls = df.isnull().sum()
    null_info = {c: n for c, n in nulls.items() if n > 0}

    print(f'\nπŸ“‚ {name}')
    print(f'   Rows: {df.shape[0]:,}  |  Cols: {df.shape[1]}')
    print(f'   Columns: {list(df.columns)}')

    if null_info:

        print(f'   ⚠️  Nulls: {null_info}')

    else:

        print(f'   βœ“  No nulls')

    display(df.head(3))
=================================================================
DATASET SCHEMA & NULL REPORT
=================================================================

πŸ“‚ aisles
   Rows: 134  |  Cols: 2
   Columns: ['aisle_id', 'aisle']
   βœ“  No nulls
   aisle_id                  aisle
0         1  prepared soups salads
1         2      specialty cheeses
2         3    energy granola bars
πŸ“‚ departments
   Rows: 21  |  Cols: 2
   Columns: ['department_id', 'department']
   βœ“  No nulls
   department_id department
0              1     frozen
1              2      other
2              3     bakery
πŸ“‚ order_products__prior
   Rows: 32,434,489  |  Cols: 4
   Columns: ['order_id', 'product_id', 'add_to_cart_order', 'reordered']
   βœ“  No nulls
   order_id  product_id  add_to_cart_order  reordered
0         2       33120                  1          1
1         2       28985                  2          1
2         2        9327                  3          0
πŸ“‚ order_products__train
   Rows: 1,384,617  |  Cols: 4
   Columns: ['order_id', 'product_id', 'add_to_cart_order', 'reordered']
   βœ“  No nulls
   order_id  product_id  add_to_cart_order  reordered
0         1       49302                  1          1
1         1       11109                  2          1
2         1       10246                  3          0
πŸ“‚ orders
   Rows: 3,421,083  |  Cols: 7
   Columns: ['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow', 'order_hour_of_day', 'days_since_prior_order']
   ⚠️  Nulls: {'days_since_prior_order': 206209}
   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order
0   2539329        1    prior             1          2                  8                     NaN
1   2398795        1    prior             2          3                  7                 15.0000
2    473747        1    prior             3          3                 12                 21.0000
πŸ“‚ products
   Rows: 49,688  |  Cols: 4
   Columns: ['product_id', 'product_name', 'aisle_id', 'department_id']
   βœ“  No nulls
   product_id                          product_name  aisle_id  department_id
0           1            Chocolate Sandwich Cookies        61             19
1           2                      All-Seasons Salt       104             13
2           3  Robust Golden Unsweetened Oolong Tea        94              7
# Inspect the eval data
eval_counts = data['orders']['eval_set'].value_counts()

print('Orders by eval_set:')

for k, v in eval_counts.items():

    print(f'  {k:8s}: {v:>10,} orders  ({v/len(data["orders"])*100:.1f}%)')

print(f'\nTotal unique users: {data["orders"]["user_id"].nunique():,}')
print(f'Total unique products: {data["products"].shape[0]:,}')
print(f'Total aisles: {data["aisles"].shape[0]}  |  Departments: {data["departments"].shape[0]}')
Orders by eval_set:
  prior   :  3,214,874 orders  (94.0%)
  train   :    131,209 orders  (3.8%)
  test    :     75,000 orders  (2.2%)

Total unique users: 206,209
Total unique products: 49,688
Total aisles: 134  |  Departments: 21

Note on the data split: Every user has exactly one train order and one test order, with all remaining orders labelled prior. Models are trained using prior interactions and evaluated by predicting the contents of the train order. The test split has no product labels and is not used in this notebook.

Back to top


3. EDA β€” Orders & Timing

Understanding when customers shop reveals weekly and intra-day demand rhythms that carry useful signals for models β€” e.g., a user who always shops on Sunday mornings is likely to continue that pattern. This section also examines how often users return and how their inter-order gap is distributed.

# Normalize product names (lowercase, strip punctuation)
data['products']['product_name'] = (data['products']['product_name']
                                    .apply(lambda x: re.sub(r'[^\w\s]', ' ', x.lower()))  # lowercase, strip punctuation
                                    .apply(lambda x: re.sub(r'\d+|  ', '', x).strip()))   # drop digits and double spaces

# Merge the codes with the labels
products = (data['products']
            .merge(data['aisles'], on = 'aisle_id', how = 'left')
            .merge(data['departments'], on = 'department_id', how = 'left')
            .drop(['aisle_id', 'department_id'], axis = 1)
            .rename(columns = {'product_name': 'product'}))
print(f'Products enriched: {products.shape}')
products.head(3)
Products enriched: (49688, 4)
   product_id                               product              aisle department
0           1            chocolate sandwich cookies      cookies cakes     snacks
1           2                      all seasons salt  spices seasonings     pantry
2           3  robust golden unsweetened oolong tea                tea  beverages
# Visualize order volumes
fig, axes = plt.subplots(1, 2, figsize=(20, 5))

dow_labels = {0: 'Sun', 1: 'Mon', 2: 'Tue', 3: 'Wed', 4: 'Thu', 5: 'Fri', 6: 'Sat'}
dow_counts = data['orders']['order_dow'].map(dow_labels).value_counts().reindex(['Sun','Mon','Tue','Wed','Thu','Fri','Sat'])

colors_dow = sns.color_palette('husl', 7)

bars = axes[0].bar(dow_counts.index, dow_counts.values, color = colors_dow, edgecolor = 'white', linewidth = 0.8)
axes[0].set_title('Orders by Day of Week', fontsize = 14, fontweight = 'bold')
axes[0].set_xlabel('Day of Week'); axes[0].set_ylabel('Number of Orders')

# Iterate
for bar, v in zip(bars, dow_counts.values):

    axes[0].text(bar.get_x() + bar.get_width()/2, v + 1500, f'{v:,}', ha = 'center', va = 'bottom', fontsize = 8)

hour_counts = data['orders']['order_hour_of_day'].value_counts().sort_index()
axes[1].plot(hour_counts.index, hour_counts.values, marker = 'o', linewidth = 2.5, color = '#2196F3', markersize = 5)
axes[1].fill_between(hour_counts.index, hour_counts.values, alpha=0.15, color='#2196F3')
axes[1].axvspan(8, 11, alpha=0.1, color='orange', label='Peak morning (8-11)')
axes[1].axvspan(14, 17, alpha=0.1, color='green', label='Peak afternoon (14-17)')
axes[1].set_title('Orders by Hour of Day', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Hour of Day'); axes[1].set_ylabel('Number of Orders')
axes[1].set_xticks(range(0, 24)); axes[1].legend()

plt.suptitle('When Do Customers Shop?', fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

print(f'Peak ordering day:  {dow_counts.idxmax()}    ({dow_counts.max():,} orders)')
print(f'Peak ordering hour: {hour_counts.idxmax()}:00  ({hour_counts.max():,} orders)')

*(figure: When Do Customers Shop? β€” orders by day of week and by hour of day)*

Peak ordering day:  Sun    (600,905 orders)
Peak ordering hour: 10:00  (288,418 orders)
# Visualize user data
fig, axes = plt.subplots(1, 2, figsize = (20, 5))

# Days since prior order
dspo = data['orders']['days_since_prior_order'].dropna()

axes[0].hist(dspo, bins = 31, color = '#4CAF50', edgecolor = 'white', alpha = 0.85)
axes[0].axvline(dspo.mean(),   color = 'red',    linestyle = '--', lw = 2, label = f'Mean: {dspo.mean():.1f}d')
axes[0].axvline(dspo.median(), color = 'orange', linestyle = '--', lw = 2, label = f'Median: {dspo.median():.0f}d')
axes[0].axvline(7,  color = 'gray', linestyle = ':', alpha = 0.7, label = 'Weekly')
axes[0].axvline(30, color = 'gray', linestyle = ':', alpha = 0.7, label = 'Monthly')
axes[0].set_title('Days Since Prior Order Distribution', fontsize = 14, fontweight = 'bold')
axes[0].set_xlabel('Days'); axes[0].set_ylabel('Frequency')
axes[0].legend()

# Orders per user
orders_per_user = data['orders'].groupby('user_id')['order_id'].count()
axes[1].hist(orders_per_user, bins=50, color='#9C27B0', edgecolor='white', alpha=0.85)
axes[1].axvline(orders_per_user.mean(), color='red', linestyle='--', lw=2, label=f'Mean: {orders_per_user.mean():.1f}')
axes[1].set_title('Orders per User (all-time)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Number of Orders'); axes[1].set_ylabel('Number of Users')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f'Average days between orders:    {dspo.mean():.1f}')
print(f'Users who order weekly (≀7d):   {(dspo <= 7).mean()*100:.1f}%')
print(f'Users who order monthly (~30d): {(dspo.between(25,35)).mean()*100:.1f}%')
print(f'Users with 10+ orders:          {(orders_per_user >= 10).sum():,} ')
print(f'({(orders_per_user >= 10).mean()*100:.1f}%)')

*(figure: days-since-prior-order distribution and orders-per-user distribution)*

Average days between orders:    11.1
Users who order weekly (≀7d):   50.4%
Users who order monthly (~30d): 14.8%
Users with 10+ orders:          110,728 
(53.7%)
# Truncate the data and track memory
all_user_ids = data['orders']['user_id'].unique()
sampled_uids = set(np.random.default_rng(100).choice(all_user_ids, size = min(150000, len(all_user_ids)), replace = False))
print(f'Working with {len(sampled_uids):,} sampled users')

# Filter the orders index to only sampled users β€” still cheap
orders_slim = data['orders'][data['orders']['user_id'].isin(sampled_uids)][['order_id','user_id','eval_set','order_number', 'order_dow','order_hour_of_day','days_since_prior_order']]

# Cast to smaller dtypes immediately
orders_slim = orders_slim.astype({'order_id':               'int32',
                                  'user_id':                'int32',
                                  'order_number':           'int16',
                                  'order_dow':              'int8',
                                  'order_hour_of_day':      'int8',})

print(f'orders_slim: {orders_slim.shape}  |  {orders_slim.memory_usage(deep = True).sum()/1e6:.1f} MB')

# Get only the order_ids belonging to sampled users
sampled_order_ids = set(orders_slim['order_id'].values)

# Filter prior and train BEFORE any merge
prior_raw = (data['order_products__prior']
             [data['order_products__prior']['order_id'].isin(sampled_order_ids)]
             .astype({'order_id': 'int32', 'product_id': 'int32',
                      'add_to_cart_order': 'int16', 'reordered': 'int8'}))

train_raw = (data['order_products__train']
             [data['order_products__train']['order_id'].isin(sampled_order_ids)]
             .astype({'order_id': 'int32', 'product_id': 'int32',
                      'add_to_cart_order': 'int16', 'reordered': 'int8'}))

print(f'prior_raw:  {prior_raw.shape[0]:,} rows  |  {prior_raw.memory_usage(deep = True).sum()/1e6:.1f} MB')
print(f'train_raw:  {train_raw.shape[0]:,} rows  |  {train_raw.memory_usage(deep = True).sum()/1e6:.1f} MB')

# Downcast products lookup (tiny, but keeps dtypes consistent); copy so the original frame isn't mutated
products_slim = products.copy()
products_slim['product_id'] = products_slim['product_id'].astype('int32')

# Enrich prior and train separately β€” small frames now, safe to merge
def enrich(raw_df, eval_label):

    return (raw_df.merge(products_slim, on = 'product_id', how = 'left').merge(orders_slim.drop(columns = 'eval_set'), on = 'order_id', how = 'left').assign(eval_set=eval_label))

prior_op = enrich(prior_raw, 'prior')
del prior_raw; gc.collect()

train_op = enrich(train_raw, 'train')
del train_raw; gc.collect()

orders_master = pd.concat([prior_op, train_op], ignore_index = True)

print(f'\norders_master: {orders_master.shape}')
print(f'  prior rows:  {prior_op.shape[0]:,}')
print(f'  train rows:  {train_op.shape[0]:,}')
print(f'  total RAM:   {orders_master.memory_usage(deep=True).sum()/1e6:.1f} MB')
Working with 150,000 sampled users
orders_slim: (2485904, 7)  |  203.8 MB
prior_raw:  23,525,045 rows  |  447.0 MB
train_raw:  1,007,429 rows  |  19.1 MB

orders_master: (24532474, 13)
  prior rows:  23,525,045
  train rows:  1,007,429
  total RAM:   6748.3 MB

4. EDA β€” Products & Categories

This section examines the product catalogue: which items are ordered most, how order volume breaks down by department, and β€” crucially β€” which categories command the highest reorder rates. High reorder rate is the strongest single predictor used in later models.

# Visualize top products
fig, axes = plt.subplots(1, 3, figsize = (20, 6))

# Top 20 products
top_prod = prior_op['product'].value_counts().head(20)
axes[0].barh(top_prod.index[::-1], top_prod.values[::-1], color = sns.color_palette('viridis', 20))
axes[0].set_title('Top 20 Products (Prior Orders)', fontsize = 12, fontweight = 'bold')
axes[0].set_xlabel('Order Count')

for i, v in enumerate(top_prod.values[::-1]):

    axes[0].text(v + 1000, i, f'{v:,}', va = 'center', fontsize = 7)

# Department breakdown
dept_counts = prior_op['department'].value_counts().head(12)
axes[1].bar(range(len(dept_counts)), dept_counts.values, color = sns.color_palette('rocket', len(dept_counts)))
axes[1].set_xticks(range(len(dept_counts)))
axes[1].set_xticklabels(dept_counts.index, rotation=40, ha='right', fontsize=9)
axes[1].set_title('Orders by Department', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Order Count')

# Reorder rate by department
dept_reorder = prior_op.groupby('department')['reordered'].mean().sort_values(ascending=False).head(12)
bars = axes[2].barh(dept_reorder.index[::-1], dept_reorder.values[::-1], color = sns.color_palette('mako', len(dept_reorder)))
axes[2].axvline(0.5, color='red', linestyle='--', alpha=0.6, label='50% line')
axes[2].set_title('Reorder Rate by Department', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Reorder Rate')
axes[2].legend()

for bar in bars:

    w = bar.get_width()
    axes[2].text(w + 0.005, bar.get_y() + bar.get_height()/2, f'{w:.1%}', va='center', fontsize=8)

plt.suptitle('Product & Category Landscape', fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

print(f'Overall reorder rate: {prior_op["reordered"].mean()*100:.1f}%')
print(f'Highest reorder dept: {dept_reorder.idxmax()} ({dept_reorder.max():.1%})')
print(f'Lowest reorder dept:  {dept_reorder.idxmin()} ({dept_reorder.min():.1%})')

*(figure: Product & Category Landscape β€” top 20 products, orders by department, reorder rate by department)*

Overall reorder rate: 58.9%
Highest reorder dept: dairy eggs (67.0%)
Lowest reorder dept:  breakfast (56.1%)
# Visualize frequency of product categories
aisle_summary = prior_op.groupby('aisle').agg(order_count = ('order_id', 'count'), reorder_rate = ('reordered', 'mean')).reset_index().sort_values('order_count', ascending = False).head(30)

fig, ax = plt.subplots(figsize=(20, 8))

scatter = ax.scatter(aisle_summary['order_count'],
                     aisle_summary['reorder_rate'],
                     c = aisle_summary['reorder_rate'],
                     cmap = 'RdYlGn',
                     s = 100,
                     alpha = 0.8,
                     edgecolors = 'gray',
                     linewidth = 0.5)

for _, row in aisle_summary.iterrows():

    ax.annotate(row['aisle'], (row['order_count'], row['reorder_rate']), fontsize = 7, ha = 'left', va = 'bottom', alpha = 0.85)

plt.colorbar(scatter, ax=ax, label='Reorder Rate')
ax.set_xlabel('Total Orders (Prior)', fontsize=12)
ax.set_ylabel('Reorder Rate', fontsize=12)
ax.set_title('Top 30 Aisles: Volume vs Reorder Rate\n(high-right = popular & habitual)', fontsize = 13, fontweight='bold')
ax.axhline(prior_op['reordered'].mean(), color='gray', linestyle='--', alpha=0.5, label='Avg reorder rate')
ax.legend()
plt.tight_layout()
plt.show()

*(figure: Top 30 aisles β€” order volume vs reorder rate)*

Back to top


5. EDA β€” User Behavior

Having explored the product side, we now profile the users: how large are their baskets, how habitual are they (do they keep buying the same items?), and how does reorder tendency vary across the user base? These patterns motivate several of the features engineered in Section 6.

# Visualize cart size and user behavior patterns
fig, axes = plt.subplots(1, 3, figsize=(20, 5))

# Cart size distribution
cart_sizes = prior_op.groupby('order_id')['product_id'].count()
axes[0].hist(cart_sizes.clip(upper=50), bins=50, color='#FF5722', edgecolor='white', alpha=0.85)
axes[0].axvline(cart_sizes.mean(), color='black', linestyle='--', lw = 2, label = f'Mean: {cart_sizes.mean():.1f}')
axes[0].set_title('Cart Size Distribution', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Items per Order (capped at 50)'); axes[0].set_ylabel('Orders')
axes[0].legend()

# Reorder rate by cart position
atco = prior_op.groupby('add_to_cart_order')['reordered'].mean()
atco = atco[atco.index <= 30]

axes[1].plot(atco.index, atco.values, marker='o', linewidth=2.5, color='#FF9800')
axes[1].fill_between(atco.index, atco.values, alpha=0.15, color='#FF9800')
axes[1].set_title('Reorder Rate by Cart Position', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Add-to-Cart Position'); axes[1].set_ylabel('Reorder Rate')
axes[1].set_ylim(0, 1)
axes[1].annotate('First items\nare most habitual', xy=(1, atco.iloc[0]), xytext=(5, 0.85), arrowprops=dict(arrowstyle='->', color='gray'), fontsize=9)

# User reorder rate distribution
user_reorder = prior_op.groupby('user_id')['reordered'].mean()
axes[2].hist(user_reorder, bins=40, color='#3F51B5', edgecolor='white', alpha=0.85)
axes[2].axvline(user_reorder.mean(), color='red', linestyle='--', lw=2,
                label=f'Mean: {user_reorder.mean():.2f}')
axes[2].set_title('User Reorder Rate Distribution', fontsize=13, fontweight='bold')
axes[2].set_xlabel('Reorder Rate'); axes[2].set_ylabel('Number of Users')
axes[2].legend()

plt.suptitle('User Behavior Patterns', fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

print(f'Average cart size:                  {cart_sizes.mean():.1f} items')
print(f'Median cart size:                   {cart_sizes.median():.0f} items')
print(f'% users with reorder rate > 50%:    {(user_reorder > 0.5).mean()*100:.1f}%')
print(f'First-position reorder rate:        {atco.iloc[0]:.1%}')

*(figure: User Behavior Patterns β€” cart size distribution, reorder rate by cart position, user reorder rate distribution)*

Average cart size:                  10.1 items
Median cart size:                   8 items
% users with reorder rate > 50%:    38.4%
First-position reorder rate:        67.7%

Back to top


6. Feature Engineering


We engineer three tiers of features that capture different aspects of purchasing behaviour:

| Tier | Features | Captures |
|------|----------|----------|
| Product | Order count, reorder rate, avg/std cart position, global popularity rank | How popular and habitual a product is across all users |
| User | Total orders, avg cart size, reorder rate, avg/std days between orders, variety ratio | A user’s overall shopping style and cadence |
| User Γ— Product | Times ordered, reorder rate, avg cart position, recency gap, reorder score | How strongly this user is attached to this product |

The `reorder_score = (times_ordered Γ— reorder_rate) / (orders_since_last + 1)` acts as a recency-weighted loyalty signal and is used by several strategies.
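The product-feature cell itself is not reproduced here, but the product tier and the reorder score can be sketched on toy rows (column names follow the table above; the real notebook cell may differ in detail):

```python
import pandas as pd

# Toy slice of prior order-product rows (hypothetical values)
prior_op = pd.DataFrame({
    'user_id':           [1, 1, 1, 2],
    'product_id':        [196, 196, 196, 196],
    'order_id':          [1, 2, 3, 4],
    'add_to_cart_order': [1, 1, 2, 3],
    'reordered':         [0, 1, 1, 0],
})

# Product tier: popularity and habit strength across all users
product_feats = prior_op.groupby('product_id').agg(
    order_count       = ('order_id', 'count'),
    reorder_rate      = ('reordered', 'mean'),
    avg_cart_position = ('add_to_cart_order', 'mean'),
)

# reorder_score for one user-product pair, per the formula above
times_ordered, reorder_rate, orders_since_last = 10, 0.9, 0
score = (times_ordered * reorder_rate) / (orders_since_last + 1)
print(product_feats)
print(score)
```

With these toy numbers the pair scores 9.0: ten orders with a 90% reorder rate and no gap since the last order yields maximal loyalty.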

Product features shape: (49549, 11)

Top 10 most ordered products:
                  product department  order_count  reorder_rate  unique_users  avg_cart_position
0                  banana    produce       342433        0.8432         53698             4.8967
1  bag of organic bananas    produce       274530        0.8316         46219             5.0874
2    organic strawberries    produce       191724        0.7767         42804             7.2159
3    organic baby spinach    produce       175904        0.7720         40100             7.3903
4    organic hass avocado    produce       155268        0.7971         31499             6.7534
5         organic avocado    produce       128942        0.7588         31102             6.4500
6             large lemon    produce       110732        0.6961         33648             7.9196
7            strawberries    produce       103918        0.6991         31274             7.1101
8                   limes    produce       100932        0.6787         32429             8.5485
9     organic raspberries    produce        99192        0.7678         23036             7.2162
# Derive user features
user_order_enriched = prior_op.merge(data['orders'][['order_id','order_number','days_since_prior_order']], on = 'order_id', how = 'left', suffixes = ('','_o'))

user_stats = user_order_enriched.groupby('user_id').agg(total_orders          = ('order_id',                'nunique'),
                                                        total_items_purchased = ('product_id',              'count'),
                                                        unique_products       = ('product_id',              'nunique'),
                                                        avg_cart_size         = ('order_id', lambda x: x.count() / x.nunique()),
                                                        reorder_rate          = ('reordered',               'mean'),
                                                        avg_days_between      = ('days_since_prior_order',  'mean'),
                                                        std_days_between      = ('days_since_prior_order',  'std'),).reset_index()

user_stats['variety_ratio'] = (user_stats['unique_products'] / user_stats['total_items_purchased'])

print(f'User features shape: {user_stats.shape}')
print('\nUser statistics summary:')
display(user_stats[['total_orders','avg_cart_size','reorder_rate', 'avg_days_between','variety_ratio']].describe())
User features shape: (150000, 9)

User statistics summary:
       total_orders  avg_cart_size  reorder_rate  avg_days_between  variety_ratio
count   150000.0000    150000.0000   150000.0000       150000.0000    150000.0000
mean        15.5727         9.9431        0.4320           15.4821         0.5680
std         16.6764         5.8516        0.2121            7.2099         0.2121
min          3.0000         1.0000        0.0000            0.0000         0.0105
25%          5.0000         5.7500        0.2676            9.5582         0.4046
50%          9.0000         8.9333        0.4286           14.7143         0.5714
75%         19.0000        13.0000        0.5954           20.7500         0.7324
max         99.0000        70.2500        0.9895           30.0000         1.0000
# Find user/product interactions
up_stats = user_order_enriched.groupby(['user_id','product_id']).agg(times_ordered     = ('order_id',            'count'),
                                                                     times_reordered   = ('reordered',            'sum'),
                                                                     avg_cart_position = ('add_to_cart_order',    'mean'),
                                                                     last_order_num    = ('order_number',          'max')).reset_index()

user_max_order = (data['orders'][data['orders']['eval_set'] == 'prior'].groupby('user_id')['order_number'].max().reset_index().rename(columns = {'order_number': 'max_order_num'}))

up_stats = up_stats.merge(user_max_order, on='user_id', how='left')
up_stats['orders_since_last'] = up_stats['max_order_num'] - up_stats['last_order_num']
up_stats['reorder_rate']      = up_stats['times_reordered'] / up_stats['times_ordered'].clip(lower=1)

# Recency-weighted score: high orders + high reorder rate + low gap
up_stats['reorder_score']     = (up_stats['times_ordered'] * up_stats['reorder_rate']) / (up_stats['orders_since_last'] + 1)

print(f'User-product interaction features: {up_stats.shape}')
print(f'Avg interactions per user: {len(up_stats)/up_stats["user_id"].nunique():.1f} products')
up_stats.head()
User-product interaction features: (9657625, 10)
Avg interactions per user: 64.4 products
   user_id  product_id  times_ordered  times_reordered  avg_cart_position  last_order_num  max_order_num  orders_since_last  reorder_rate  reorder_score
0        1         196             10                9             1.4000              10             10                  0        0.9000         9.0000
1        1       10258              9                8             3.3333              10             10                  0        0.8889         8.0000
2        1       10326              1                0             5.0000               5             10                  5        0.0000         0.0000
3        1       12427             10                9             3.3000              10             10                  0        0.9000         9.0000
4        1       13032              3                2             6.3333              10             10                  0        0.6667         2.0000

Back to top


7. Association Rule Mining (Market Basket Analysis)

We apply the Apriori algorithm to a stratified sample of prior orders to discover which products are frequently purchased together. Here we:

  1. Sample 50,000 users from the large prior set and keep only each user’s most recent prior order (most predictive of future behaviour)
  2. Restrict the product space to the top-5,000 items so the boolean basket matrix stays tractable
  3. Use min_support=0.01 (at least 1% of sample baskets) so item pairs are discoverable
  4. Filter rules to lift > 1.0 (positive associations only)
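Before mining, a quick worked example of the three rule metrics with hypothetical basket counts:

```python
# Hypothetical counts: 1,000 baskets; 100 contain A, 50 contain B, 20 contain both
n_baskets = 1000
n_a, n_b, n_ab = 100, 50, 20

support_ab = n_ab / n_baskets               # how common the pair is: 0.02
confidence = n_ab / n_a                     # P(B | A): 0.2
lift       = confidence / (n_b / n_baskets) # vs chance: 0.2 / 0.05 = 4.0

print(support_ab, confidence, lift)
```

A lift of 4 means B appears in A's baskets four times more often than it would under independence, which is exactly the kind of rule the `lift > 1.0` filter keeps.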
# Get a sample, increase for more comprehensive rules (takes longer)
rng = np.random.default_rng(100)
sample_uids = rng.choice(prior_op['user_id'].unique(), size = 50000, replace = False)

# Use the most recent prior order per sampled user (most predictive of future)
recent_prior_orders = (data['orders'][(data['orders']['user_id'].isin(sample_uids)) &
                                      (data['orders']['eval_set'] == 'prior')]
                       .sort_values(['user_id', 'order_number'])
                       .groupby('user_id').last()
                       .reset_index()[['user_id', 'order_id']])
sample_op = prior_op[prior_op['order_id'].isin(recent_prior_orders['order_id'])]

print(f'Sample baskets:        {sample_op["order_id"].nunique():,}')
print(f'Avg basket size:       {sample_op.groupby("order_id")["product"].count().mean():.1f} items')
print(f'Total unique products: {sample_op["product"].nunique():,}')
Sample baskets:        50,000
Avg basket size:       10.4 items
Total unique products: 30,430
# Reduce the product space by filtering the top products
top_products = (sample_op['product'].value_counts().head(5000).index.tolist())
top_product_set = set(top_products)

# Filter each basket to only include top-N products
filtered_baskets = (sample_op[sample_op['product'].isin(top_product_set)].groupby('order_id')['product'].apply(list).tolist())

# Drop baskets that now have fewer than 2 items (no pairs possible)
filtered_baskets = [b for b in filtered_baskets if len(b) >= 2]

print(f'Baskets after filtering:       {len(filtered_baskets):,}')
print(f'Avg basket size (top-N only):  {np.mean([len(b) for b in filtered_baskets]):.1f} items')
print(f'Product space:                 {len(top_products):,} items')
print(f'\nMin support threshold (1%):    {int(0.01 * len(filtered_baskets)):,} baskets required')
Baskets after filtering:       45,405
Avg basket size (top-N only):  9.1 items
Product space:                 1000 items

Min support threshold (1%):    454 baskets required
# Perform apriori
te = TransactionEncoder()
te_array = te.fit_transform(filtered_baskets)
basket_df = pd.DataFrame(te_array, columns = te.columns_)
# Get frequencies
frequent_itemsets = apriori(basket_df, min_support = .01, use_colnames = True, max_len = 2)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
# Mine the rules
rules = association_rules(frequent_itemsets, metric = 'confidence', min_threshold = 0.05)
rules = rules[rules['lift'] > 1.0].sort_values('lift', ascending = False).reset_index(drop = True)
# Summarize results
print(f'Frequent itemsets: {len(frequent_itemsets):,}')
print(f'  Singletons: {(frequent_itemsets["length"]==1).sum()}')
print(f'  Pairs:      {(frequent_itemsets["length"]==2).sum()}')
print(f'\nAssociation rules (lift > 1): {len(rules):,}')
print('\nTop 20 rules by lift:')
display(rules[['antecedents','consequents','support','confidence','lift']]
        .head(20)
        .assign(antecedents = lambda df: df['antecedents'].apply(lambda x: list(x)[0]),
                consequents = lambda df: df['consequents'].apply(lambda x: list(x)[0])))
Frequent itemsets: 134
  Singletons: 115
  Pairs:      19

Association rules (lift > 1): 38

Top 20 rules by lift:
|    | antecedents | consequents | support | confidence | lift |
|----|-------------|-------------|---------|------------|------|
| 0  | large lemon | limes | 0.0115 | 0.1746 | 3.6959 |
| 1  | limes | large lemon | 0.0115 | 0.2429 | 3.6959 |
| 2  | organic raspberries | organic strawberries | 0.0121 | 0.2622 | 3.1958 |
| 3  | organic strawberries | organic raspberries | 0.0121 | 0.1474 | 3.1958 |
| 4  | bag of organic bananas | organic hass avocado | 0.0198 | 0.1536 | 2.5129 |
| 5  | organic hass avocado | bag of organic bananas | 0.0198 | 0.3242 | 2.5129 |
| 6  | organic fuji apple | banana | 0.0105 | 0.3856 | 2.4618 |
| 7  | banana | organic fuji apple | 0.0105 | 0.0671 | 2.4618 |
| 8  | organic raspberries | bag of organic bananas | 0.0143 | 0.3095 | 2.3986 |
| 9  | bag of organic bananas | organic raspberries | 0.0143 | 0.1106 | 2.3986 |
| 10 | organic hass avocado | organic strawberries | 0.0117 | 0.1913 | 2.3316 |
| 11 | organic strawberries | organic hass avocado | 0.0117 | 0.1426 | 2.3316 |
| 12 | banana | honeycrisp apple | 0.0102 | 0.0648 | 2.3303 |
| 13 | honeycrisp apple | banana | 0.0102 | 0.3650 | 2.3303 |
| 14 | organic baby spinach | organic avocado | 0.0117 | 0.1406 | 2.2902 |
| 15 | organic avocado | organic baby spinach | 0.0117 | 0.1905 | 2.2902 |
| 16 | organic strawberries | bag of organic bananas | 0.0231 | 0.2813 | 2.1807 |
| 17 | bag of organic bananas | organic strawberries | 0.0231 | 0.1789 | 2.1807 |
| 18 | organic hass avocado | organic baby spinach | 0.0105 | 0.1718 | 2.0662 |
| 19 | organic baby spinach | organic hass avocado | 0.0105 | 0.1263 | 2.0662 |
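The reported lift values can be sanity-checked from the standard definitions, confidence(A β†’ B) = support(A βˆͺ B) / support(A) and lift(A β†’ B) = confidence(A β†’ B) / support(B). A small sketch using the top rule's numbers from the table above; the item-level supports are implied by those numbers, not read from the data:

```python
# Check the top rule (large lemon -> limes) against the standard definitions:
#   confidence(A -> B) = support(A βˆͺ B) / support(A)
#   lift(A -> B)       = confidence(A -> B) / support(B)
support_ab  = 0.0115   # support of {large lemon, limes}  (from the table above)
conf_a_to_b = 0.1746   # confidence(large lemon -> limes)
conf_b_to_a = 0.2429   # confidence(limes -> large lemon)

support_a = support_ab / conf_a_to_b   # implied support(large lemon), ~0.066
support_b = support_ab / conf_b_to_a   # implied support(limes),       ~0.047
lift = conf_a_to_b / support_b         # ~3.69, matching the table up to rounding
print(f"support_a={support_a:.4f}  support_b={support_b:.4f}  lift={lift:.3f}")
```

Lift being symmetric in A and B is also why each item pair appears twice in the table with the same lift but different confidences.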
# Visualize association rules
fig, axes = plt.subplots(1, 2, figsize=(20, 6))

# Scatter: support Γ— confidence, coloured by lift
sc = axes[0].scatter(rules['support'], rules['confidence'], c = rules['lift'], cmap = 'YlOrRd', alpha = 0.7, s = 40, edgecolors = 'gray', linewidth = 0.3)
plt.colorbar(sc, ax = axes[0], label = 'Lift')
axes[0].set_xlabel('Support'); axes[0].set_ylabel('Confidence')
axes[0].set_title('Support Γ— Confidence (colour = Lift)', fontsize=12, fontweight='bold')

# Top 15 rules by lift β€” horizontal bar
top15 = rules.head(15).copy()
top15['rule_label'] = top15.apply(lambda r: f"{list(r['antecedents'])[0][:22]} β†’ {list(r['consequents'])[0][:22]}", axis=1)
bar_colors = sns.color_palette('YlOrRd', 15)
axes[1].barh(top15['rule_label'][::-1], top15['lift'][::-1], color=bar_colors)
axes[1].axvline(1.0, color='gray', linestyle='--', alpha=0.6, label='Lift = 1 (independent)')
axes[1].set_xlabel('Lift'); axes[1].set_title('Top 15 Rules by Lift', fontsize=12, fontweight='bold')
axes[1].legend(); axes[1].tick_params(axis='y', labelsize=8)

plt.suptitle('Market Basket β€” Association Rules', fontsize=15, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

*(figure: Market Basket β€” Association Rules: support Γ— confidence scatter coloured by lift, and top-15 rules by lift)*

# Build rule lookup dicts (used by prediction model):
#   antecedent frozenset β†’ [(confidence, lift, consequent_name)]
rule_lookup = defaultdict(list)
for _, row in rules.iterrows():
    ant = frozenset(row['antecedents'])
    con = list(row['consequents'])[0]
    rule_lookup[ant].append((row['confidence'], row['lift'], con))

# single item β†’ [(conf, lift, consequent, full_antecedent)]
rule_single = defaultdict(list)
for ant, recs in rule_lookup.items():
    for item in ant:
        for conf, lift, con in recs:
            rule_single[item].append((conf, lift, con, ant))

# Helper lookup dicts
pid_to_name = dict(zip(products['product_id'], products['product']))
name_to_pid = dict(zip(products['product'], products['product_id']))
pid_to_dept = dict(zip(products['product_id'], products['department']))

print(f'Rule lookup entries: {len(rule_lookup)}')
print(f'Single-item lookup:  {len(rule_single)} distinct items have rules')
Rule lookup entries: 12
Single-item lookup:  12 distinct items have rules

Back to top


8. Next-Item Prediction Strategies

Five strategies recommend products as the user builds their cart, each progressively richer in how it uses purchase history and the sequence of items already added:

| # | Strategy | Cart-Order Signal | ML? | Cold-start fallback |
|----|---------|-------------------|-----|---------------------|
| S1 | Global Popularity | β€” | No | β€” (is the fallback) |
| S2 | Personal History | β€” | No | Falls back to S1 |
| S3 | Association Rules | Cart contents | No | Falls back to S2 |
| S4 | Sequential Markov Chain | Transition + position probs | No | Falls back to S2 |
| S5 | LightGBM (cart-aware) | Markov + position features | βœ… | Falls back to S1 |

All strategies share the same interface:

fn(user_id, cart_sequence, K=10) β†’ [product_id, ...]

where cart_sequence is the ordered list of product_ids added to cart so far. This uniform signature makes it trivial to swap strategies in the evaluation harness and the final demo.
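Because every strategy satisfies the same contract, swapping one for another in a harness is a one-line change. A minimal sketch with two toy stand-in strategies; the product_ids, popularity pool, and per-user history here are hypothetical, not the real S1/S2:

```python
from typing import Callable, Dict, List

# Shared contract: fn(user_id, cart_sequence, K=10) -> [product_id, ...]
Strategy = Callable[..., List[int]]

def stub_popularity(user_id, cart_sequence, K=10, **_):
    pool = [24852, 13176, 21137, 21903, 47209]      # toy "most popular" product_ids
    in_cart = set(cart_sequence)
    return [p for p in pool if p not in in_cart][:K]

def stub_history(user_id, cart_sequence, K=10, **_):
    history = {7: [21137, 24852]}                   # toy per-user purchase history
    in_cart = set(cart_sequence)
    recs = [p for p in history.get(user_id, []) if p not in in_cart]
    # Backfill from popularity when history runs dry (the fallback pattern)
    backfill = [p for p in stub_popularity(user_id, cart_sequence, K) if p not in recs]
    return (recs + backfill)[:K]

def run_harness(strategies: Dict[str, Strategy], user_id, cart, K=3):
    # Any fn matching the signature can be swapped in without harness changes
    return {name: fn(user_id, cart, K=K) for name, fn in strategies.items()}

print(run_harness({"pop": stub_popularity, "hist": stub_history}, user_id=7, cart=[24852]))
```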

# Build the ground truth for model scoring
train_truth = (train_op.groupby('user_id')['product_id'].apply(set).reset_index().rename(columns = {'product_id': 'true_items'}))
# Simulate the "current shopping session" using the LAST prior order per user
last_prior_oids = (orders_slim[orders_slim['eval_set'] == 'prior'].sort_values(['user_id', 'order_number']).groupby('user_id')['order_id'].last().reset_index().rename(columns = {'order_id': 'last_oid'}))

last_prior_seq = (prior_op[prior_op['order_id'].isin(last_prior_oids['last_oid'])].sort_values(['order_id', 'add_to_cart_order']).groupby('order_id')['product_id'].apply(list).reset_index().rename(columns = {'order_id': 'last_oid', 'product_id': 'cart_sequence'}))

eval_df = (train_truth.merge(last_prior_oids, on = 'user_id').merge(last_prior_seq, on = 'last_oid', how = 'inner'))
# Reproducible subsample for fast evaluation across all strategies
eval_sample = eval_df.sample(n = min(3000, len(eval_df)), random_state = 100).reset_index(drop = True)

print(f"Evaluation set  : {len(eval_df):,} users")
print(f"Eval sample     : {len(eval_sample):,} users")
print(f"Avg true basket : {eval_sample['true_items'].apply(len).mean():.1f} items")
print(f"Avg cart length : {eval_sample['cart_sequence'].apply(len).mean():.1f} items")
eval_sample[['user_id','cart_sequence','true_items']].head(3)
Evaluation set  : 95,361 users
Eval sample     : 3,000 users
Avg true basket : 10.5 items
Avg cart length : 10.6 items
|   | user_id | cart_sequence | true_items |
|---|---------|---------------|------------|
| 0 | 15188 | [34915, 47766, 43183, 35406, 49054, 39993, 457... | {46656, 34915, 46724, 264, 41290, 35406, 5296,... |
| 1 | 23617 | [41177] | {42585, 18740} |
| 2 | 13611 | [38895, 24852, 9076, 16617, 33294, 45007, 2784... | {17794, 44359, 2380, 40174, 38895, 1359, 47601... |
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 1 β€” Global Popularity Baseline
# ─────────────────────────────────────────────────────────────────────────────
# Recommend the most-ordered products globally.  No personalisation, no cart
# context β€” pure frequency signal.  Serves as the floor baseline.

POPULARITY_POOL = (product_stats.sort_values('order_count', ascending=False)['product_id'].head(300).tolist())

def s1_popularity(user_id, cart_sequence, K = 10, **_):
    """Top-K globally popular products not already in the cart."""
    in_cart = set(cart_sequence)
    return [p for p in POPULARITY_POOL if p not in in_cart][:K]

print("βœ…  S1 (Global Popularity) ready")
print(f"   Pool: top {len(POPULARITY_POOL)} products")
print(f"   Sample recs: {[pid_to_name.get(p,'?')[:25] for p in s1_popularity(0, [], K=5)]}")
βœ…  S1 (Global Popularity) ready
   Pool: top 300 products
   Sample recs: ['banana', 'bag of organic bananas', 'organic strawberries', 'organic baby spinach', 'organic hass avocado']
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 2 β€” Personal History
# ─────────────────────────────────────────────────────────────────────────────
# Rank the user's previously purchased products by reorder_score
# (frequency Γ— reorder-rate / recency gap), then return the top-K
# not yet in the current cart.

user_history = (up_stats.sort_values(['user_id', 'reorder_score'], ascending=[True, False]).groupby('user_id')['product_id'].apply(list).to_dict())

def s2_personal(user_id, cart_sequence, K = 10, **_):
    """Top-K products from user's personal purchase history by reorder_score."""
    in_cart = set(cart_sequence)
    return [p for p in user_history.get(user_id, []) if p not in in_cart][:K]

coverage = sum(1 for uid in eval_sample['user_id'] if uid in user_history)
print("βœ…  S2 (Personal History) ready")
print(f"   History coverage : {coverage}/{len(eval_sample)} ({coverage/len(eval_sample):.1%})")

_u0 = eval_sample['user_id'].iloc[0]
_c0 = eval_sample['cart_sequence'].iloc[0]
print(f"   Sample recs (user {_u0}): {[pid_to_name.get(p,'?')[:22] for p in s2_personal(_u0,_c0,K=3)]}")
βœ…  S2 (Personal History) ready
   History coverage : 3000/3000 (100.0%)
   Sample recs (user 15188): ['sweet red grape tomato', 'banana', 'whole wheat bread']
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 3 β€” Association Rules  (cart-content-aware)
# ─────────────────────────────────────────────────────────────────────────────
# Every item currently in the cart fires its association rules.
# Consequents are scored by the SUM of (confidence Γ— lift) across all
# triggering cart items, then ranked.  Falls back to personal history
# when the rule set yields fewer than K candidates.

# Build a fast name-keyed lookup: item_name β†’ {consequent_name: agg_score}
assoc_lookup = defaultdict(lambda: defaultdict(float))
for ant_name, recs in rule_single.items():
    for conf, lift, con_name, _ in recs:
        assoc_lookup[ant_name][con_name] += conf * lift

def s3_rules(user_id, cart_sequence, K=10, **_):
    """Score candidates via association rules fired by the current cart."""
    in_cart_ids   = set(cart_sequence)
    in_cart_names = {pid_to_name.get(p, '') for p in cart_sequence}

    scores = defaultdict(float)
    for name in in_cart_names:
        for con_name, score in assoc_lookup.get(name, {}).items():
            pid = name_to_pid.get(con_name)
            if pid and pid not in in_cart_ids:
                scores[pid] += score

    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) >= K:
        return ranked[:K]

    # Backfill with personal history when rules are sparse
    extra = [p for p in s2_personal(user_id, cart_sequence, K * 3) if p not in set(ranked)]
    return (ranked + extra)[:K]

print("βœ…  S3 (Association Rules) ready")
print(f"   Assoc lookup items : {len(assoc_lookup)}")
print(f"   Sample recs : {[pid_to_name.get(p,'?')[:22] for p in s3_rules(_u0,_c0,K = 3)]}")
βœ…  S3 (Association Rules) ready
   Assoc lookup items : 12
   Sample recs : ['banana', 'organic baby spinach', 'sweet red grape tomato']
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 4 β€” Sequential Markov Chain  (O(N) single pass, no per-row Python loops)
# ─────────────────────────────────────────────────────────────────────────────
MAX_POS        = 20
MAX_SUCCESSORS = 100
MAX_POS_ITEMS  = 200
MAX_USER_HIST  = 50

print("Building Markov transition matrix…")
t0 = time.time()

# ── Step 1: extract + sort ────────────────────────────────────────────────────
arr  = prior_op[['order_id','product_id','add_to_cart_order']].to_numpy(dtype = 'int32')
arr  = arr[np.lexsort((arr[:,2], arr[:,0]))]

oids = arr[:,0]; pids = arr[:,1]; pos0 = arr[:,2] - 1
del arr; gc.collect()

# ── Step 2: transition pairs ──────────────────────────────────────────────────
same   = oids[:-1] == oids[1:]
prev_p = pids[:-1][same].astype('int32')
next_p = pids[1: ][same].astype('int32')
del same; gc.collect()

# Sort pairs so identical (prev,next) rows are adjacent β†’ count with diff
order  = np.lexsort((next_p, prev_p))
prev_s = prev_p[order]; next_s = next_p[order]
del prev_p, next_p, order; gc.collect()

# Run-length encode: a True marks the start of each run of identical (prev,next) rows
change = np.concatenate([[True], (prev_s[1:] != prev_s[:-1]) | (next_s[1:] != next_s[:-1])])
# Count each run in O(N): distance between consecutive run starts
pair_counts = np.diff(np.concatenate([np.flatnonzero(change), [len(prev_s)]]))
# Unique (prev,next) pairs
u_prev = prev_s[change]; u_next = next_s[change]
del prev_s, next_s, change; gc.collect()

# Group boundaries on the unique pairs
grp_bounds = np.flatnonzero(np.diff(u_prev)) + 1
grp_bounds = np.concatenate([[0], grp_bounds, [len(u_prev)]])

trans_prob = {}
for i in range(len(grp_bounds) - 1):
    lo, hi = grp_bounds[i], grp_bounds[i+1]
    node = int(u_prev[lo])
    c = pair_counts[lo:hi]; t = u_next[lo:hi]
    if len(c) > MAX_SUCCESSORS:
        top = np.argpartition(c, -MAX_SUCCESSORS)[-MAX_SUCCESSORS:]
        c, t = c[top], t[top]
    total = c.sum()
    trans_prob[node] = {int(ti): float(ci/total) for ti, ci in zip(t, c)}

del u_prev, u_next, pair_counts, grp_bounds; gc.collect()
print(f"  trans_prob done  ({len(trans_prob):,} nodes)  {time.time()-t0:.1f}s")

# ── Step 3: position-popularity ───────────────────────────────────────────────
mask  = pos0 < MAX_POS
p_pos = pos0[mask].astype('int32'); p_pid = pids[mask].astype('int32')
del oids, pids, pos0, mask; gc.collect()

order2  = np.lexsort((p_pid, p_pos))
pp_s = p_pos[order2]; pid_s = p_pid[order2]
del p_pos, p_pid, order2; gc.collect()

keep2 = np.concatenate([[True], (pp_s[1:] != pp_s[:-1]) | (pid_s[1:] != pid_s[:-1])])
pc = np.diff(np.concatenate([np.flatnonzero(keep2), [len(pp_s)]]))
u_pos = pp_s[keep2]; u_pid = pid_s[keep2]
del pp_s, pid_s, keep2; gc.collect()

grp2 = np.flatnonzero(np.diff(u_pos)) + 1
grp2 = np.concatenate([[0], grp2, [len(u_pos)]])

pos_prob = {}
for i in range(len(grp2) - 1):
    lo, hi = grp2[i], grp2[i+1]
    pos = int(u_pos[lo])
    c = pc[lo:hi]; t = u_pid[lo:hi]
    if len(c) > MAX_POS_ITEMS:
        top = np.argpartition(c, -MAX_POS_ITEMS)[-MAX_POS_ITEMS:]
        c, t = c[top], t[top]
    total = c.sum()
    pos_prob[pos] = {int(ti): float(ci/total) for ti, ci in zip(t, c)}

del u_pos, u_pid, pc, grp2; gc.collect()

# ── Step 4: per-user reorder-score list (capped) ─────────────────────────────
user_rs_list = (up_stats
                .sort_values('reorder_score', ascending=False)
                .groupby('user_id').head(MAX_USER_HIST)
                .groupby('user_id')
                .apply(lambda g: list(zip(g['product_id'].tolist(),
                                          g['reorder_score'].tolist())))
                .to_dict())

print(f"βœ…  Markov chain built in {time.time()-t0:.1f}s")
print(f"   Source nodes            : {len(trans_prob):,} products")
print(f"   Avg successors per node : {np.mean([len(v) for v in trans_prob.values()]):.1f}")
print(f"   Positions tracked       : 0 – {MAX_POS-1}")
Building Markov transition matrix…
  trans_prob done  (49,411 nodes)  24.8s
βœ…  Markov chain built in 67.9s
   Source nodes            : 49,411 products
   Avg successors per node : 48.3
   Positions tracked       : 0 – 19
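The sort-then-diff counting trick used to build `trans_prob` generalises to any array of pairs. A self-contained toy run, assuming nothing beyond NumPy, that counts (prev, next) pair frequencies without a Python loop over rows:

```python
import numpy as np

# Toy (prev, next) transition pairs, unsorted
prev_p = np.array([5, 3, 5, 3, 5, 3], dtype='int32')
next_p = np.array([7, 4, 7, 9, 8, 4], dtype='int32')

# Sort so identical (prev, next) rows become adjacent
order = np.lexsort((next_p, prev_p))        # primary key: prev, secondary: next
prev_s, next_s = prev_p[order], next_p[order]

# A True marks the start of each run of identical rows
change = np.concatenate([[True], (prev_s[1:] != prev_s[:-1]) | (next_s[1:] != next_s[:-1])])
# Run lengths = distance between consecutive run starts (sentinel = array length)
counts = np.diff(np.concatenate([np.flatnonzero(change), [len(prev_s)]]))
pairs = list(zip(prev_s[change].tolist(), next_s[change].tolist(), counts.tolist()))
print(pairs)   # [(3, 4, 2), (3, 9, 1), (5, 7, 2), (5, 8, 1)]
```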
# Strategy 4 β€” Sequential Markov Chain inference function
def s4_markov(user_id, cart_sequence, K = 10, n_gram = 2, alpha = 0.4, beta = 0.25, **_):
    """
    Sequential Markov Chain recommender.

    Score(X) = Σ P(next=X | prev ∈ cart[-n_gram:])          [Markov]
             + Ξ± Β· P(X appears at cart position len(cart))   [position]
             + Ξ² Β· reorder_score(user, X)                    [personal]

    Parameters
    ----------
    cart_sequence : list of product_ids in add-to-cart order
    n_gram        : how many trailing items to use as context (default 2)
    alpha         : weight for the position-popularity signal
    beta          : weight for the user's personal reorder score
    """
    in_cart  = set(cart_sequence)
    next_pos = len(cart_sequence)
    context  = cart_sequence[-n_gram:] if len(cart_sequence) >= n_gram else cart_sequence

    scores = defaultdict(float)

    # (A) Markov transition scores from the last n_gram cart items
    for prev in context:
        for nxt, prob in trans_prob.get(prev, {}).items():
            if nxt not in in_cart:
                scores[nxt] += prob

    # (B) Position-specific popularity at the *next* cart slot
    capped_pos = min(next_pos, MAX_POS - 1)
    for pid, prob in pos_prob.get(capped_pos, {}).items():
        if pid not in in_cart:
            scores[pid] += alpha * prob

    # (C) Blend personal purchase affinity
    for pid, rs in user_rs_list.get(user_id, []):
        if pid not in in_cart and rs > 0:
            scores[pid] += beta * float(rs)

    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) >= K:
        return ranked[:K]

    extra = [p for p in s2_personal(user_id, cart_sequence, K * 2) if p not in set(ranked)]
    return (ranked + extra)[:K]

# Apply function
print("βœ…  S4 (Sequential Markov Chain) ready")
recs4 = s4_markov(_u0, _c0, K=3)
print(f"   Sample recs : {[pid_to_name.get(p,'?')[:22] for p in recs4]}")
print()
print("Position signal demo β€” top 3 products per cart slot:")
print(f"{'Pos':>4}  {'#1 product':<30} {'#2 product':<30} {'#3 product':<30}")
print("─" * 97)

for pos in range(6):
    slot = pos_prob.get(pos, {})
    top3 = sorted(slot, key=slot.get, reverse=True)[:3]
    cols = [f"{pid_to_name.get(p,'?')[:26]} ({slot[p]:.2%})" for p in top3]
    while len(cols) < 3:
        cols.append("β€”")
    print(f"{pos:>4}  {cols[0]:<30} {cols[1]:<30} {cols[2]:<30}")
βœ…  S4 (Sequential Markov Chain) ready
   Sample recs : ['sweet red grape tomato', 'banana', 'whole wheat bread']

Position signal demo β€” top 3 products per cart slot:
 Pos  #1 product                     #2 product                     #3 product                    
─────────────────────────────────────────────────────────────────────────────────────────────────
   0  banana (9.46%)                 bag of organic bananas (6.78%) organic whole milk (2.59%)    
   1  banana (7.48%)                 bag of organic bananas (5.99%) organic strawberries (2.78%)  
   2  banana (6.07%)                 bag of organic bananas (5.05%) organic strawberries (2.89%)  
   3  banana (5.04%)                 bag of organic bananas (4.25%) organic strawberries (2.91%)  
   4  banana (4.33%)                 bag of organic bananas (3.71%) organic strawberries (2.87%)  
   5  banana (3.95%)                 bag of organic bananas (3.29%) organic strawberries (2.77%)  
# ─────────────────────────────────────────────────────────────────────────────
# Strategy 5 β€” LightGBM feature matrix  (chunked, bounded candidate set)
# ─────────────────────────────────────────────────────────────────────────────
# Why previous versions OOM'd:
#   β€’ Full candidate matrix (8000 users Γ— 300+ candidates) = 2.4M rows
#     before a single feature column is added β€” too large to hold in RAM
#     alongside prior_op, trans_prob, etc.
#
# Fix β€” two constraints that bound memory absolutely:
#   1. CHUNK_SIZE users processed at a time; each chunk is del'd before next
#   2. Candidate pool capped: top MAX_PERSONAL history + top MAX_GLOBAL pool
#      β†’ at most (MAX_PERSONAL + MAX_GLOBAL) rows per user, guaranteed

ML_N_USERS   = 3000   # reduce if still tight; 3k is plenty for LightGBM
CHUNK_SIZE   = 200    # users per chunk β€” peak RAM β‰ˆ 200 Γ— 150 Γ— ~20 cols
MAX_PERSONAL = 80     # top items from user history by reorder_score
MAX_GLOBAL   = 80     # top globally popular items
# β†’ max ~160 candidates per user, ~32k rows per chunk

ml_train_users = eval_df['user_id'].sample(n=ML_N_USERS, random_state=42).tolist()
ml_train_df    = (eval_df[eval_df['user_id'].isin(ml_train_users)]
                  .reset_index(drop=True))

# Static lookups built once, outside the loop (copies avoid SettingWithCopyWarning)
prod_feat = product_stats[['product_id','order_count','reorder_rate', 'avg_cart_position','global_popularity']].copy()
prod_feat['pop_rank'] = prod_feat['order_count'].rank(pct = True).astype('float32')

user_feat = user_stats[['user_id','total_orders','avg_cart_size',
                         'reorder_rate','avg_days_between','variety_ratio']]

up_feat = (up_stats[['user_id','product_id','times_ordered','reorder_score',
                     'orders_since_last','avg_cart_position','reorder_rate']]
           .rename(columns={'avg_cart_position':'up_avg_cart_pos',
                            'reorder_rate':'up_reorder_rate'}))

# Cap global pool
GLOBAL_POOL = POPULARITY_POOL[:MAX_GLOBAL]

# Pre-convert assoc_lookup to pid-keyed dict once (avoids repeated name lookups)
assoc_by_pid = defaultdict(dict)
for ant_name, cons in assoc_lookup.items():
    ant_pid = name_to_pid.get(ant_name)
    if not ant_pid:
        continue
    for con_name, score in cons.items():
        con_pid = name_to_pid.get(con_name)
        if con_pid:
            assoc_by_pid[ant_pid][con_pid] = assoc_by_pid[ant_pid].get(con_pid, 0.0) + score

print(f"Users: {ML_N_USERS}  |  chunk: {CHUNK_SIZE}  |  "
      f"max candidates/user: {MAX_PERSONAL + MAX_GLOBAL}")
print(f"Expected rows: ~{ML_N_USERS * (MAX_PERSONAL + MAX_GLOBAL) // 2:,}")

t0 = time.time()
chunks = []

for chunk_start in range(0, len(ml_train_df), CHUNK_SIZE):

    chunk = ml_train_df.iloc[chunk_start : chunk_start + CHUNK_SIZE]
    rows  = []

    for _, row in chunk.iterrows():

        uid      = int(row['user_id'])
        true_set = row['true_items']
        cart_seq = row['cart_sequence']
        in_cart  = set(cart_seq)
        cart_sz  = len(cart_seq)

        # Bounded candidate set (set() hoisted out of the comprehension)
        personal = [p for p in user_history.get(uid, [])[:MAX_PERSONAL]
                    if p not in in_cart]
        personal_set = set(personal)
        globl      = [p for p in GLOBAL_POOL if p not in in_cart and p not in personal_set]
        candidates = personal + globl

        if not candidates:
            continue

        # Cart-context scores β€” computed once per user, not per candidate
        markov_sc = defaultdict(float)
        for prev in cart_seq:
            for nxt, p in trans_prob.get(prev, {}).items():
                markov_sc[nxt] += p

        assoc_sc = defaultdict(float)
        for ant_pid in cart_seq:
            for con_pid, sc in assoc_by_pid.get(ant_pid, {}).items():
                assoc_sc[con_pid] += sc

        capped_pos = min(cart_sz, MAX_POS - 1)
        pos_p      = pos_prob.get(capped_pos, {})

        for pid in candidates:
            rows.append((uid, pid,
                         int(pid in true_set),
                         cart_sz,
                         markov_sc.get(pid, 0.0),
                         assoc_sc.get(pid, 0.0),
                         pos_p.get(pid, 0.0)))

    if not rows:
        continue

    c_df = pd.DataFrame(rows, columns=['user_id','product_id','label',
                                        'cart_size','markov_score',
                                        'assoc_score','pos_prob_next'])
    c_df = (c_df
            .merge(prod_feat, on='product_id', how='left')
            .merge(user_feat, on='user_id',    how='left')
            .merge(up_feat,   on=['user_id','product_id'], how='left')
            .fillna(0))

    chunks.append(c_df)
    del c_df, rows
    gc.collect()

    if (chunk_start // CHUNK_SIZE) % 5 == 0:
        done = min(chunk_start + CHUNK_SIZE, len(ml_train_df))
        print(f"  {done}/{len(ml_train_df)} users …", end="")

cand_df = pd.concat(chunks, ignore_index=True)
del chunks
gc.collect()

pos_rate = cand_df['label'].mean()
print(f"\nDone in {time.time()-t0:.1f}s  |  rows: {len(cand_df):,}  "
      f"|  positive rate: {pos_rate:.3%}")
cand_df[['user_id','product_id','label','markov_score','assoc_score',
         'pos_prob_next','cart_size','reorder_score']].head(5)
Users: 3000  |  chunk: 200  |  max candidates/user: 160
Expected rows: ~240,000
  200/3000 users …  1200/3000 users …  2200/3000 users …
Done in 80.3s  |  rows: 333,300  |  positive rate: 3.461%
|   | user_id | product_id | label | markov_score | assoc_score | pos_prob_next | cart_size | reorder_score |
|---|---------|------------|-------|--------------|-------------|---------------|-----------|---------------|
| 0 | 144 | 42342 | 0 | 0.0394 | 0.0000 | 0.0000 | 6 | 0.3333 |
| 1 | 144 | 13603 | 0 | 0.0000 | 0.0000 | 0.0000 | 6 | 0.1667 |
| 2 | 144 | 22963 | 0 | 0.0752 | 0.0000 | 0.0030 | 6 | 0.0000 |
| 3 | 144 | 29871 | 0 | 0.0233 | 0.0000 | 0.0000 | 6 | 0.0000 |
| 4 | 144 | 32887 | 0 | 0.0000 | 0.0000 | 0.0000 | 6 | 0.0000 |
# Define feature columns and train LightGBM with group-aware train/val split
FEATURE_COLS = [
    'cart_size', 'markov_score', 'assoc_score', 'pos_prob_next',
    'order_count', 'reorder_rate', 'avg_cart_position', 'global_popularity', 'pop_rank',
    'total_orders', 'avg_cart_size', 'reorder_rate_user', 'avg_days_between', 'variety_ratio',
    'times_ordered', 'reorder_score', 'orders_since_last', 'up_avg_cart_pos', 'up_reorder_rate',
]

# Some column names duplicated (reorder_rate from product + user) β€” rename defensively
cols = list(cand_df.columns)
seen = {}; new_cols = []
for c in cols:
    if c in seen:
        seen[c] += 1
        new_cols.append(f"{c}_user" if c == "reorder_rate" else f"{c}_{seen[c]}")
    else:
        seen[c] = 0; new_cols.append(c)
cand_df.columns = new_cols

FEATURE_COLS = [c for c in FEATURE_COLS if c in cand_df.columns]

X = cand_df[FEATURE_COLS].values
y = cand_df['label'].values
groups = cand_df['user_id'].values

# Group-aware train/val split (no user leaks between folds)
gss  = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
tr_idx, va_idx = next(gss.split(X, y, groups))

X_tr, y_tr = X[tr_idx], y[tr_idx]
X_va, y_va = X[va_idx], y[va_idx]

# Up-weight positives so the effective class ratio is ~1 positive : 4 negatives
pos_weight = max(1, int((y_tr == 0).sum() / max((y_tr == 1).sum(), 1) / 4))

lgb_model = lgb.LGBMClassifier(
    n_estimators    = 400,
    learning_rate   = 0.05,
    num_leaves      = 63,
    max_depth       = -1,
    min_child_samples = 20,
    subsample       = 0.8,
    colsample_bytree= 0.8,
    scale_pos_weight= pos_weight,
    n_jobs          = -1,
    random_state    = 42,
    verbose         = -1,
)

print("Training LightGBM …")
t0 = time.time()
lgb_model.fit(X_tr, y_tr,
              eval_set  = [(X_va, y_va)],
              callbacks = [lgb.early_stopping(30, verbose=False),
                           lgb.log_evaluation(50)])
print(f"Done in {time.time()-t0:.1f}s  |  best iteration: {lgb_model.best_iteration_}")

va_pred = lgb_model.predict_proba(X_va)[:, 1]
print(f"\nVal ROC-AUC  : {roc_auc_score(y_va, va_pred):.4f}")
print(f"Val Avg-Prec : {average_precision_score(y_va, va_pred):.4f}")

# Feature importance
fi = pd.Series(lgb_model.feature_importances_, index=FEATURE_COLS).sort_values(ascending=False)
print("\nTop-10 feature importances:")
print(fi.head(10).to_string())
Training LightGBM …
Done in 1.9s  |  best iteration: 4

Val ROC-AUC  : 0.8648
Val Avg-Prec : 0.2269

Top-10 feature importances:
orders_since_last    32
avg_cart_size        32
markov_score         26
variety_ratio        24
reorder_score        22
order_count          22
total_orders         21
cart_size            19
avg_cart_position    17
avg_days_between     12
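The `scale_pos_weight` arithmetic from the training cell can be checked in isolation. A sketch with synthetic labels at roughly the ~3.5% positive rate reported above; the 35/965 split is illustrative, not the real label counts:

```python
import numpy as np

# Synthetic labels at roughly the ~3.5% positive rate reported above
y_tr = np.array([1] * 35 + [0] * 965)

neg, pos = (y_tr == 0).sum(), (y_tr == 1).sum()
# Same formula as the training cell: divide the raw neg/pos ratio (~28:1) by 4
# so the effective imbalance after weighting is ~1 positive : 4 negatives
pos_weight = max(1, int(neg / max(pos, 1) / 4))
print(pos_weight)   # 6 -> each positive counts ~6x in the loss
```

Dividing by 4 rather than fully balancing keeps some of the natural class prior, which tends to help calibration when the model's probabilities are later used for ranking.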
# Strategy 5 β€” LightGBM inference function (mirrors training feature pipeline)
def s5_lgbm(user_id, cart_sequence, K=10, **_):
    """LightGBM recommender β€” mirrors the training feature pipeline exactly."""
    in_cart = set(cart_sequence)
    cart_sz = len(cart_sequence)

    # Bounded candidates β€” same caps used at training time
    personal     = [p for p in user_history.get(user_id, [])[:MAX_PERSONAL] if p not in in_cart]
    personal_set = set(personal)
    globl        = [p for p in GLOBAL_POOL if p not in in_cart and p not in personal_set]
    candidates   = personal + globl

    if not candidates:
        return s1_popularity(user_id, cart_sequence, K)

    # Cart-context scores
    markov_sc = defaultdict(float)
    for prev in cart_sequence:
        for nxt, p in trans_prob.get(prev, {}).items():
            markov_sc[nxt] += p

    assoc_sc = defaultdict(float)                    # pid-keyed dict (matches training)
    for ant_pid in cart_sequence:
        for con_pid, sc in assoc_by_pid.get(ant_pid, {}).items():
            assoc_sc[con_pid] += sc

    pos_p = pos_prob.get(min(cart_sz, MAX_POS - 1), {})

    rows = [{"user_id": user_id, "product_id": int(pid),
             "cart_size": cart_sz,
             "markov_score":  float(markov_sc.get(pid, 0.0)),
             "assoc_score":   float(assoc_sc.get(pid, 0.0)),
             "pos_prob_next": float(pos_p.get(pid, 0.0))}
            for pid in candidates]

    inf_df = (pd.DataFrame(rows)
              .merge(prod_feat, on="product_id", how="left")
              .merge(user_feat, on="user_id",    how="left")
              .merge(up_feat,   on=["user_id","product_id"], how="left")
              .fillna(0))

    # Apply the same column rename as training
    cols = list(inf_df.columns); seen = {}; new_cols = []
    for c in cols:
        if c in seen:
            seen[c] += 1
            new_cols.append(f"{c}_user" if c == "reorder_rate" else f"{c}_{seen[c]}")
        else:
            seen[c] = 0; new_cols.append(c)
    inf_df.columns = new_cols

    for fc in FEATURE_COLS:              # guarantee column alignment
        if fc not in inf_df.columns:
            inf_df[fc] = 0.0

    scores = lgb_model.predict_proba(inf_df[FEATURE_COLS].values)[:, 1]
    inf_df["score"] = scores
    return inf_df.nlargest(K, "score")["product_id"].tolist()

print("βœ…  S5 (LightGBM) ready")
recs5 = s5_lgbm(_u0, _c0, K=3)
print(f"   Sample recs : {[pid_to_name.get(p,'?')[:25] for p in recs5]}")
βœ…  S5 (LightGBM) ready
   Sample recs : ['banana', 'sweet red grape tomatoes', 'whole wheat bread']

Back to top


9. Model Evaluation

Each strategy is scored on the same held-out eval_sample using five retrieval metrics at K = 5 and K = 10. The evaluation simulates a real shopping session: each user’s last prior order serves as the current cart, and we check whether the strategy’s top-K recommendations intersect with the user’s actual train basket.

| Metric | Formula | What it captures |
|--------|---------|------------------|
| Precision\@K | \|recs ∩ true\| / K | Of K recommendations, what fraction are relevant |
| Recall\@K | \|recs ∩ true\| / \|true\| | What fraction of the true basket was surfaced |
| F1\@K | 2Β·PΒ·R / (P+R) | Harmonic mean β€” balanced view of P and R |
| NDCG\@K | DCG / IDCG | Ranking quality β€” rewards relevant items appearing earlier in the list |
| Hit Rate\@K | 1 if any hit else 0 | Fraction of users who received β‰₯ 1 relevant recommendation |
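As a worked example of these formulas, take a toy case at K = 5: the true basket is {A, C, F} and the strategy returns [A, B, C, D, E], so the hits land at ranks 1 and 3:

```python
import numpy as np

# Toy case: true basket {A, C, F}, recommendations hit at ranks 1 and 3
recs, true_set, K = ["A", "B", "C", "D", "E"], {"A", "C", "F"}, 5

precision = len(set(recs[:K]) & true_set) / K              # 2/5
recall    = len(set(recs[:K]) & true_set) / len(true_set)  # 2/3
f1        = 2 * precision * recall / (precision + recall)  # 1/2
# DCG discounts each hit by log2(rank + 1); IDCG is the best achievable DCG
dcg  = sum(1 / np.log2(i + 2) for i, p in enumerate(recs[:K]) if p in true_set)
idcg = sum(1 / np.log2(i + 2) for i in range(min(len(true_set), K)))
print(round(precision, 3), round(recall, 3), round(f1, 3), round(dcg / idcg, 3))
# 0.4 0.667 0.5 0.704
```

Here DCG = 1/logβ‚‚(2) + 1/logβ‚‚(4) = 1.5 and IDCG β‰ˆ 2.131, so NDCG@5 β‰ˆ 0.704; a third hit at rank 2 instead of a miss would push it to 1.0 only if all three true items appeared in the top three slots.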
# Scoring functions
def precision_at_k(recs, true_set, k):
    return len(set(recs[:k]) & true_set) / k if k else 0.0

def recall_at_k(recs, true_set, k):
    return len(set(recs[:k]) & true_set) / len(true_set) if true_set else 0.0

def f1_at_k(recs, true_set, k):
    p = precision_at_k(recs, true_set, k)
    r = recall_at_k(recs, true_set, k)
    return 2*p*r/(p+r) if (p+r) > 0 else 0.0

def ndcg_at_k(recs, true_set, k):
    dcg   = sum(1/np.log2(i+2) for i, p in enumerate(recs[:k]) if p in true_set)
    ideal = sum(1/np.log2(i+2) for i in range(min(len(true_set), k)))
    return dcg/ideal if ideal else 0.0

def score_recs(recs, true_set, K):
    return {
        "precision": precision_at_k(recs, true_set, K),
        "recall":    recall_at_k(recs, true_set, K),
        "f1":        f1_at_k(recs, true_set, K),
        "ndcg":      ndcg_at_k(recs, true_set, K),
        "hit_rate":  1 if any(p in true_set for p in recs[:K]) else 0,
    }

def evaluate_strategy(fn, data, K=10, n=None, seed = 100):
    """Standard per-user evaluator β€” used for S1–S4."""
    rows = data if n is None else data.sample(n=n, random_state=seed)
    m = {"precision":[],"recall":[],"f1":[],"ndcg":[],"hit_rate":[]}
    for _, row in rows.iterrows():
        s = score_recs(fn(row["user_id"], row["cart_sequence"], K=K), row["true_items"], K)
        for key in m: m[key].append(s[key])
    return {k: float(np.mean(v)) for k, v in m.items()}

def evaluate_lgbm_batched(eval_data, K = 10, n = None, seed = 100, chunk_size=200):
    """
    Batched evaluator for S5 β€” builds the full feature matrix for
    chunk_size users at once, calls predict_proba once per chunk,
    then scores. Avoids 3000 individual DataFrame merges.
    """
    rows_df = eval_data if n is None else eval_data.sample(n=n, random_state=seed)
    rows_df = rows_df.reset_index(drop=True)
    m = {"precision":[],"recall":[],"f1":[],"ndcg":[],"hit_rate":[]}

    for chunk_start in range(0, len(rows_df), chunk_size):

        chunk    = rows_df.iloc[chunk_start:chunk_start + chunk_size]
        cand_rows = []

        for idx, row in chunk.iterrows():
            uid      = int(row["user_id"])
            cart_seq = row["cart_sequence"]
            in_cart  = set(cart_seq)
            cart_sz  = len(cart_seq)

            personal     = [p for p in user_history.get(uid, [])[:MAX_PERSONAL] if p not in in_cart]
            personal_set = set(personal)
            globl        = [p for p in GLOBAL_POOL if p not in in_cart and p not in personal_set]
            candidates   = personal + globl

            if not candidates:
                candidates = s1_popularity(uid, cart_seq, K)

            markov_sc = defaultdict(float)
            for prev in cart_seq:
                for nxt, p in trans_prob.get(prev, {}).items():
                    markov_sc[nxt] += p

            assoc_sc = defaultdict(float)
            for ant_pid in cart_seq:
                for con_pid, sc in assoc_by_pid.get(ant_pid, {}).items():
                    assoc_sc[con_pid] += sc

            pos_p = pos_prob.get(min(cart_sz, MAX_POS - 1), {})

            for pid in candidates:
                cand_rows.append((idx, uid, int(pid), cart_sz,
                                  float(markov_sc.get(pid, 0.0)),
                                  float(assoc_sc.get(pid, 0.0)),
                                  float(pos_p.get(pid, 0.0))))

        if not cand_rows:
            continue

        c_df = pd.DataFrame(cand_rows, columns=[
            "idx","user_id","product_id","cart_size",
            "markov_score","assoc_score","pos_prob_next"])
        c_df = (c_df
                .merge(prod_feat, on="product_id", how="left")
                .merge(user_feat, on="user_id",    how="left")
                .merge(up_feat,   on=["user_id","product_id"], how="left")
                .fillna(0))

        # Same column rename as training
        cols = list(c_df.columns); seen = {}; new_cols = []
        for col in cols:
            if col in seen:
                seen[col] += 1
                new_cols.append(f"{col}_user" if col == "reorder_rate" else f"{col}_{seen[col]}")
            else:
                seen[col] = 0; new_cols.append(col)
        c_df.columns = new_cols
        for fc in FEATURE_COLS:
            if fc not in c_df.columns: c_df[fc] = 0.0

        # One predict_proba call for the whole chunk
        c_df["score"] = lgb_model.predict_proba(c_df[FEATURE_COLS].values)[:, 1]

        # Score each user from the chunk results
        for idx, row in chunk.iterrows():
            user_cands = c_df[c_df["idx"] == idx]
            if user_cands.empty:
                recs = []
            else:
                recs = user_cands.nlargest(K, "score")["product_id"].tolist()
            s = score_recs(recs, row["true_items"], K)
            for key in m: m[key].append(s[key])

        del c_df, cand_rows; gc.collect()

    return {k: float(np.mean(v)) for k, v in m.items()}
# ── Run all five strategies at K=5 and K=10 ───────────────────────────────────
STRATEGY_MAP = {
    "S1: Popularity"       : s1_popularity,
    "S2: Personal History" : s2_personal,
    "S3: Assoc Rules"      : s3_rules,
    "S4: Markov Chain"     : s4_markov,
    "S5: LightGBM"         : s5_lgbm,   # inference fn still available for demo use
}

N_EVAL = 1500
all_results = {}

print(f"Evaluating {len(STRATEGY_MAP)} strategies on {N_EVAL} users …\n")
header = f"  {'Strategy':<22}  {'P@5':>6} {'R@5':>6} {'F1@5':>6} {'NDCG@5':>7}  {'P@10':>6} {'R@10':>6} {'F1@10':>6} {'NDCG@10':>8} {'Hit@10':>7}  {'s':>4}"
print(header); print("─"*len(header))

for name, fn in STRATEGY_MAP.items():
    row_parts = [f"  {name:<22}"]
    for k_val in [5, 10]:
        t0 = time.time()
        if name == "S5: LightGBM":
            r = evaluate_lgbm_batched(eval_sample, K=k_val, n=N_EVAL)
        else:
            r = evaluate_strategy(fn, eval_sample, K=k_val, n=N_EVAL)
        dt = time.time() - t0
        all_results[(name, k_val)] = r
        row_parts.append(
            f"  {r['precision']:6.4f} {r['recall']:6.4f} {r['f1']:6.4f} {r['ndcg']:7.4f}")
    row_parts.append(f"  {r['hit_rate']:7.4f}  {dt:4.1f}s")
    print("".join(row_parts))
Evaluating 5 strategies on 1500 users …

  Strategy                   P@5    R@5   F1@5  NDCG@5    P@10   R@10  F1@10  NDCG@10  Hit@10     s
───────────────────────────────────────────────────────────────────────────────────────────────────
  S1: Popularity          0.0544 0.0258 0.0313  0.0603  0.0449 0.0400 0.0380  0.0559   0.3200   0.2s
  S2: Personal History    0.2148 0.1245 0.1355  0.2518  0.1627 0.1744 0.1469  0.2306   0.6927   0.2s
  S3: Assoc Rules         0.1588 0.0996 0.1051  0.1822  0.1444 0.1619 0.1329  0.1891   0.6760   0.2s
  S4: Markov Chain        0.2061 0.1146 0.1275  0.2414  0.1516 0.1588 0.1352  0.2164   0.6593   0.5s
  S5: LightGBM            0.2241 0.1313 0.1428  0.2588  0.1720 0.1871 0.1554  0.2402   0.7233  49.1s
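As a reference for how these numbers are produced, the per-user metrics behind `score_recs` can be sketched as below. This is a minimal reimplementation under the standard top-K definitions, not necessarily identical to the notebook's Section 9 code:

```python
import numpy as np

def score_recs_sketch(recs, true_items, K):
    """Standard top-K metrics for one user: precision, recall, F1, NDCG, hit rate."""
    recs = list(recs)[:K]
    true_set = set(true_items)
    hits = [1 if p in true_set else 0 for p in recs]
    n_hits = sum(hits)
    precision = n_hits / K
    recall = n_hits / len(true_set) if true_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    # NDCG: log-discounted gain of hits, normalized by the best possible ordering
    dcg = sum(h / np.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1 / np.log2(i + 2) for i in range(min(len(true_set), K)))
    ndcg = dcg / idcg if idcg else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "ndcg": ndcg, "hit_rate": 1.0 if n_hits else 0.0}
```

Averaging these per-user dictionaries over the evaluation sample yields tables like the one above.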
# ── Results table ─────────────────────────────────────────────────────────────
result_rows = []
for (name, k), m in all_results.items():
    result_rows.append({"Strategy": name, "K": k, **m})
results_df = pd.DataFrame(result_rows)
print("=== Full Results Table ===")
display(results_df.sort_values(["K","ndcg"], ascending=[True,False]).reset_index(drop=True))

# ── Grouped bar charts ────────────────────────────────────────────────────────
strategy_names = list(STRATEGY_MAP.keys())
palette = sns.color_palette("husl", len(strategy_names))
metrics_info = [("precision","Precision@K"),("recall","Recall@K"),
                ("f1","F1@K"),("ndcg","NDCG@K"),("hit_rate","Hit Rate@K")]

fig, axes = plt.subplots(2, 3, figsize=(22, 10))
for ax_idx, (metric, label) in enumerate(metrics_info):
    ax = axes[ax_idx//3][ax_idx%3]
    x  = np.arange(len(strategy_names)); w = 0.35
    for ki, k_val in enumerate([5, 10]):
        vals = [all_results[(s, k_val)][metric] for s in strategy_names]
        bars = ax.bar(x + ki*w, vals, w, label=f"K={k_val}",
                      color=[palette[i] for i in range(len(strategy_names))],
                      alpha=0.7+ki*0.2, edgecolor="white")
        for bar, v in zip(bars, vals):
            ax.text(bar.get_x()+bar.get_width()/2, v+0.001,
                    f"{v:.3f}", ha="center", va="bottom", fontsize=7, rotation=45)
    ax.set_xticks(x+w/2)
    ax.set_xticklabels([s.split(":")[0] for s in strategy_names], fontsize=9)
    ax.set_title(label, fontsize=12, fontweight="bold")
    ax.legend(fontsize=8); ax.set_ylim(0, ax.get_ylim()[1]*1.2)

# Feature importance panel
ax = axes[1][2]
fi_top = fi.head(10)
bars = ax.barh(fi_top.index[::-1], fi_top.values[::-1],
               color=sns.color_palette("viridis", 10))
ax.set_title("LightGBM Feature Importance (top 10)", fontsize=12, fontweight="bold")
ax.set_xlabel("Importance")
cart_feats = {"markov_score","assoc_score","pos_prob_next","cart_size"}
for bar, feat in zip(bars[::-1], fi_top.index):
    if feat in cart_feats:
        ax.text(bar.get_width()*0.03, bar.get_y()+bar.get_height()/2,
                "πŸ›’ cart-ctx", va="center", fontsize=7, color="navy")

plt.suptitle("Strategy Benchmark — All Metrics @ K=5 and K=10",
             fontsize=15, fontweight="bold", y=1.01)
plt.tight_layout(); plt.show()

# ── Radar chart ───────────────────────────────────────────────────────────────
radar_metrics = ["precision","recall","f1","ndcg","hit_rate"]
radar_labels  = ["P@10","R@10","F1@10","NDCG@10","Hit@10"]
N = len(radar_metrics)
angles = np.linspace(0, 2*np.pi, N, endpoint=False).tolist(); angles += angles[:1]

fig_r, ax_r = plt.subplots(figsize=(8,8), subplot_kw=dict(polar=True))
colors = sns.color_palette("husl", len(strategy_names))
for i, sname in enumerate(strategy_names):
    vals = [all_results[(sname, 10)][m] for m in radar_metrics]; vals += vals[:1]
    ax_r.plot(angles, vals, linewidth=2, label=sname, color=colors[i])
    ax_r.fill(angles, vals, alpha=0.08, color=colors[i])
ax_r.set_xticks(angles[:-1]); ax_r.set_xticklabels(radar_labels, fontsize=11)
ax_r.set_title("Strategy Profiles @ K=10", fontsize=14, fontweight="bold", pad=20)
ax_r.legend(loc="upper right", bbox_to_anchor=(1.35, 1.15), fontsize=9)
plt.tight_layout(); plt.show()
=== Full Results Table ===
| # | Strategy | K | precision | recall | f1 | ndcg | hit_rate |
|---|----------|---|-----------|--------|----|------|----------|
| 0 | S5: LightGBM | 5 | 0.2241 | 0.1313 | 0.1428 | 0.2588 | 0.6273 |
| 1 | S2: Personal History | 5 | 0.2148 | 0.1245 | 0.1355 | 0.2518 | 0.5980 |
| 2 | S4: Markov Chain | 5 | 0.2061 | 0.1146 | 0.1275 | 0.2414 | 0.5780 |
| 3 | S3: Assoc Rules | 5 | 0.1588 | 0.0996 | 0.1051 | 0.1822 | 0.4920 |
| 4 | S1: Popularity | 5 | 0.0544 | 0.0258 | 0.0313 | 0.0603 | 0.2333 |
| 5 | S5: LightGBM | 10 | 0.1720 | 0.1871 | 0.1554 | 0.2402 | 0.7233 |
| 6 | S2: Personal History | 10 | 0.1627 | 0.1744 | 0.1469 | 0.2306 | 0.6927 |
| 7 | S4: Markov Chain | 10 | 0.1516 | 0.1588 | 0.1352 | 0.2164 | 0.6593 |
| 8 | S3: Assoc Rules | 10 | 0.1444 | 0.1619 | 0.1329 | 0.1891 | 0.6760 |
| 9 | S1: Popularity | 10 | 0.0449 | 0.0400 | 0.0380 | 0.0559 | 0.3200 |

(figure: grouped bar charts of all five metrics at K=5 and K=10, with the LightGBM top-10 feature importance panel)

(figure: radar chart of strategy profiles at K=10)

Back to top.


10. Interactive Prediction Demo & Leaderboard

A unified recommend() call wraps all five strategies behind a single interface. Three progressive demos illustrate the system in action:

  • Demo A — LightGBM recommendations updating in real time as items are added one by one to the cart
  • Demo B — Side-by-side strategy comparison on a single user’s full cart, showing which strategy scores the most hits
  • Demo C — Custom cart specified by product name strings (e.g., "banana", "organic milk"), with output from Markov, Rules, and LightGBM

The section closes with a composite leaderboard ranking all five strategies by an equal-weighted average of all reported metrics.

# Unified recommend() wrapper — delegates to whichever strategy is selected
def recommend(user_id, cart_sequence, K=10, strategy="lgbm"):
    """
    strategy: "popularity" | "personal" | "rules" | "markov" | "lgbm"
    Returns a DataFrame: rank, product, department, aisle
    """
    fn_map = {"popularity": s1_popularity,
              "personal":   s2_personal,
              "rules":      s3_rules,
              "markov":     s4_markov,
              "lgbm":       s5_lgbm}

    # Unknown strategy strings fall back to the LightGBM model
    pids = fn_map.get(strategy, s5_lgbm)(user_id, cart_sequence, K=K)
    out  = pd.DataFrame({"rank": range(1, len(pids) + 1), "product_id": pids})
    return (out.merge(products[["product_id", "product", "department", "aisle"]],
                      on="product_id", how="left")
               [["rank", "product", "department", "aisle"]])
# Demo A β€” recommendations update as the cart grows
demo_user = eval_sample["user_id"].iloc[5]
demo_cart = eval_sample["cart_sequence"].iloc[5]
demo_true = eval_sample["true_items"].iloc[5]

print(f"User {demo_user}  |  true next basket ({len(demo_true)} items):")
print([pid_to_name.get(p,"?")[:25] for p in list(demo_true)[:6]], "...\n")
print("="*72)
print(f"  {'Cart':>5}  {'Last item added':<30}  Top-5 recs (LightGBM)  [hits ✅]")
print("="*72)

for reveal in [1, 3, 5, len(demo_cart)]:
    partial = demo_cart[:reveal]
    last    = pid_to_name.get(partial[-1],"?")[:28]
    recs    = s5_lgbm(demo_user, partial, K=5)
    names   = [pid_to_name.get(p,"?")[:16]+("✅" if p in demo_true else "") for p in recs]
    print(f"  {reveal:>5}  {last:<30}  {names}")
User 12578  |  true next basket (2 items):
['hazelnut spread with coco', 'lobster bisque'] ...

========================================================================
   Cart  Last item added                 Top-5 recs (LightGBM)  [hits ✅]
========================================================================
      1  organic simply naked pita ch    ['non fat greek yo', 'april fresh liqu', 'organic salted b', 'boneless skinles', 'mango peach sals']
      3  mango peach salsa               ['non fat greek yo', 'april fresh liqu', 'organic salted b', 'boneless skinles', 'cereal']
      5  organic baby spinach            ['non fat greek yo', 'april fresh liqu', 'organic salted b', 'boneless skinles', 'bag of organic b']
      6  bag of organic bananas          ['non fat greek yo', 'april fresh liqu', 'organic salted b', 'boneless skinles', 'strawberries']
# Demo B β€” side-by-side strategy comparison on a single user
K_demo = 10
print(f"User {demo_user}  |  Cart size: {len(demo_cart)}  |  True basket: {len(demo_true)} items")
print(f"Cart: {[pid_to_name.get(p,'?')[:18] for p in demo_cart[:4]]}…\n")

comparison = {}

for sname, fn in STRATEGY_MAP.items():
    recs = fn(demo_user, demo_cart, K=K_demo)
    hits = [p for p in recs if p in demo_true]
    comparison[sname] = {"recs":recs,"hits":len(hits),
                         "P": len(hits)/K_demo,
                         "R": len(hits)/len(demo_true) if demo_true else 0}

comp_df = pd.DataFrame({
    sname: [pid_to_name.get(p,"?")[:28]+(" ✅" if p in demo_true else "")
            for p in d["recs"]]
    for sname, d in comparison.items()
})
comp_df.index = [f"Rec {i+1}" for i in range(K_demo)]
display(comp_df)
print("\nP@10 / R@10 per strategy:")
for sname, d in comparison.items():
    print(f"  {sname:<22}  hits={d['hits']}  P@10={d['P']:.2f}  R@10={d['R']:.2f}")
User 12578  |  Cart size: 6  |  True basket: 2 items
Cart: ['organic simply nak', 'wheat thins origin', 'mango peach salsa', 'cereal']…
| Rec | S1: Popularity | S2: Personal History | S3: Assoc Rules | S4: Markov Chain | S5: LightGBM |
|-----|----------------|----------------------|-----------------|------------------|--------------|
| 1 | banana | non fat greek yogurt | organic hass avocado | non fat greek yogurt | non fat greek yogurt |
| 2 | organic strawberries | organic salted butter | organic strawberries | boneless skinless chicken br | april fresh liquid fabric so |
| 3 | organic hass avocado | boneless skinless chicken br | organic avocado | organic salted butter | organic salted butter |
| 4 | organic avocado | april fresh liquid fabric so | banana | organic hass avocado | boneless skinless chicken br |
| 5 | large lemon | brioche hamburger buns | organic raspberries | organic strawberries | strawberries |
| 6 | strawberries | cheez it cheddar cracker | large lemon | april fresh liquid fabric so | organic tortilla chips |
| 7 | limes | naked green machine boosted | non fat greek yogurt | organic avocado | cheez it cheddar cracker |
| 8 | organic raspberries | chicken pot pie | organic salted butter | organic raspberries | brioche hamburger buns |
| 9 | organic whole milk | organic spaghetti | boneless skinless chicken br | banana | naked green machine boosted |
| 10 | organic yellow onion | strawberries | april fresh liquid fabric so | naked green machine boosted | raspberries |
P@10 / R@10 per strategy:
  S1: Popularity          hits=0  P@10=0.00  R@10=0.00
  S2: Personal History    hits=0  P@10=0.00  R@10=0.00
  S3: Assoc Rules         hits=0  P@10=0.00  R@10=0.00
  S4: Markov Chain        hits=0  P@10=0.00  R@10=0.00
  S5: LightGBM            hits=0  P@10=0.00  R@10=0.00
# Demo C β€” custom cart specified by product name strings
print("\n=== Custom cart: banana, strawberry, organic milk ===")
demo_u2 = eval_sample["user_id"].iloc[10]
cart_names = ["banana","strawberry","organic milk"]
cart_ids = []

for name in cart_names:
    match = [pid for pname,pid in name_to_pid.items() if name.lower() in pname.lower()]
    if match: cart_ids.append(match[0])

print(f"Resolved: {[pid_to_name.get(p,'?')[:25] for p in cart_ids]}\n")

for strat in ["markov","rules","lgbm"]:
    print(f"── {strat.upper()} ──")
    display(recommend(demo_u2, cart_ids, K=5, strategy=strat))
=== Custom cart: banana, strawberry, organic milk ===
Resolved: ['banana sweet potato organ', 'light strawberry blueberr', 'organic milk']

── MARKOV ──
| rank | product | department | aisle |
|------|---------|------------|-------|
| 1 | happy baby spinach mango and pear baby food | babies | baby food formula |
| 2 | blueberry oats and quinoa whole grain snack | babies | baby food formula |
| 3 | organic banana blueberry baby food puree | babies | baby food formula |
| 4 | alpine spring water | beverages | water seltzer sparkling water |
| 5 | berry barley organic baby food | babies | baby food formula |
── RULES ──
| rank | product | department | aisle |
|------|---------|------------|-------|
| 1 | happy baby spinach mango and pear baby food | babies | baby food formula |
| 2 | blueberry oats and quinoa whole grain snack | babies | baby food formula |
| 3 | organic banana blueberry baby food puree | babies | baby food formula |
| 4 | alpine spring water | beverages | water seltzer sparkling water |
| 5 | berry barley organic baby food | babies | baby food formula |
── LGBM ──
| rank | product | department | aisle |
|------|---------|------------|-------|
| 1 | banana | produce | fresh fruits |
| 2 | organic pear raspberry purple carrot stage pouch | babies | baby food formula |
| 3 | organic stage broccoli pears peas baby food | babies | baby food formula |
| 4 | stage spinach apple kale | babies | baby food formula |
| 5 | peter rabbit organics kale broccoli and mango ... | babies | baby food formula |
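Notice that the naive substring lookup resolved "banana" to a banana-sweet-potato baby pouch rather than plain bananas, which then pulls every strategy toward the baby-food aisle. A small tweak, sketched here with a hypothetical `resolve_name` helper over the same `name_to_pid` mapping, prefers exact matches and breaks ties toward the shortest (most generic) product name:

```python
def resolve_name(query, name_to_pid):
    """Resolve a free-text query to a product_id.

    Prefer an exact (case-insensitive) name match; otherwise fall back to
    the shortest product name containing the query, which tends to be the
    most generic product ("banana" beats "banana sweet potato organic ...").
    """
    q = query.lower()
    exact = [pid for name, pid in name_to_pid.items() if name.lower() == q]
    if exact:
        return exact[0]
    partial = [(name, pid) for name, pid in name_to_pid.items() if q in name.lower()]
    if not partial:
        return None  # no match at all
    return min(partial, key=lambda t: len(t[0]))[1]
```

Under this rule, "banana" resolves to the plain banana whenever that product exists in the catalog.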
# Construct final composite leaderboard across all strategies
lb_rows = []

for sname in strategy_names:
    r5  = all_results[(sname, 5)]
    r10 = all_results[(sname, 10)]
    composite = np.mean([r5["precision"],r5["recall"],r5["f1"],r5["ndcg"],
                         r10["precision"],r10["recall"],r10["f1"],r10["ndcg"],
                         r10["hit_rate"]])
    lb_rows.append({"Strategy":sname,
                    "P@5":f"{r5['precision']:.4f}","R@5":f"{r5['recall']:.4f}",
                    "F1@5":f"{r5['f1']:.4f}","NDCG@5":f"{r5['ndcg']:.4f}",
                    "P@10":f"{r10['precision']:.4f}","R@10":f"{r10['recall']:.4f}",
                    "F1@10":f"{r10['f1']:.4f}","NDCG@10":f"{r10['ndcg']:.4f}",
                    "Hit@10":f"{r10['hit_rate']:.4f}",
                    "Composite ↑":round(composite,4)})

lb = (pd.DataFrame(lb_rows)
        .sort_values("Composite ↑", ascending=False)
        .reset_index(drop=True))
lb.index += 1
print("\nπŸ†  Strategy Leaderboard")
display(lb)
print(f"\nπŸ₯‡  Winner: {lb.iloc[0]['Strategy']}")
print("\nKey takeaways:")
print("  β€’ Cart-sequence signals (Markov + position) lift recs beyond popularity baselines.")
print("  β€’ LightGBM combines all signal types; typically best NDCG (rewards correct ranking).")
print("  β€’ Association Rules add diversity; Markov is fastest at inference.")
print("  β€’ Cold-start users (no history): fall back to S1 + S3.")
πŸ†  Strategy Leaderboard
| # | Strategy | P@5 | R@5 | F1@5 | NDCG@5 | P@10 | R@10 | F1@10 | NDCG@10 | Hit@10 | Composite ↑ |
|---|----------|-----|-----|------|--------|------|------|-------|---------|--------|-------------|
| 1 | S5: LightGBM | 0.2241 | 0.1313 | 0.1428 | 0.2588 | 0.1720 | 0.1871 | 0.1554 | 0.2402 | 0.7233 | 0.2483 |
| 2 | S2: Personal History | 0.2148 | 0.1245 | 0.1355 | 0.2518 | 0.1627 | 0.1744 | 0.1469 | 0.2306 | 0.6927 | 0.2371 |
| 3 | S4: Markov Chain | 0.2061 | 0.1146 | 0.1275 | 0.2414 | 0.1516 | 0.1588 | 0.1352 | 0.2164 | 0.6593 | 0.2235 |
| 4 | S3: Assoc Rules | 0.1588 | 0.0996 | 0.1051 | 0.1822 | 0.1444 | 0.1619 | 0.1329 | 0.1891 | 0.6760 | 0.2056 |
| 5 | S1: Popularity | 0.0544 | 0.0258 | 0.0313 | 0.0603 | 0.0449 | 0.0400 | 0.0380 | 0.0559 | 0.3200 | 0.0745 |
🥇  Winner: S5: LightGBM

Key takeaways:
  • Cart-sequence signals (Markov + position) lift recs beyond popularity baselines.
  • LightGBM combines all signal types; typically best NDCG (rewards correct ranking).
  • Association Rules add diversity; Markov is fastest at inference.
  • Cold-start users (no history): fall back to S1 + S3.

Back to top.