MovieLens User Data: Machine Learning Recommendation Engine

55 minute read

Published:

In this post, we take the MovieLens 32M ratings data and build a series of recommendation algorithms that are then combined into a master ensemble strategy, which ends up winning on the performance metrics.

🎬 MovieLens 32M: A Two-Stage Generate-&-Rerank Recommendation Ensemble

Data: Download ml-32m.zip from https://grouplens.org/datasets/movielens/32m/, mount your Google Drive, and point ZIP_PATH to the archive. All other paths are derived automatically.

In the last study we examined five standalone recommendation strategies applied to Instacart shopping data.

Here we assess a combined ensemble strategy, S6, that draws on all five priors simultaneously to produce more accurate top-N movie recommendations.


📋 Table of Contents

1. Setup & data loading
2. Exploratory data analysis
3. Feature engineering
4. Item-based collaborative filtering
5. Temporal sequence modelling
6. Evaluation harness & S1-S5 strategies
7. S6 architecture & cold-start gate
8. Stage 1: Expanded candidate pool
9. Stage 2: Meta-ranker training
10. Full benchmark: S1 → S7
11. Visualisations
12. Leaderboard & takeaways

1. Setup & data loading

📖 Dataset & Problem Framing

What the data is. The MovieLens 32M dataset contains 32,000,204 explicit 5-star ratings (in 0.5-star increments) and 2,000,072 tag applications across 87,585 movies by 200,948 users, collected between January 1995 and October 2023.

File         Key columns                          Role
ratings.csv  userId, movieId, rating, timestamp   Primary signal
movies.csv   movieId, title, genres               Item metadata
tags.csv     userId, movieId, tag, timestamp      Auxiliary text signal
links.csv    movieId, imdbId, tmdbId              External enrichment (optional)

Task framing. Given a user's historical rating behaviour, predict which movies they will rate highly (≥ 4.0 stars) that they have not yet seen. Ratings below 4.0 are treated as implicit negatives; ratings ≥ 4.0 as positives.
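A minimal sketch of this labelling rule on toy data (the `liked` column name here is illustrative only and is not used by the pipeline below):

```python
import pandas as pd

# Illustrative labelling: ratings >= 4.0 are positives, everything below
# is an implicit negative. Toy data, not the real ratings frame.
toy = pd.DataFrame({'rating': [3.5, 4.0, 5.0, 2.0]})
toy['liked'] = (toy['rating'] >= 4.0).astype('int8')
print(toy['liked'].tolist())  # [0, 1, 1, 0]
```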

Train/test split strategy. For each user, all ratings before the 90th-percentile timestamp are training data; ratings at or after it form the ground truth. This preserves the causal structure and avoids leakage.

Memory strategy. Downcast userId / movieId to int32, rating to float32. Free large intermediates with del + gc.collect() after their last usage point.
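To see what the downcast buys, here is a small self-contained demonstration on a synthetic frame mimicking the ratings schema (sizes here are illustrative, not the real dataset's):

```python
import numpy as np
import pandas as pd

# 1M synthetic rows with the default 64-bit dtypes.
df = pd.DataFrame({'userId': np.arange(1_000_000, dtype='int64'),
                   'rating': np.random.uniform(0.5, 5.0, 1_000_000)})
before = df.memory_usage(deep=True).sum()
df['userId'] = df['userId'].astype('int32')    # 8 bytes -> 4 bytes per value
df['rating'] = df['rating'].astype('float32')  # 8 bytes -> 4 bytes per value
after = df.memory_usage(deep=True).sum()
print(f'{before / 1e6:.0f} MB -> {after / 1e6:.0f} MB')  # roughly halved
```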

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import zipfile, gc, time, warnings, os
from collections import defaultdict, Counter
from itertools import combinations
import lightgbm as lgb
from xgboost import XGBClassifier
from sklearn.model_selection import GroupShuffleSplit, train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize
from google.colab import drive
from matplotlib.gridspec import GridSpec
from joblib import Parallel, delayed

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
pd.set_option('display.float_format', '{:.4f}'.format)
pd.set_option('display.max_columns', None)
print('✅  Libraries loaded')
✅  Libraries loaded
# Read the data
drive.mount('/content/drive/')
Mounted at /content/drive/
# Define functions to compute metrics
def precision_at_k(recs, true_set, k):

    return len(set(recs[:k]) & true_set) / k if k else 0.0

def recall_at_k(recs, true_set, k):

    return len(set(recs[:k]) & true_set) / len(true_set) if true_set else 0.0

def f1_at_k(recs, true_set, k):

    p = precision_at_k(recs, true_set, k)
    r = recall_at_k(recs, true_set, k)

    return 2*p*r/(p+r) if (p+r) > 0 else 0.0

def ndcg_at_k(recs, true_set, k):

    dcg   = sum(1/np.log2(i+2) for i, m in enumerate(recs[:k]) if m in true_set)
    ideal = sum(1/np.log2(i+2) for i in range(min(len(true_set), k)))

    return dcg/ideal if ideal else 0.0

def score_recs(recs, true_set, K):

    return {
        'precision': precision_at_k(recs, true_set, K),
        'recall'   : recall_at_k(recs,    true_set, K),
        'f1'       : f1_at_k(recs,        true_set, K),
        'ndcg'     : ndcg_at_k(recs,      true_set, K),
        'hit_rate' : 1 if any(m in true_set for m in recs[:K]) else 0,
    }

def evaluate_strategy(fn, data_df, K=10, n=None, seed=42):

    rows = data_df if n is None else data_df.sample(n=n, random_state=seed)
    m = {k: [] for k in ('precision','recall','f1','ndcg','hit_rate')}

    for _, row in rows.iterrows():

        s = score_recs(fn(row['userId'], K=K), row['true_items'], K)

        for key in m: m[key].append(s[key])

    return {k: float(np.mean(v)) for k, v in m.items()}
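A quick sanity check of the metric helpers on toy data (two of the functions above are re-declared here so the snippet runs standalone): with 5 recommendations, 2 of which are among the 3 relevant items, precision@5 is 2/5 and recall@5 is 2/3.

```python
# Standalone copies of the two simplest metric helpers above.
def precision_at_k(recs, true_set, k):
    return len(set(recs[:k]) & true_set) / k if k else 0.0

def recall_at_k(recs, true_set, k):
    return len(set(recs[:k]) & true_set) / len(true_set) if true_set else 0.0

recs, true_set = [1, 2, 3, 4, 5], {2, 5, 7}
print(precision_at_k(recs, true_set, 5))          # 0.4   (2 of 5 recs hit)
print(round(recall_at_k(recs, true_set, 5), 4))   # 0.6667 (2 of 3 relevant found)
```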

# ── Lookup caches ─────────────────────────────────────────────────────────
# _seen_cache  : user_id -> set of seen movie IDs
# _s2_score_cache : user_id -> precomputed S2 genre score vector (n_movies,)
# _s4_score_cache : user_id -> precomputed S4 taste score vector (n_movies,)
#
# All three are initialised empty here and populated before they are needed.
# When a score cache is empty (e.g. during training pair construction, which
# uses its own batch arrays), the functions fall back to computing on the fly.
_seen_cache     = {}
_s2_score_cache = {}
_s4_score_cache = {}

def get_seen(user_id):

    return _seen_cache.get(user_id, set())

# Function to compute global popularity
def s1_popularity(user_id, K=10, **_):

    seen = get_seen(user_id)

    return [m for m in POPULARITY_POOL if m not in seen][:K]

# Function to compute genre-aware personal history
def s2_personal(user_id, K=10, **_):

    seen = get_seen(user_id)

    if user_id not in user_genre_affinity.index:

        return s1_popularity(user_id, K=K)

    # Use precomputed score vector when available (evaluation path); fall back to on-the-fly matmul otherwise (any ad-hoc call).
    if user_id in _s2_score_cache:

        scores = _s2_score_cache[user_id]

    else:

        u_genre = user_genre_affinity.loc[user_id].values
        u_norm  = np.linalg.norm(u_genre)

        if u_norm == 0:

            return s1_popularity(user_id, K=K)

        scores = movie_genre_norm @ (u_genre / u_norm)

    top_k  = np.argpartition(scores, -K)[-K:]
    top_k  = top_k[np.argsort(scores[top_k])[::-1]]
    ranked = [genre_index_order[i] for i in top_k if genre_index_order[i] not in seen]

    if len(ranked) >= K:

        return ranked[:K]

    # Pad with additional candidates if top-K after filtering is short
    all_ranked = [genre_index_order[i] for i in np.argsort(scores)[::-1] if genre_index_order[i] not in seen]

    return all_ranked[:K]
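The top-K selection inside s2_personal uses np.argpartition followed by a small argsort: argpartition is O(n) and merely guarantees the K largest scores land in the last K slots (unordered), after which only those K are sorted. A minimal standalone illustration:

```python
import numpy as np

scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
K = 2
# argpartition: indices of the K largest values end up in the last K
# positions, in no particular order (O(n), no full sort).
top_k = np.argpartition(scores, -K)[-K:]
# Order just those K indices by descending score (O(K log K)).
top_k = top_k[np.argsort(scores[top_k])[::-1]]
print(top_k.tolist())  # [1, 3]  (scores 0.9 and 0.7)
```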

# Function to compute collaborative filtering
def s3_item_cf(user_id, K=10, **_):

    seen  = get_seen(user_id)
    liked = set(user_liked.get(user_id, []))

    if not liked:

        return s1_popularity(user_id, K=K)

    scores = defaultdict(float)

    for mid in liked:

        for neighbour, sim in item_sim_lookup.get(mid, []):

            if neighbour not in seen:

                scores[neighbour] += sim
    ranked = sorted(scores, key=scores.get, reverse=True)

    if len(ranked) >= K:

        return ranked[:K]

    extra = [m for m in s2_personal(user_id, K=K*3) if m not in set(ranked)]
    return (ranked + extra)[:K]

# Function to compute temporal taste drift
def s4_temporal(user_id, K=10, **_):

    seen = get_seen(user_id)
    # Use precomputed score vector when available (evaluation path); fall back to on-the-fly matmul otherwise.
    if user_id in _s4_score_cache:

        sims = _s4_score_cache[user_id]

    else:

        taste_vec = user_taste_vec.get(user_id)

        if taste_vec is None:

            return s1_popularity(user_id, K=K)

        sims = movie_genre_norm @ taste_vec

    top_k  = np.argpartition(sims, -K)[-K:]
    top_k  = top_k[np.argsort(sims[top_k])[::-1]]
    ranked = [genre_index_order[i] for i in top_k if genre_index_order[i] not in seen]

    if len(ranked) >= K:

        return ranked[:K]

    all_ranked = [genre_index_order[i] for i in np.argsort(sims)[::-1] if genre_index_order[i] not in seen]

    return all_ranked[:K]

# Function to build features
def build_lgb_features(user_id, candidate_movies):

    if not candidate_movies:

        return pd.DataFrame(columns = FEATURE_COLS)

    mf = movie_feat.reindex(candidate_movies)
    mg = movie_genre.reindex(candidate_movies).fillna(0).values.astype('float32')

    mg_norms = np.linalg.norm(mg, axis=1, keepdims=True).clip(min=1e-9)
    mg_norm  = mg / mg_norms

    taste_v        = user_taste_vec.get(user_id, np.zeros(len(genre_cols), dtype='float32'))
    genre_affinity = mg_norm @ taste_v

    u_row = user_stats.loc[user_id] if user_id in user_stats.index else None

    return pd.DataFrame({'userId'            : user_id,
                         'movieId'           : candidate_movies,
                         'u_rating_count'    : float(u_row['rating_count'])   if u_row is not None else 0.0,
                         'u_mean_rating'     : float(u_row['mean_rating'])    if u_row is not None else GLOBAL_MEAN,
                         'u_rating_freq'     : float(u_row['rating_freq'])    if u_row is not None else 0.0,
                         'u_rating_std'      : float(u_row['rating_std'])     if u_row is not None else 0.0,
                         'm_global_count'    : mf['global_count'].fillna(0).values,
                         'm_log_count'       : mf['log_count'].fillna(0).values,
                         'm_global_mean'     : mf['global_mean'].fillna(GLOBAL_MEAN).values,
                         'm_bayesian'        : mf['bayesian_score'].fillna(GLOBAL_MEAN).values,
                         'm_rating_variance' : mf['rating_variance'].fillna(0).values,
                         'm_tag_count'       : mf['tag_count'].fillna(0).values,
                         'genre_affinity'    : genre_affinity,})

# Function to perform lgb recommendations
def s5_lgbm(user_id, K = 10, n_candidates = 200, **_):

    seen = get_seen(user_id)
    candidates = list(dict.fromkeys(s2_personal(user_id, K = n_candidates) + s3_item_cf( user_id, K = n_candidates) + s4_temporal(user_id, K=n_candidates)))

    candidates = [m for m in candidates if m not in seen][:n_candidates]

    if not candidates:

        return s1_popularity(user_id, K=K)

    feat_df = build_lgb_features(user_id, candidates)
    feat_df['score'] = lgb_model.predict(feat_df[FEATURE_COLS])

    return feat_df.sort_values('score', ascending = False)['movieId'].tolist()[:K]

# Function to build second-stage features
def build_stage2_features(user_id, candidates, s2_list, s3_list, s4_list):

    s2r = {m: r for r, m in enumerate(s2_list)}
    s3r = {m: r for r, m in enumerate(s3_list)}
    s4r = {m: r for r, m in enumerate(s4_list)}

    base = build_lgb_features(user_id, candidates)

    base['in_s2']       = base['movieId'].isin(s2r).astype('int8')
    base['in_s3']       = base['movieId'].isin(s3r).astype('int8')
    base['in_s4']       = base['movieId'].isin(s4r).astype('int8')
    base['rank_s2']     = base['movieId'].map(s2r).fillna(N_STAGE1).astype('int16')
    base['rank_s3']     = base['movieId'].map(s3r).fillna(N_STAGE1).astype('int16')
    base['rank_s4']     = base['movieId'].map(s4r).fillna(N_STAGE1).astype('int16')
    base['n_retrievers']= base[['in_s2','in_s3','in_s4']].sum(axis=1).astype('int8')
    base['s5_score']    = lgb_model.predict(base[FEATURE_COLS]).astype('float32')

    return base

# Create a check to see if there is user history
def is_cold(user_id):

    return len(user_liked.get(user_id, [])) < COLD_START_THRESHOLD

# Function to get stage 1 candidates
def stage1_candidates(user_id):

    seen = get_seen(user_id)
    pool = list(dict.fromkeys(s2_personal(user_id, K=N_STAGE1) + s3_item_cf( user_id, K=N_STAGE1) + s4_temporal(user_id, K = N_STAGE1)))

    return [m for m in pool if m not in seen]
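The list(dict.fromkeys(...)) idiom used here (and in s5_lgbm / s7_ensemble) merges the three retriever lists while deduplicating: Python dicts preserve insertion order, so each movie keeps the rank of its first appearance, unlike set(), which would discard ordering.

```python
# Order-preserving deduplication when merging ranked candidate lists.
s2l, s3l = [5, 3, 8], [3, 1, 5]
pool = list(dict.fromkeys(s2l + s3l))
print(pool)  # [5, 3, 8, 1] - first-seen order kept, duplicates dropped
```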

# Function to get ensemble recommendations
def s7_ensemble(user_id, K = 10, model = 'lgb', **_):

    if is_cold(user_id):

        return s1_popularity(user_id, K = K)

    s2l = s2_personal(user_id, K=N_STAGE1)
    s3l = s3_item_cf( user_id, K=N_STAGE1)
    s4l = s4_temporal(user_id, K=N_STAGE1)
    candidates = list(dict.fromkeys(s2l + s3l + s4l))

    if not candidates:

        return s1_popularity(user_id, K = K)

    feat = build_stage2_features(user_id, candidates, s2l, s3l, s4l)

    if model == 'lgb':

        feat['score'] = meta_lgb.predict(feat[STAGE2_FEATURE_COLS])

    else:

        feat['score'] = meta_xgb.predict_proba(feat[STAGE2_FEATURE_COLS])[:, 1]

    return feat.sort_values('score', ascending=False)['movieId'].tolist()[:K]

def s6_xgboost(user_id, K = 10, **_):

    return s7_ensemble(user_id, K = K, model = 'xgb')


# Load the data
data = {}

files = os.listdir('/content/drive/MyDrive/ml-32m/')
file_list = [f for f in files if f.endswith('.csv')]
print(f'Files in archive: {file_list}\n')

for fn in file_list:

    key = fn.replace('.csv', '')
    data[key] = pd.read_csv(f"/content/drive/MyDrive/ml-32m/{fn}")
    print(f'  ✓ {fn}: {data[key].shape[0]:,} rows × {data[key].shape[1]} cols')

print('\n✅  All datasets loaded')
Files in archive: ['ratings.csv', 'tags.csv', 'movies.csv', 'links.csv']

  ✓ ratings.csv: 32,000,204 rows × 4 cols
  ✓ tags.csv: 2,000,072 rows × 4 cols
  ✓ movies.csv: 87,585 rows × 3 cols
  ✓ links.csv: 87,585 rows × 3 cols

✅  All datasets loaded
# Downcast to save ~40% RAM
for col in ['userId', 'movieId']:

    data['ratings'][col] = data['ratings'][col].astype('int32')
    data['tags'][col]    = data['tags'][col].astype('int32')

data['movies']['movieId'] = data['movies']['movieId'].astype('int32')
data['ratings']['rating'] = data['ratings']['rating'].astype('float32')

# Extract the data into objects
ratings = data['ratings']
movies = data['movies']
tags = data['tags']
movies.head()
   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy
tags.head()
   userId  movieId          tag   timestamp
0      22    26479  Kevin Kline  1583038886
1      22    79592     misogyny  1581476297
2      22   247150   acrophobia  1622483469
3      34     2174        music  1249808064
4      34     2174        weird  1249808102
ratings.head()
   userId  movieId  rating  timestamp
0       1       17  4.0000  944249077
1       1       25  1.0000  944250228
2       1       29  2.0000  943230976
3       1       30  5.0000  944249077
4       1       32  5.0000  943228858
%%time

# Create a temporal training split for modelling
RATING_THRESHOLD = 4.0

# Cutoff users based on time interval, 90%
user_cutoff = (ratings.groupby('userId')['timestamp'].quantile(0.90).rename('cutoff').astype('int32'))
ratings = ratings.join(user_cutoff, on = 'userId')

# Split the data for training
train_ratings = ratings[ratings['timestamp'] <  ratings['cutoff']]
test_ratings  = ratings[ratings['timestamp'] >= ratings['cutoff']]

# Drop cutoff column immediately - saves ~100 MB across both halves
# (reassign rather than drop inplace: both frames are slices of `ratings`,
#  and an inplace drop would trigger pandas' SettingWithCopyWarning)
train_ratings = train_ratings.drop(columns=['cutoff'])
test_ratings  = test_ratings.drop(columns=['cutoff'])

del ratings, user_cutoff; gc.collect()

# Ground truth: movies rated >= threshold in test, not seen in training
train_seen = train_ratings.groupby('userId')['movieId'].apply(set)

# Function to extract the truth
def get_true_items(grp):

    uid   = grp.name
    liked = set(grp[grp['rating'] >= RATING_THRESHOLD]['movieId'])

    return liked - train_seen.get(uid, set())

# Apply func to get data
ground_truth = (test_ratings.groupby('userId').apply(get_true_items).rename('true_movies'))
ground_truth = ground_truth[ground_truth.map(len) > 0]

# Inspect
print(f'Train rows : {len(train_ratings):,}  | {train_ratings.memory_usage(deep = True).sum()/1e6:.0f} MB')
print(f'Test rows  : {len(test_ratings):,}   | {test_ratings.memory_usage(deep = True).sum()/1e6:.0f} MB')
print(f'Eval users : {len(ground_truth):,}')
Train rows : 28,583,599  | 800 MB
Test rows  : 3,416,605   | 96 MB
Eval users : 192,131
CPU times: user 1min 31s, sys: 3.75 s, total: 1min 35s
Wall time: 1min 36s
ground_truth.head()
userId
1    {1056, 1952, 2020, 645, 2313, 1228, 2125, 302,...
2                            {520, 616, 362, 783, 594}
3    {4995, 5380, 1957, 2150, 11, 3248, 1873, 2353,...
4                                               {2841}
5                            {480, 356, 454, 364, 110}
Name: true_movies, dtype: object


# Create a global popularity pool to use as a cold-start ratings guide
pop_stats = (train_ratings[train_ratings['rating'] >= RATING_THRESHOLD]
             .groupby('movieId')
             .agg(pos_count=('rating', 'count'), mean_rating=('rating', 'mean'))
             .sort_values(['pos_count', 'mean_rating'], ascending=False))
POPULARITY_POOL = pop_stats.index.tolist()
print(f'Popularity pool: {len(POPULARITY_POOL):,} movies')
pop_stats.head(10).join(movies.set_index('movieId')['title'])
Popularity pool: 50,644 movies
         pos_count  mean_rating                                              title
movieId
318          84861       4.6437                   Shawshank Redemption, The (1994)
296          73438       4.6011                                Pulp Fiction (1994)
2571         69017       4.5596                                 Matrix, The (1999)
356          68815       4.5235                                Forrest Gump (1994)
593          67138       4.5071                   Silence of the Lambs, The (1991)
260          60648       4.5635          Star Wars: Episode IV - A New Hope (1977)
2959         58522       4.5773                                  Fight Club (1999)
527          55945       4.5847                            Schindler's List (1993)
1196         52583       4.5564  Star Wars: Episode V - The Empire Strikes Back...
4993         51969       4.5687  Lord of the Rings: The Fellowship of the Ring,...
# Compute the global mean rating and rebuild the popularity pool
GLOBAL_MEAN = float(train_ratings['rating'].mean())

pop_stats = (train_ratings[train_ratings['rating'] >= RATING_THRESHOLD]
             .groupby('movieId')
             .agg(pos_count=('rating', 'count'), mean_rating=('rating', 'mean'))
             .sort_values(['pos_count', 'mean_rating'], ascending=False))
POPULARITY_POOL = pop_stats.index.tolist()

print(f'Global mean       : {GLOBAL_MEAN:.4f}')
print(f'Popularity pool   : {len(POPULARITY_POOL):,} movies')
pop_stats.head(8).join(movies.set_index('movieId')['title'])
Global mean       : 3.5460
Popularity pool   : 50,644 movies
         pos_count  mean_rating                                      title
movieId
318          84861       4.6437           Shawshank Redemption, The (1994)
296          73438       4.6011                        Pulp Fiction (1994)
2571         69017       4.5596                         Matrix, The (1999)
356          68815       4.5235                        Forrest Gump (1994)
593          67138       4.5071           Silence of the Lambs, The (1991)
260          60648       4.5635  Star Wars: Episode IV - A New Hope (1977)
2959         58522       4.5773                          Fight Club (1999)
527          55945       4.5847                    Schindler's List (1993)

Back to top.


2. Exploratory data analysis

📖 Understanding the data before modelling

This section answers eight diagnostic questions that directly motivate modelling decisions later:

Question                                               Modelling implication
How are ratings distributed across the 0.5-5.0 scale?  Validates our 4.0 'liked' threshold
How many ratings does a typical user have?             Informs cold-start threshold
How many ratings does a typical movie have?            Informs Bayesian smoothing prior
Which genres dominate the catalogue?                   Explains genre-affinity feature range
How do mean ratings differ across genres?              Reveals genre-level rating bias
How has rating volume changed over time?               Motivates temporal recency weighting
How concentrated is movie popularity?                  Reveals long-tail severity
What do user rating habits look like?                  Identifies lenient vs strict raters
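On the Bayesian smoothing point: sparse per-movie counts make raw mean ratings unreliable. A common damped-mean form is sketched below; the actual m_bayesian feature and its prior strength are defined later, outside this excerpt, so the constant m=50 here is an assumption for illustration only.

```python
# Damped (Bayesian) mean: shrink each movie's average rating toward the
# global mean. Prior strength m=50 is illustrative, not the value used
# by the m_bayesian feature built later.
def bayesian_score(sum_ratings, n_ratings, global_mean, m=50):
    return (sum_ratings + m * global_mean) / (n_ratings + m)

# 3 ratings averaging 5.0 -> pulled close to the global mean (~3.63)
print(round(bayesian_score(15.0, 3, 3.55), 4))
# 5000 ratings averaging 4.6 -> essentially unchanged (~4.59)
print(round(bayesian_score(5000 * 4.6, 5000, 3.55), 4))
```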
# Combine train + test for full-dataset EDA stats
n_tr, n_te    = len(train_ratings), len(test_ratings)
total_ratings = n_tr + n_te

# Union of user/movie IDs across splits without creating a combined DataFrame
users_tr  = set(train_ratings['userId'].unique())
users_te  = set(test_ratings['userId'].unique())
movies_tr = set(train_ratings['movieId'].unique())
movies_te = set(test_ratings['movieId'].unique())

n_users_total  = len(users_tr | users_te)
n_movies_rated = len(movies_tr | movies_te)
del users_tr, users_te, movies_tr, movies_te; gc.collect()

global_sum = float(train_ratings['rating'].sum()) + float(test_ratings['rating'].sum())
global_mean_all = global_sum / total_ratings
ts_min = min(int(train_ratings['timestamp'].min()), int(test_ratings['timestamp'].min()))
ts_max = max(int(train_ratings['timestamp'].max()), int(test_ratings['timestamp'].max()))
r_min  = min(float(train_ratings['rating'].min()), float(test_ratings['rating'].min()))
r_max  = max(float(train_ratings['rating'].max()), float(test_ratings['rating'].max()))
sparsity = 1 - total_ratings / (n_users_total * n_movies_rated)

print('=' * 55)
print('  MOVIELENS 32M  -  DATASET SNAPSHOT')
print('=' * 55)
print(f'  Total ratings      : {total_ratings:>12,}')
print(f'  Unique users       : {n_users_total:>12,}')
print(f'  Unique movies rated: {n_movies_rated:>12,}')
print(f'  Movies in catalogue: {len(movies):>12,}')
print(f'  Date range         : {pd.to_datetime(ts_min,unit="s").date()} -> {pd.to_datetime(ts_max,unit="s").date()}')
print(f'  Rating scale       : {r_min} - {r_max} stars')
print(f'  Mean rating        : {global_mean_all:.3f}')
print(f'  Matrix sparsity    : {sparsity:.4%}')
print('=' * 55)
=======================================================
  MOVIELENS 32M  -  DATASET SNAPSHOT
=======================================================
  Total ratings      :   32,000,204
  Unique users       :      200,948
  Unique movies rated:       84,432
  Movies in catalogue:       87,585
  Date range         : 1995-01-09 -> 2023-10-13
  Rating scale       : 0.5 - 5.0 stars
  Mean rating        : 3.540
  Matrix sparsity    : 99.8114%
=======================================================
# Combine data for vis
all_ratings_eda = pd.concat([train_ratings.assign(split='train'),test_ratings.assign(split='test')], ignore_index = True)
all_ratings_eda['date'] = pd.to_datetime(all_ratings_eda['timestamp'], unit='s')
all_ratings_eda['year'] = all_ratings_eda['date'].dt.year
all_ratings_eda['year_month'] = all_ratings_eda['date'].dt.to_period('M')
print(f'all_ratings_eda: {all_ratings_eda.shape}  |  '
      f'{all_ratings_eda.memory_usage(deep=True).sum()/1e6:.0f} MB')
all_ratings_eda: (32000204, 8)  |  3005 MB
# Ratings distribution
fig, axes = plt.subplots(1, 2, figsize=(20, 5))

# Left: count per star value
ax = axes[0]
rating_counts = all_ratings_eda['rating'].value_counts().sort_index()
bars = ax.bar(rating_counts.index, rating_counts.values,
              width=0.35, color=sns.color_palette('husl', len(rating_counts)))
ax.axvline(4.0, color='crimson', lw=2, ls='--', label='Liked threshold (4.0)')
ax.set_xlabel('Star Rating')
ax.set_ylabel('Number of Ratings')
ax.set_title('Rating Distribution (raw counts)')
ax.legend()
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x/1e6:.1f}M'))
for bar in bars:
    h = bar.get_height()
    ax.text(bar.get_x()+bar.get_width()/2, h*1.01,
            f'{h/1e6:.1f}M', ha='center', va='bottom', fontsize=7)

# Right: positive label rate vs threshold
ax2 = axes[1]
thresholds = np.arange(0.5, 5.5, 0.5)
pct_liked  = [(all_ratings_eda['rating'] >= t).mean() * 100 for t in thresholds]
ax2.plot(thresholds, pct_liked, marker='o', lw=2.5, color='steelblue')
ax2.axvline(4.0, color='crimson', lw=2, ls='--', label='Our threshold (4.0)')
chosen_pct = pct_liked[int((4.0-0.5)/0.5)]
ax2.axhline(chosen_pct, color='crimson', lw=1, ls=':')
ax2.annotate(f'{chosen_pct:.1f}% liked at 4.0',
             xy = (4.0, chosen_pct),
             xytext = (2.5, chosen_pct-16),
             arrowprops = dict(arrowstyle='->', color = 'crimson'),
             color = 'crimson',
             fontsize = 10)

ax2.set_xlabel('Threshold')
ax2.set_ylabel('% of Ratings Classified as Liked')
ax2.set_title('Positive Label Rate vs Threshold')
ax2.legend()
ax2.set_ylim(0, 105)

plt.suptitle('Rating Scale Analysis', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('eda_01_rating_distribution.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Ratings >= 4.0 stars: {(all_ratings_eda["rating"]>=4.0).mean():.1%} of all ratings')

[Figure: Rating Scale Analysis]

Ratings >= 4.0 stars: 49.8% of all ratings
# User activity distribution
user_rating_counts = all_ratings_eda.groupby('userId')['rating'].count()

fig, axes = plt.subplots(1, 3, figsize=(20, 4))

# Histogram (log y)
ax = axes[0]
ax.hist(user_rating_counts, bins=80, color='steelblue', edgecolor='white', lw=0.3)
ax.set_yscale('log')
ax.set_xlabel('Ratings per User')
ax.set_ylabel('Number of Users (log)')
ax.set_title('User Activity Distribution')

for p in [10, 50, 90]:

    v = np.percentile(user_rating_counts, p)
    ax.axvline(v, ls='--', lw=1.2, label=f'p{p}={v:.0f}')

ax.legend(fontsize=8)

# CCDF - power-law tail
ax2 = axes[1]
sorted_counts = np.sort(user_rating_counts)[::-1]
rank = np.arange(1, len(sorted_counts)+1)
ax2.loglog(sorted_counts, rank / len(sorted_counts), lw=2, color='darkorange')
ax2.set_xlabel('Ratings per User (log)')
ax2.set_ylabel('P(X >= x)  [log]')
ax2.set_title('CCDF - Heavy Tail in User Activity')
ax2.grid(True, which='both', alpha=0.3)

# Activity segments
ax3 = axes[2]
bins_   = [0, 20, 50, 100, 250, 500, int(user_rating_counts.max())+1]
labels_ = ['<20', '20-50', '51-100', '101-250', '251-500', '500+']
seg    = pd.cut(user_rating_counts, bins=bins_, labels=labels_)
seg_ct = seg.value_counts().sort_index()
bars3  = ax3.bar(range(len(seg_ct)), seg_ct.values,
                 color=sns.color_palette('husl', len(seg_ct)),
                 tick_label=labels_)
ax3.set_xlabel('Ratings per User')
ax3.set_ylabel('Number of Users')
ax3.set_title('User Activity Segments')

for i, v in enumerate(seg_ct.values):

    ax3.text(i, v + 200, f'{v:,}', ha='center', va='bottom', fontsize=8)

plt.suptitle('User Activity Analysis', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('eda_02_user_activity.png', dpi=150, bbox_inches='tight')
plt.show()

print(f'Median ratings/user  : {user_rating_counts.median():.0f}')
print(f'Mean   ratings/user  : {user_rating_counts.mean():.0f}')
top1 = user_rating_counts.nlargest(int(len(user_rating_counts)*0.01)).sum()
print(f'Top 1% of users account for {top1/user_rating_counts.sum():.1%} of all ratings')

[Figure: User Activity Analysis]

Median ratings/user  : 73
Mean   ratings/user  : 159
Top 1% of users account for 12.6% of all ratings
# Investigate movie popularities
movie_rating_counts = (all_ratings_eda.groupby('movieId')['rating'].count().sort_values(ascending=False))
movie_pop_named = (movie_rating_counts.to_frame('n_ratings').join(movies.set_index('movieId')['title']))

fig, axes = plt.subplots(1, 2, figsize=(20, 5))

# Log-log rank curve
ax = axes[0]
ranks = np.arange(1, len(movie_rating_counts)+1)
ax.loglog(ranks, movie_rating_counts.values, lw=2, color='steelblue')
ax.fill_between(ranks, movie_rating_counts.values, alpha=0.15, color='steelblue')
top1pct = int(len(movie_rating_counts)*0.01)
ax.axvline(top1pct, color='crimson', ls='--', lw=1.5,
           label=f'Top 1% = {top1pct:,} movies')
ax.set_xlabel('Movie Rank (log)')
ax.set_ylabel('Number of Ratings (log)')
ax.set_title('The Long Tail: Movie Popularity')
ax.legend(fontsize=9)

# Rating concentration curve
ax2 = axes[1]
cum_pct = movie_rating_counts.cumsum() / movie_rating_counts.sum() * 100
ax2.plot(range(len(cum_pct)), cum_pct.values, lw=2, color='darkorange')
for target in [50, 80, 95]:
    idx_pos = int((cum_pct >= target).argmax())
    ax2.axhline(target, ls=':', color='grey', lw=1)
    ax2.axvline(idx_pos, ls=':', color='grey', lw=1)
    ax2.text(idx_pos + 500, target - 4, f'{target}%: {idx_pos:,} movies',
             fontsize=8, color='dimgrey')
ax2.set_xlabel('Number of Movies (ranked by popularity)')
ax2.set_ylabel('Cumulative % of All Ratings')
ax2.set_title('Rating Concentration Curve')

plt.suptitle('Movie Popularity Analysis', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('eda_03_movie_longtail.png', dpi=150, bbox_inches='tight')
plt.show()

print('Top 10 most-rated movies:')
print(movie_pop_named.head(10).to_string())

[Figure: Movie Popularity Analysis]

Top 10 most-rated movies:
         n_ratings                                                      title
movieId                                                                      
318         102929                           Shawshank Redemption, The (1994)
356         100296                                        Forrest Gump (1994)
296          98409                                        Pulp Fiction (1994)
2571         93808                                         Matrix, The (1999)
593          90330                           Silence of the Lambs, The (1991)
260          85010                  Star Wars: Episode IV - A New Hope (1977)
2959         77332                                          Fight Club (1999)
480          75233                                       Jurassic Park (1993)
527          73849                                    Schindler's List (1993)
4993         73122  Lord of the Rings: The Fellowship of the Ring, The (2001)
%%time

# Analyze the genres
movies_exploded = (movies.assign(genre=movies['genres'].str.split('|')).explode('genre').query("genre != '(no genres listed)'"))
genre_ratings = (all_ratings_eda.merge(movies_exploded[['movieId','genre']], on='movieId', how='left').dropna(subset=['genre']))

genre_stats = genre_ratings.groupby('genre').agg(n_ratings     = ('rating','count'),
                                                 mean_rating   = ('rating','mean'),
                                                 median_rating = ('rating','median'),
                                                 n_movies      = ('movieId','nunique'),).sort_values('n_ratings', ascending = False)
CPU times: user 44.7 s, sys: 9.67 s, total: 54.4 s
Wall time: 54.4 s
fig, axes = plt.subplots(1, 2, figsize=(20, 6))

# Volume per genre
ax = axes[0]
palette = sns.color_palette('husl', len(genre_stats))
ax.barh(genre_stats.index, genre_stats['n_ratings'], color=palette[::-1])
ax.invert_yaxis()
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x/1e6:.0f}M'))
ax.set_xlabel('Total Ratings')
ax.set_title('Rating Volume by Genre')

# Mean rating per genre (dot plot)
ax2 = axes[1]
gs_sorted = genre_stats.sort_values('mean_rating')
g_mean = all_ratings_eda['rating'].mean()
colors = ['#2ecc71' if v >= g_mean + 0.15
          else '#e74c3c' if v < g_mean - 0.15
          else '#3498db' for v in gs_sorted['mean_rating']]
ax2.scatter(gs_sorted['mean_rating'], range(len(gs_sorted)),
            s=gs_sorted['n_movies']/15, c=colors, alpha=0.85, zorder=3)
ax2.set_yticks(range(len(gs_sorted)))
ax2.set_yticklabels(gs_sorted.index)
ax2.axvline(g_mean, color='black', ls='--', lw=1.2,
            label=f'Global mean ({g_mean:.2f})')
ax2.set_xlabel('Mean Star Rating')
ax2.set_title('Mean Rating by Genre\n(bubble size = # movies in genre)')
ax2.legend(fontsize=9)

plt.suptitle('Genre Analysis', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('eda_04_genre_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
print(genre_stats[['n_movies','n_ratings','mean_rating','median_rating']].to_string())

[Figure: Genre Analysis]

             n_movies  n_ratings  mean_rating  median_rating
genre                                                       
Drama           33152   13973271       3.6825         4.0000
Comedy          22448   11206926       3.4324         3.5000
Action           9296    9665213       3.4764         3.5000
Thriller        11555    8679464       3.5317         3.5000
Adventure        5156    7590522       3.5234         3.5000
Sci-Fi           4797    5717337       3.4917         3.5000
Romance         10048    5524615       3.5450         3.5000
Crime            6704    5373051       3.6918         4.0000
Fantasy          3784    3702759       3.5122         3.5000
Children         4447    2731841       3.4392         3.5000
Mystery          3894    2615322       3.6731         4.0000
Horror           8468    2492315       3.3072         3.5000
Animation        4586    2214562       3.6153         4.0000
War              2225    1594110       3.7917         4.0000
IMAX              195    1494179       3.5933         4.0000
Musical          1033    1159516       3.5543         4.0000
Western          1489     596654       3.6002         4.0000
Documentary      9103     427353       3.6912         4.0000
Film-Noir         350     304710       3.9158         4.0000
# Genre stats
def movie_level_agg(df):

    return df.groupby('movieId').agg(n = ('rating','count'), s=('rating','sum'))

ml_tr = movie_level_agg(train_ratings)
ml_te = movie_level_agg(test_ratings)
ml    = ml_tr.add(ml_te, fill_value=0).fillna(0)
ml['mean_rating'] = ml['s'] / ml['n'].clip(lower=1)
del ml_tr, ml_te; gc.collect()

# Explode the movie catalogue by genre
movies_exp = (movies.assign(genre = movies['genres'].str.split('|'))
                    .explode('genre')
                    .query("genre != '(no genres listed)'")[['movieId','genre']])
genre_movie = movies_exp.merge(ml, on='movieId', how='left').fillna(0)
del ml; gc.collect()

# Genre stats
genre_stats = genre_movie.groupby('genre').agg(n_ratings   = ('n', 'sum'),
                                               sum_rating  = ('s', 'sum'),
                                               n_movies    = ('movieId','nunique'),)

genre_stats['mean_rating']   = genre_stats['sum_rating'] / genre_stats['n_ratings'].clip(lower=1)
genre_stats['median_rating'] = (genre_movie.groupby('genre')['mean_rating'].median())

genre_stats = genre_stats.drop(columns='sum_rating').sort_values('n_ratings', ascending=False)
del genre_movie; gc.collect()
# Generate co-occurrence rates for the entire catalogue
genre_list_per_movie = (
        movies.assign(genre=movies['genres'].str.split('|'))
        .set_index('movieId')['genre']
    )
top_genres = genre_stats.index[:15].tolist()
g2i = {g: i for i, g in enumerate(top_genres)}

cooc = np.zeros((len(top_genres), len(top_genres)), dtype=np.int32)

for glist in genre_list_per_movie:

    gs = [g for g in glist if g in g2i]

    for g in gs:

        cooc[g2i[g], g2i[g]] += 1

    for a, b in combinations(gs, 2):

        cooc[g2i[a], g2i[b]] += 1
        cooc[g2i[b], g2i[a]] += 1

cooc_df = pd.DataFrame(cooc, index=top_genres, columns=top_genres)
diag = np.diag(cooc).astype(float)
jaccard = cooc / (diag[:, None] + diag[None, :] - cooc + 1e-9)
jaccard_df = pd.DataFrame(jaccard, index=top_genres, columns=top_genres)

fig, axes = plt.subplots(1, 2, figsize = (20, 6))

sns.heatmap(cooc_df/1000, ax=axes[0], cmap='YlOrRd', annot = False, linewidths = 0.3, cbar_kws = {'label': 'Co-occurrences (thousands)'})

axes[0].set_title('Genre Co-occurrence Count')
mask = np.eye(len(top_genres), dtype=bool)

sns.heatmap(jaccard_df, ax=axes[1], cmap='Blues', mask=mask,
                annot=True, fmt='.2f', annot_kws={'size':7},
                linewidths=0.3, vmin=0, vmax=0.5)
axes[1].set_title('Genre Co-occurrence (Jaccard; 1.0 = always appear together)')

plt.suptitle('Genre Co-occurrence in the Catalogue', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('eda_05_genre_cooccurrence.png', dpi=150, bbox_inches='tight')
plt.show()

[Figure: Genre Co-occurrence in the Catalogue (eda_05_genre_cooccurrence.png)]
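The Jaccard normalisation in the cell above divides each co-occurrence count by the size of the union of the two genres' movie sets. A quick toy check with made-up counts:

```python
import numpy as np

# Two genres tagging 100 and 60 movies, sharing 30: J = 30 / (100 + 60 - 30)
diag = np.array([100.0, 60.0])       # per-genre movie counts (matrix diagonal)
cooc = np.array([[100.0, 30.0],
                 [ 30.0, 60.0]])     # symmetric co-occurrence counts
jaccard = cooc / (diag[:, None] + diag[None, :] - cooc + 1e-9)
# off-diagonal entries are 30/130 = 0.2308; the diagonal is 1.0 by construction
```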

# Temporal trends
SECS_PER_YEAR = 365.2425 * 86400

def year_agg(df):

    year = ((df['timestamp'].astype('float32') / SECS_PER_YEAR) + 1970).astype('int16')
    tmp  = pd.DataFrame({'year': year, 'r': df['rating'], 'u': df['userId']})
    out  = tmp.groupby('year').agg(n=('r','count'), s=('r','sum'), nu=('u','nunique'))
    del tmp, year
    return out

yr_tr = year_agg(train_ratings)
yr_te = year_agg(test_ratings)
yearly = pd.DataFrame({
    'n_ratings'  : yr_tr['n'].add(yr_te['n'],  fill_value=0).astype(int),
    'sum_rating' : yr_tr['s'].add(yr_te['s'],  fill_value=0),
    'n_users'    : yr_tr['nu'].add(yr_te['nu'], fill_value=0).astype(int),  # approx
}).reset_index()
yearly['mean_rating'] = yearly['sum_rating'] / yearly['n_ratings'].clip(lower=1)
del yr_tr, yr_te; gc.collect()

# Monthly (only recent years, parse subset to reduce datetime overhead)
RECENT_TS = int((2013 - 1970) * SECS_PER_YEAR)
def monthly_recent(df, cutoff_ts):
    sub = df[df['timestamp'] >= cutoff_ts][['timestamp','rating']].copy()
    sub['period'] = pd.to_datetime(sub['timestamp'], unit='s').dt.to_period('M')
    out = sub.groupby('period')['rating'].count()
    del sub; return out

mo_tr = monthly_recent(train_ratings, RECENT_TS)
mo_te = monthly_recent(test_ratings,  RECENT_TS)
monthly = mo_tr.add(mo_te, fill_value=0).reset_index()
monthly.columns = ['period','n_ratings']
monthly['dt'] = monthly['period'].dt.to_timestamp()
del mo_tr, mo_te; gc.collect()

fig = plt.figure(figsize = (24, 9))
gs2 = GridSpec(2, 2, figure=fig, hspace=0.4, wspace=0.32)

ax1 = fig.add_subplot(gs2[0, :])
bars = ax1.bar(yearly['year'], yearly['n_ratings']/1e6,
               color=sns.color_palette('husl', len(yearly)))
ax1.set_xlabel('Year'); ax1.set_ylabel('Ratings (millions)')
ax1.set_title('Annual Rating Volume (1995-2023)')
ax1.bar_label(bars, fmt='{:.1f}M', padding=2, fontsize=7)

ax2 = fig.add_subplot(gs2[1, 0])
ax2.plot(monthly['dt'], monthly['n_ratings']/1e3, lw=1.5, color='steelblue')
ax2.fill_between(monthly['dt'], monthly['n_ratings']/1e3, alpha=0.15)
ax2.set_xlabel('Month'); ax2.set_ylabel('Ratings (thousands)')
ax2.set_title('Monthly Volume (2013-2023)')

ax3 = fig.add_subplot(gs2[1, 1])
ax3.plot(yearly['year'], yearly['mean_rating'], marker='o', lw=2, color='darkorange')
ax3.axhline(global_mean_all, ls='--', color='grey', lw=1.2,
            label=f'Overall mean ({global_mean_all:.2f})')
ax3.set_ylim(3.2, 4.1); ax3.set_xlabel('Year'); ax3.set_ylabel('Mean Rating')
ax3.set_title('Mean Rating Trend Over Time'); ax3.legend(fontsize=9)

plt.suptitle('Temporal Trends in Rating Activity', fontsize=14, y=1.01)
plt.savefig('eda_06_temporal_trends.png', dpi=150, bbox_inches='tight')
plt.show()

[Figure: Temporal Trends in Rating Activity (eda_06_temporal_trends.png)]

# Perform cold start analysis
lk_tr = train_ratings[train_ratings['rating'] >= RATING_THRESHOLD].groupby('userId')['rating'].count()
lk_te = test_ratings[ test_ratings['rating']  >= RATING_THRESHOLD].groupby('userId')['rating'].count()
user_liked_counts_all = lk_tr.add(lk_te, fill_value=0).astype(int).rename('liked_count')
del lk_tr, lk_te; gc.collect()

COLD_START_THRESHOLD = 5

fig, axes = plt.subplots(1, 2, figsize=(24, 5))

ax = axes[0]
ax.hist(user_liked_counts_all.clip(upper=300), bins=60,
        color='mediumseagreen', edgecolor='white', lw=0.3)
for thresh in [5, 10, 20]:
    ax.axvline(thresh, ls='--', lw=1.5,
               label=f'thresh={thresh}  ({(user_liked_counts_all<thresh).mean():.1%} cold)')
ax.set_xlabel('Liked Ratings per User (clipped at 300)')
ax.set_ylabel('Users'); ax.set_title('Liked-Rating Count Distribution')
ax.legend(fontsize=8)

ax2 = axes[1]
thr_range = np.arange(1, 101)
pct_warm  = [(user_liked_counts_all >= t).mean()*100 for t in thr_range]
ax2.plot(thr_range, pct_warm, lw=2.5, color='mediumseagreen')
ax2.axvline(COLD_START_THRESHOLD, color='crimson', ls='--', lw=2,
            label=f'Chosen = {COLD_START_THRESHOLD}')
chosen_warm = pct_warm[COLD_START_THRESHOLD-1]
ax2.axhline(chosen_warm, color='crimson', ls=':', lw=1)
ax2.annotate(f'{chosen_warm:.1f}% warm',
             xy=(COLD_START_THRESHOLD, chosen_warm),
             xytext=(COLD_START_THRESHOLD+6, chosen_warm-8),
             arrowprops=dict(arrowstyle='->', color='crimson'),
             color='crimson', fontsize=10)
ax2.set_xlabel('Minimum Liked Ratings')
ax2.set_ylabel('% Users Classified as Warm')
ax2.set_title('Warm User Rate vs Cold-Start Threshold')
ax2.legend(fontsize=9)

plt.suptitle('Cold-Start Analysis', fontsize = 14, y = 1.02)
plt.tight_layout()
plt.savefig('eda_08_cold_start.png', dpi = 150, bbox_inches = 'tight')
plt.show()
del user_liked_counts_all; gc.collect()

[Figure: Cold-Start Analysis (eda_08_cold_start.png)]

Back to top


3. Feature engineering

We build three reusable feature tables:

  • user_stats β€” per-user: rating count, mean rating, activity span, rating frequency
  • movie_feat β€” per-movie: global count, mean rating, Bayesian-smoothed score, genre one-hot
  • user_genre_affinity β€” per-user per-genre: average rating given to each genre

Here we lean on explicit rating magnitudes and genre vectors as the primary content signal.

Bayesian-smoothed score: shrinks small-count movies toward the global mean.

\[\text{bayes\_score}(m) = \frac{N_m \cdot \bar{r}_m + C \cdot \bar{r}_{\text{global}}}{N_m + C}\]

where $C = 50$ is the prior strength (equivalent rating count).
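A quick numeric check of the shrinkage (illustrative values; the notebook's actual global mean of ~3.546 is computed in the cell below):

```python
# Sketch of the Bayesian shrinkage formula above, with toy inputs
def bayes_score(n, mean, global_mean=3.546, C=50):
    """Shrink a movie's mean rating toward the global mean;
    C acts like C pseudo-ratings pinned at the global mean."""
    return (n * mean + C * global_mean) / (n + C)

# A niche movie with 5 ratings averaging 4.6 is pulled hard toward 3.546...
small = bayes_score(5, 4.6)       # (5*4.6 + 50*3.546) / 55, roughly 3.64
# ...while a 10,000-rating movie with the same mean barely moves.
large = bayes_score(10_000, 4.6)  # roughly 4.59
```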

# Compile user stats
liked_mask = train_ratings['rating'] >= RATING_THRESHOLD
user_liked = (train_ratings[liked_mask].groupby('userId')['movieId'].apply(list))
user_stats = train_ratings.groupby('userId').agg(
    rating_count = ('rating','count'),
    mean_rating  = ('rating','mean'),
    rating_std   = ('rating','std'),
    last_ts      = ('timestamp','max'),
    first_ts     = ('timestamp','min'),)

user_stats['active_days'] = ((user_stats['last_ts'] - user_stats['first_ts']) / 86_400).clip(lower=1).astype('float32')
user_stats['rating_freq'] = (user_stats['rating_count'] / user_stats['active_days']).astype('float32')
user_stats['mean_rating'] = user_stats['mean_rating'].astype('float32')

# Users with a single rating have undefined std; fill with 0
user_stats['rating_std']  = user_stats['rating_std'].fillna(0).astype('float32')
print('user_liked:', len(user_liked), 'users')
print('user_stats:', user_stats.shape)
user_liked: 200561 users
user_stats: (200880, 7)
# Extract genres and perform movie one hot encoding
genres_exploded = (movies.assign(genre = movies['genres'].str.split('|'))
                         .explode('genre')
                         .query("genre != '(no genres listed)'"))

ALL_GENRES = sorted(genres_exploded['genre'].unique())
print(f'Genres ({len(ALL_GENRES)}):', ALL_GENRES)

movie_genre = pd.get_dummies(genres_exploded.set_index('movieId')['genre'], prefix='g')
movie_genre = movie_genre.groupby(level=0).max()
genre_cols  = movie_genre.columns.tolist()

print('movie_genre shape:', movie_genre.shape)
Genres (19): ['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movie_genre shape: (80505, 19)
# Generate Bayesian scores and movie statistics
GLOBAL_MEAN = train_ratings['rating'].mean()
C = 50

movie_stats = (
    train_ratings
    .groupby('movieId')
    .agg(global_count    = ('rating','count'),
         global_mean     = ('rating','mean'),
         rating_variance = ('rating','var'))
)
movie_stats['bayesian_score'] = (
    (movie_stats['global_count'] * movie_stats['global_mean'] + C * GLOBAL_MEAN)
    / (movie_stats['global_count'] + C)
)
# log1p compresses the very wide count range so the model sees a
# smooth popularity signal rather than a skewed raw count
movie_stats['log_count'] = np.log1p(movie_stats['global_count']).astype('float32')
# Single-rating movies have undefined variance; treat as 0
movie_stats['rating_variance'] = movie_stats['rating_variance'].fillna(0).astype('float32')

# Tag count from tags.csv (loaded in Section 1) — log1p scaled.
# Tags are user-applied descriptors; a high tag count signals that a
# movie has attracted community engagement beyond a simple star rating.
movie_tag_count = tags.groupby('movieId')['tag'].count().rename('tag_count')
movie_stats = movie_stats.join(movie_tag_count, how='left')
movie_stats['tag_count'] = np.log1p(movie_stats['tag_count'].fillna(0)).astype('float32')
del movie_tag_count

movie_feat = movie_stats.join(movie_genre, how='left').fillna(0)
print(f'movie_feat shape: {movie_feat.shape}   global mean: {GLOBAL_MEAN:.4f}')
movie_feat[['global_count','log_count','global_mean','bayesian_score','rating_variance','tag_count']].describe()

movie_feat shape: (77369, 25)   global mean: 3.5460
       global_count  log_count  global_mean  bayesian_score  rating_variance  tag_count
count    77369.0000 77369.0000   77369.0000      77369.0000       77369.0000 77369.0000
mean       369.4451     2.4734       3.0113          3.4606           0.8726     1.3936
std       2497.7195     2.0230       0.8208          0.2017           1.0182     1.5596
min          1.0000     0.6931       0.5000          1.2788           0.0000     0.0000
25%          2.0000     1.0986       2.5462          3.4396           0.0000     0.0000
50%          5.0000     1.7918       3.0789          3.5132           0.7455     1.0986
75%         25.0000     3.2581       3.5000          3.5451           1.1864     2.1972
max      98368.0000    11.4965       5.0000          4.4373          10.1250     8.8096
# User genre affinity matrix
liked_fe = (train_ratings[train_ratings['rating'] >= RATING_THRESHOLD][['userId','movieId','rating']])
uid_uniq = liked_fe['userId'].unique()
mid_uniq = liked_fe['movieId'].unique()

u2i = {u: i for i, u in enumerate(uid_uniq)}
m2i = {m: i for i, m in enumerate(mid_uniq)}

rows_fe = liked_fe['userId'].map(u2i).values
cols_fe = liked_fe['movieId'].map(m2i).values

del liked_fe; gc.collect()

# Binary matrix: 1 where user liked a movie, 0 elsewhere.
# Using rating values in the numerator (previous approach) inflates genres
# where the user gave 5-star ratings vs 4-star ratings for the same number
# of films. Count-based gives the clean interpretation: fraction of a
# user's liked movies that belong to each genre.
R_bin = csr_matrix(
    (np.ones(len(rows_fe), dtype='float32'), (rows_fe, cols_fe)),
    shape=(len(uid_uniq), len(mid_uniq)),
)
del rows_fe, cols_fe; gc.collect()

# Genre matrix aligned to the liked-movie order (float32)
G = movie_genre.reindex(mid_uniq).fillna(0).values.astype('float32')

# Sparse-binary x dense multiply -> (n_users x n_genres) count matrix
uga_counts   = R_bin @ G
liked_counts = np.asarray(R_bin.sum(axis=1)).flatten()
del R_bin, G; gc.collect()

user_genre_affinity = pd.DataFrame(
    uga_counts / liked_counts[:, None].clip(min=1),
    index=uid_uniq,
    columns=genre_cols,
    dtype='float32',
)
del uga_counts, liked_counts; gc.collect()

print(f'user_genre_affinity: {user_genre_affinity.shape}  |  '
      f'{user_genre_affinity.memory_usage(deep=True).sum()/1e6:.0f} MB')

user_genre_affinity: (200561, 19)  |  16 MB
user_genre_affinity.head()
   g_Action  g_Adventure  g_Animation  g_Children  g_Comedy  g_Crime  g_Documentary  g_Drama  g_Fantasy  g_Film-Noir  g_Horror  g_IMAX  g_Musical  g_Mystery  g_Romance  g_Sci-Fi  g_Thriller   g_War  g_Western
1    0.1972       0.1549       0.0000      0.0141    0.3662   0.1268         0.0000   0.7183     0.0282       0.0141    0.0141  0.0000     0.0000     0.0845     0.2394    0.1549      0.1268  0.1690     0.0141
2    0.1622       0.1081       0.1351      0.1622    0.4865   0.0811         0.0000   0.4865     0.0811       0.0000    0.0000  0.0541     0.1351     0.0270     0.5135    0.0000      0.2432  0.0270     0.0000
3    0.3699       0.4521       0.1233      0.1507    0.2877   0.0959         0.0000   0.4795     0.0822       0.0000    0.0137  0.0411     0.0959     0.0137     0.1918    0.2055      0.2055  0.1233     0.0137
4    0.2000       0.4000       0.0000      0.0000    0.6000   0.0000         0.0000   0.2000     0.2000       0.0000    0.0000  0.0000     0.0000     0.0000     0.0000    0.2000      0.0000  0.2000     0.0000
5    0.7500       0.3750       0.0000      0.0000    0.1250   0.3750         0.0000   0.2500     0.0000       0.0000    0.0000  0.0000     0.0000     0.0000     0.1250    0.0000      1.0000  0.1250     0.0000
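As the code comment above explains, the affinity is count-based: each entry is the fraction of a user's liked movies that carry that genre. A minimal dense sketch with toy data (one user, four liked movies, two genres):

```python
import numpy as np

# 1 user who liked 4 movies; genre columns are [Drama, Comedy]
liked = np.array([[1., 1., 1., 1.]], dtype='float32')   # binary liked indicator
G = np.array([[1., 0.],    # movie 0: Drama
              [1., 0.],    # movie 1: Drama
              [1., 1.],    # movie 2: Drama|Comedy
              [0., 1.]],   # movie 3: Comedy
             dtype='float32')

counts   = liked @ G                                   # per-genre counts: [[3., 2.]]
affinity = counts / liked.sum(axis=1, keepdims=True)   # fractions: [[0.75, 0.5]]
```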
# Split data for training / validation
all_users = ground_truth.index.values
gss = GroupShuffleSplit(n_splits = 1, test_size = 0.15, random_state = 100)
train_idx, eval_idx = next(gss.split(all_users, groups=all_users))

train_fold_users = set(all_users[train_idx])
eval_fold_users  = set(all_users[eval_idx])
assert not train_fold_users & eval_fold_users, 'Leakage!'

eval_df = (ground_truth[ground_truth.index.isin(eval_fold_users)]
           .reset_index()
           .rename(columns = {'true_movies': 'true_items'}))

warm_users = set(user_liked[user_liked.map(len) >= COLD_START_THRESHOLD].index)
eval_warm  = eval_df[eval_df['userId'].isin(warm_users)].reset_index(drop=True)

print(f'Train fold : {len(train_fold_users):,} users')
print(f'Eval  fold : {len(eval_fold_users):,} users')
print(f'Warm eval  : {len(eval_warm):,} users')
cold_pct = (len(eval_df) - len(eval_warm)) / len(eval_df)
print(f'Cold eval  : {len(eval_df)-len(eval_warm):,} ({cold_pct:.1%})')

# Populate _seen_cache here β€” before any training loop that calls get_seen().
# The empty-dict initialisation in the function cell is a safe fallback;
# this replaces it with the real per-user seen sets so that all retriever
# functions correctly exclude already-rated movies during training.
_seen_cache = train_ratings.groupby('userId')['movieId'].apply(set).to_dict()
print(f'_seen_cache populated: {len(_seen_cache):,} users')

Train fold : 163,311 users
Eval  fold : 28,820 users
Warm eval  : 28,509 users
Cold eval  : 311 (1.1%)
_seen_cache populated: 200,880 users

Back to top


4. Item-based collaborative filtering

📖 Batched sparse similarity

We build a sparse (movie x user) rating matrix, L2-normalise each movie’s row, then compute cosine similarity in batches of 2,000 movies. This avoids an 87k x 87k dense matrix (~30 GB). The full sparse matrix and its normalised copy are freed immediately after the lookup table is populated.
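The batching trick can be sketched densely on a toy 4-movie x 3-user matrix (the real code keeps everything sparse; the rating values here are made up):

```python
import numpy as np

# Toy 4-movie x 3-user rating matrix (dense for clarity; the notebook uses CSR)
R = np.array([[5., 4., 0.],
              [5., 4., 0.],
              [0., 0., 4.],
              [4., 0., 5.]], dtype='float32')
R_norm = R / np.linalg.norm(R, axis=1, keepdims=True)   # L2-normalise each movie row

BATCH = 2
blocks = []
for start in range(0, len(R_norm), BATCH):
    # each pass materialises only a (BATCH x n_movies) similarity slice,
    # never the full n_movies x n_movies matrix
    blocks.append(R_norm[start:start + BATCH] @ R_norm.T)
sims = np.vstack(blocks)
# movies 0 and 1 have identical rating vectors, so their cosine similarity is 1.0
```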

# Build a movie x user rating matrix
liked_cf = (train_ratings[train_ratings['rating'] >= RATING_THRESHOLD][['userId','movieId','rating']].copy())

movie_ids_cf = liked_cf['movieId'].unique()
user_ids_cf  = liked_cf['userId'].unique()

m_idx = {m: i for i, m in enumerate(movie_ids_cf)}
u_idx = {u: i for i, u in enumerate(user_ids_cf)}
idx_m = {i: m for m, i in m_idx.items()}

R = csr_matrix((liked_cf['rating'].values.astype('float32'),
                (liked_cf['movieId'].map(m_idx).values,
                 liked_cf['userId'].map(u_idx).values)),
               shape = (len(movie_ids_cf), len(user_ids_cf)),)

del liked_cf; gc.collect()

R_norm = normalize(R, norm='l2', axis=1)
print(f'Rating matrix: {R.shape}  nnz={R.nnz:,}')
print(f'Memory: R={R.data.nbytes/1e6:.0f} MB  R_norm={R_norm.data.nbytes/1e6:.0f} MB')
Rating matrix: (50644, 200561)  nnz=14,311,277
Memory: R=57 MB  R_norm=57 MB
# Batched KNN, top-K neighbors
K_NN  = 50
BATCH = 2_000
item_sim_lookup = {}
n_movies_cf = R_norm.shape[0]
t0 = time.time()

for start in range(0, n_movies_cf, BATCH):

    batch = R_norm[start : start + BATCH]
    sims  = (batch @ R_norm.T).toarray()

    for local_i, sim_row in enumerate(sims):

        global_i = start + local_i
        sim_row[global_i] = 0.0
        top_k = np.argpartition(sim_row, -K_NN)[-K_NN:]
        top_k = top_k[np.argsort(sim_row[top_k])[::-1]]
        movie_id = idx_m[global_i]
        item_sim_lookup[movie_id] = [(idx_m[j], float(sim_row[j])) for j in top_k]

    if start % 20000 == 0:

        print(f'  {start:>6}/{n_movies_cf}  {time.time()-t0:.0f}s')

del R, R_norm; gc.collect()
print(f'Item-sim lookup built for {len(item_sim_lookup):,} movies in {time.time()-t0:.0f}s')
       0/50644  14s
   20000/50644  86s
   40000/50644  156s
Item-sim lookup built for 50,644 movies in 190s

Back to top


5. Temporal sequence modelling

📖 Recency-weighted taste vectors

We pre-compute per-user genre taste vectors using exponential recency weighting (half-life = 365 days). The movie genre matrix is pre-normalised as float32 and stored once; per-user dot products at inference time are fast matrix-vector multiplies.
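The half-life weighting reduces to w(age) = exp(-ln 2 / 365 · age_days), so a like from a year ago counts half as much as one from today. A stdlib-only sketch:

```python
import math

DECAY_HALF_LIFE = 365                 # days, matching the cell below
DECAY_K = math.log(2) / DECAY_HALF_LIFE

weights = [math.exp(-DECAY_K * age) for age in (0, 365, 730)]
# today -> 1.0, one year old -> 0.5, two years old -> 0.25
```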

# Compute recent user tastes
DECAY_HALF_LIFE = 365
DECAY_K         = np.log(2) / DECAY_HALF_LIFE
now_ts          = int(train_ratings['timestamp'].max())

liked_ts = train_ratings[train_ratings['rating'] >= RATING_THRESHOLD][['userId','movieId','timestamp']]

# Compute decay weight per row as a float32 column
liked_ts['weight'] = np.exp(-DECAY_K * (now_ts - liked_ts['timestamp'].values.astype('float32')) / 86400).astype('float32')

uid_uniq = liked_ts['userId'].unique()
mid_uniq = liked_ts['movieId'].unique()
u2i = {u: i for i, u in enumerate(uid_uniq)}
m2i = {m: i for i, m in enumerate(mid_uniq)}

W = csr_matrix((liked_ts['weight'].values,
               (liked_ts['userId'].map(u2i).values,
               liked_ts['movieId'].map(m2i).values)),
               shape = (len(uid_uniq), len(mid_uniq)),)

del liked_ts; gc.collect()

# Genre matrix aligned to mid_uniq order
G = movie_genre.reindex(mid_uniq).fillna(0).values.astype('float32')

# One sparse matmul
taste_raw = W @ G
row_norms  = np.linalg.norm(taste_raw, axis = 1, keepdims = True).clip(min = 1e-9)
taste_norm = (taste_raw / row_norms).astype('float32')

user_taste_vec = {uid_uniq[i]: taste_norm[i] for i in range(len(uid_uniq))}

del W, G, taste_raw, taste_norm, row_norms; gc.collect()
print(f'Taste vectors built for {len(user_taste_vec):,} users')
Taste vectors built for 200,561 users
# Build a normalised movie-genre matrix for S2 and S4 retrievers.
# Expanding to all rated movies (movie_feat.index) rather than just
# POPULARITY_POOL raises the retrieval ceiling: S2/S4 were previously
# blind to ~27K movies that had ratings but no liked (4+) ratings.
movie_genre_aligned = movie_genre.reindex(movie_feat.index).fillna(0)
movie_genre_norm    = normalize(movie_genre_aligned.values.astype('float32'), norm='l2', axis=1)
genre_index_order   = movie_genre_aligned.index.tolist()

del movie_genre_aligned; gc.collect()

print(f'genre_norm matrix: {movie_genre_norm.shape}  |  '
      f'{movie_genre_norm.nbytes/1e6:.0f} MB')
genre_norm matrix: (77369, 19)  |  6 MB
# Assemble the inputs for LightGBM
FEATURE_COLS = ['u_rating_count', 'u_mean_rating', 'u_rating_freq', 'u_rating_std', 'm_global_count', 'm_log_count', 'm_global_mean', 'm_bayesian', 'm_rating_variance', 'm_tag_count', 'genre_affinity',]
TRAIN_USERS_SAMPLE = 8000
NEGATIVES_PER_POS  = 4
K_CAND             = 200

rng = np.random.default_rng(100)

sample_users = rng.choice(list(train_fold_users & warm_users),
                          size = min(TRAIN_USERS_SAMPLE, len(train_fold_users & warm_users)),
                          replace = False,)

# ── Batch-precompute S2 and S4 genre scores ──────────────────────────────
# s2_personal and s4_temporal each run movie_genre_norm @ vector inside the
# loop. For 8,000 users that is 16,000 individual (77K×19)@(19,) BLAS calls.
# Two batch matmuls replace all of them; BLAS executes a single fused kernel
# that is 10-50x faster than the equivalent number of small calls.

# S2 batch: normalise each user's genre-affinity row, then one matmul
uga_sample = user_genre_affinity.reindex(sample_users).fillna(0).values.astype('float32')
uga_norms  = np.linalg.norm(uga_sample, axis=1, keepdims=True).clip(min=1e-9)
s2_scores  = movie_genre_norm @ (uga_sample / uga_norms).T  # (n_movies, n_users)
del uga_sample, uga_norms; gc.collect()

# S4 batch: taste vectors are already unit-normalised
taste_sample = np.vstack([user_taste_vec.get(uid, np.zeros(len(genre_cols), dtype='float32')) for uid in sample_users]).astype('float32')
s4_scores = movie_genre_norm @ taste_sample.T   # (n_movies, n_users)
del taste_sample; gc.collect()

print(f's2_scores: {s2_scores.shape}  s4_scores: {s4_scores.shape}  '
      f'({(s2_scores.nbytes + s4_scores.nbytes)/1e6:.0f} MB combined)')

# ── Build (userId, movieId, label) pairs ─────────────────────────────────
# Each iteration slices one precomputed column (O(1)) instead of running a
# fresh matmul. np.argpartition finds the top-K in O(n) vs O(n log n) for
# a full argsort; only the K retained entries are then sorted.

pair_rows = []

for i, uid in enumerate(sample_users):

    liked_list = user_liked.get(uid, [])

    if not liked_list:

        continue

    seen = get_seen(uid)

    # S2 candidates β€” slice precomputed column, partition, filter seen
    s2_col   = s2_scores[:, i]
    s2_top   = np.argpartition(s2_col, -K_CAND)[-K_CAND:]
    s2_top   = s2_top[np.argsort(s2_col[s2_top])[::-1]]
    s2_cands = [genre_index_order[j] for j in s2_top if genre_index_order[j] not in seen]

    # S3 candidates β€” item-CF uses irregular dict structure; stays sequential
    s3_cands = s3_item_cf(uid, K=K_CAND)

    # S4 candidates β€” slice precomputed column, partition, filter seen
    s4_col   = s4_scores[:, i]
    s4_top   = np.argpartition(s4_col, -K_CAND)[-K_CAND:]
    s4_top   = s4_top[np.argsort(s4_col[s4_top])[::-1]]
    s4_cands = [genre_index_order[j] for j in s4_top if genre_index_order[j] not in seen]

    liked_set  = set(liked_list)
    neg_pool   = list(dict.fromkeys(s2_cands + s3_cands + s4_cands))
    neg_pool   = [m for m in neg_pool if m not in liked_set]
    n_neg      = min(len(liked_list) * NEGATIVES_PER_POS, len(neg_pool))
    neg_movies = rng.choice(neg_pool, size=n_neg, replace=False).tolist() if n_neg > 0 else []

    for m in liked_list:

        pair_rows.append((uid, m, 1))

    for m in neg_movies:

        pair_rows.append((uid, m, 0))

del s2_scores, s4_scores; gc.collect()

pairs = pd.DataFrame(pair_rows, columns = ['userId', 'movieId', 'label'])
del pair_rows; gc.collect()
s2_scores: (77369, 8000)  s4_scores: (77369, 8000)  (4952 MB combined)
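The np.argpartition idiom used throughout these loops, isolated on five toy scores:

```python
import numpy as np

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
K = 3
top = np.argpartition(scores, -K)[-K:]    # indices of the K largest, unordered (O(n))
top = top[np.argsort(scores[top])[::-1]]  # sort only those K entries, descending
# top is [1, 3, 4] -> scores 0.9, 0.7, 0.5
```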





# Join user features
u_feat = user_stats[['rating_count','mean_rating','rating_freq','rating_std']].rename(columns = {'rating_count': 'u_rating_count',
                                                                                                 'mean_rating' : 'u_mean_rating',
                                                                                                 'rating_freq' : 'u_rating_freq',
                                                                                                 'rating_std'  : 'u_rating_std',})

pairs = pairs.merge(u_feat, left_on = 'userId', right_index = True, how = 'left')
pairs.head()
   userId  movieId  label  u_rating_count  u_mean_rating  u_rating_freq  u_rating_std
0   19670      310      1              34         3.6471        34.0000        0.9173
1   19670      316      1              34         3.6471        34.0000        0.9173
2   19670      322      1              34         3.6471        34.0000        0.9173
3   19670      347      1              34         3.6471        34.0000        0.9173
4   19670      358      1              34         3.6471        34.0000        0.9173
# Join movie features
m_feat = movie_feat[['global_count','log_count','global_mean','bayesian_score', 'rating_variance','tag_count']].rename(columns={'global_count'   : 'm_global_count',
                                                                                                                                'log_count'      : 'm_log_count',
                                                                                                                                'global_mean'    : 'm_global_mean',
                                                                                                                                'bayesian_score' : 'm_bayesian',
                                                                                                                                'rating_variance': 'm_rating_variance',
                                                                                                                                'tag_count'      : 'm_tag_count',})
pairs = pairs.merge(m_feat, left_on = 'movieId', right_index = True, how='left')

# Compute genre affinity via stacked dot products. L2-normalise the movie
# genre vectors so the result is cosine similarity — consistent with the
# per-movie normalisation inside build_lgb_features.
taste_matrix    = np.vstack([user_taste_vec.get(uid, np.zeros(len(genre_cols), dtype='float32')) for uid in pairs['userId']]).astype('float32')
genre_matrix_raw = movie_genre.reindex(pairs['movieId']).fillna(0).values.astype('float32')
genre_norms      = np.linalg.norm(genre_matrix_raw, axis=1, keepdims=True).clip(min=1e-9)
genre_matrix     = genre_matrix_raw / genre_norms
pairs['genre_affinity'] = (taste_matrix * genre_matrix).sum(axis=1).astype('float32')
del taste_matrix, genre_matrix_raw, genre_norms, genre_matrix; gc.collect()

# Fill missing with global vals
pairs['u_rating_count']    = pairs['u_rating_count'].fillna(0).astype('float32')
pairs['u_mean_rating']     = pairs['u_mean_rating'].fillna(GLOBAL_MEAN).astype('float32')
pairs['u_rating_freq']     = pairs['u_rating_freq'].fillna(0).astype('float32')
pairs['u_rating_std']      = pairs['u_rating_std'].fillna(0).astype('float32')
pairs['m_global_count']    = pairs['m_global_count'].fillna(0).astype('float32')
pairs['m_log_count']       = pairs['m_log_count'].fillna(0).astype('float32')
pairs['m_global_mean']     = pairs['m_global_mean'].fillna(GLOBAL_MEAN).astype('float32')
pairs['m_bayesian']        = pairs['m_bayesian'].fillna(GLOBAL_MEAN).astype('float32')
pairs['m_rating_variance'] = pairs['m_rating_variance'].fillna(0).astype('float32')
pairs['m_tag_count']       = pairs['m_tag_count'].fillna(0).astype('float32')

lgb_train_df = pairs
del pairs; gc.collect()
print(f'LightGBM set: {lgb_train_df.shape}  pos rate: {lgb_train_df["label"].mean():.2%}')
LightGBM set: (2275394, 14)  pos rate: 26.12%
lgb_train_df.head()
   userId  movieId  label  u_rating_count  u_mean_rating  u_rating_freq  u_rating_std  m_global_count  m_log_count  m_global_mean  m_bayesian  m_rating_variance  m_tag_count  genre_affinity
0   19670      310      1         34.0000         3.6471        34.0000        0.9173      30624.0000      10.3296         3.4252      3.4254             0.7641       6.0753          0.4776
1   19670      316      1         34.0000         3.6471        34.0000        0.9173      20209.0000       9.9139         3.8338      3.8331             0.7297       6.7558          0.6299
2   19670      322      1         34.0000         3.6471        34.0000        0.9173       9327.0000       9.1408         3.3093      3.3106             0.8477       4.6444          0.7114
3   19670      347      1         34.0000         3.6471        34.0000        0.9173      58465.0000      10.9762         4.0874      4.0869             0.7139       7.8652          0.4049
4   19670      358      1         34.0000         3.6471        34.0000        0.9173      10647.0000       9.2731         3.9646      3.9627             0.9441       3.6376          0.7347

Back to top


6. Evaluation harness & S1-S5 strategies

Five metrics at K=5 and K=10. The composite score is their unweighted mean. All strategies exclude movies already seen in train_ratings.
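The five metrics themselves are defined in the harness code (not shown in this excerpt); as an illustration of the K-truncated pattern such metrics share, here are standard precision@K and recall@K helpers (a sketch of the common definitions, not the notebook's exact implementation):

```python
def precision_at_k(recommended, true_items, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for m in recommended[:k] if m in true_items) / k

def recall_at_k(recommended, true_items, k):
    """Fraction of the relevant items captured in the top-k."""
    if not true_items:
        return 0.0
    return sum(1 for m in recommended[:k] if m in true_items) / len(true_items)

recs  = [10, 20, 30, 40, 50]   # toy ranked movieIds
truth = {20, 50, 99}           # toy held-out liked movieIds
p5 = precision_at_k(recs, truth, 5)   # 2 hits / 5 recommended = 0.4
r5 = recall_at_k(recs, truth, 5)      # 2 hits / 3 relevant
```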

# Train base LightGBM model
xtrain = lgb_train_df[FEATURE_COLS].values
ytrain = lgb_train_df['label'].values
del lgb_train_df; gc.collect()

# Hold out 15% of rows as a proper validation set for early stopping
x_tr, x_val, y_tr, y_val = train_test_split(xtrain, ytrain, test_size = 0.15, random_state = 100, stratify = ytrain)

lgb_params = {'objective'        : 'binary',
              'metric'           : 'auc',
              'learning_rate'    : 0.05,
 #             'num_leaves'       : 63,
 #             'min_child_samples': 20,
              'feature_fraction' : 0.8,
              'bagging_fraction' : 0.8,
              'bagging_freq'     : 5,
              'verbose'          : -1,
              'n_jobs'           : -1,}

lgb_model = lgb.train(lgb_params,
                      lgb.Dataset(x_tr, label = y_tr, feature_name = FEATURE_COLS),
                      num_boost_round = 500,
                      valid_sets = [lgb.Dataset(x_val, label = y_val)],
                      callbacks = [lgb.early_stopping(20, verbose = False),
                                   lgb.log_evaluation(50)],)

del xtrain, ytrain, x_tr, x_val, y_tr, y_val; gc.collect()

print(f'LightGBM trained  best_iter = {lgb_model.best_iteration}')
[50]	valid_0's auc: 0.85308
[100]	valid_0's auc: 0.857502
[150]	valid_0's auc: 0.860966
[200]	valid_0's auc: 0.864246
[250]	valid_0's auc: 0.866321
[300]	valid_0's auc: 0.868303
[350]	valid_0's auc: 0.869454
[400]	valid_0's auc: 0.871189
[450]	valid_0's auc: 0.872391
[500]	valid_0's auc: 0.87376
LightGBM trained  best_iter = 500

Back to top


7. S6 architecture & cold-start gate

# Apply function to check for cold starts
cold_n = sum(is_cold(uid) for uid in eval_fold_users)
print(f'Cold in eval fold: {cold_n:,}  ({100*cold_n/len(eval_fold_users):.1f}%)')
Cold in eval fold: 311  (1.1%)

8. Stage 1 — Expanded candidate pool

# Get stage 1 candidates and diagnostics
N_STAGE1 = 200
DIAG_N = 500

diag_users = eval_warm.sample(n = DIAG_N, random_state = 100)
recalls = []

for _, row in diag_users.iterrows():

    pool = set(stage1_candidates(row['userId']))
    true = row['true_items']
    recalls.append(len(pool & true) / len(true) if true else 0.0)

print(f'Stage-1 Recall@{N_STAGE1} (n={DIAG_N}): {np.mean(recalls):.4f}')
Stage-1 Recall@200 (n=500): 0.5524

9. Stage 2 — Meta-ranker training

# Init params
STAGE2_FEATURE_COLS = FEATURE_COLS + ['in_s2', 'in_s3', 'in_s4',
                                      'rank_s2', 'rank_s3', 'rank_s4',
                                      'n_retrievers',
                                      's5_score',]  # s5_score = base LightGBM predicted probability — strongest available signal
META_TRAIN_USERS = 8000  # doubled from 4000
META_NEG_RATIO   = 5

meta_users = rng.choice(list(train_fold_users & warm_users),
                        size = min(META_TRAIN_USERS, len(train_fold_users & warm_users)),
                        replace = False,)
# ── Batch-precompute S2 and S4 genre scores for meta training ───────────
# Same optimisation as the base LGB pairs cell: replace per-user matmuls
# with two batch matmuls before the loop.
uga_meta  = user_genre_affinity.reindex(meta_users).fillna(0).values.astype('float32')
uga_norms = np.linalg.norm(uga_meta, axis=1, keepdims=True).clip(min=1e-9)
s2_scores_meta = movie_genre_norm @ (uga_meta / uga_norms).T  # (n_movies, n_meta_users)
del uga_meta, uga_norms; gc.collect()

taste_meta = np.vstack([user_taste_vec.get(uid, np.zeros(len(genre_cols), dtype='float32')) for uid in meta_users]).astype('float32')
s4_scores_meta = movie_genre_norm @ taste_meta.T  # (n_movies, n_meta_users)
del taste_meta; gc.collect()

print(f'Meta score matrices: {s2_scores_meta.shape}  '
      f'({(s2_scores_meta.nbytes + s4_scores_meta.nbytes)/1e6:.0f} MB combined)')

# ── Collect (uid, movie, label, ranks) ───────────────────────────────────
pair_rows = []
t0 = time.time()

for i, uid in enumerate(meta_users):

    # S2: slice precomputed column, partition top-N_STAGE1, filter seen
    seen   = get_seen(uid)
    s2_col = s2_scores_meta[:, i]
    s2_top = np.argpartition(s2_col, -N_STAGE1)[-N_STAGE1:]
    s2_top = s2_top[np.argsort(s2_col[s2_top])[::-1]]
    s2l    = [genre_index_order[j] for j in s2_top
              if genre_index_order[j] not in seen][:N_STAGE1]

    # S3: item-CF stays sequential
    s3l = s3_item_cf(uid, K=N_STAGE1)

    # S4: slice precomputed column, partition top-N_STAGE1, filter seen
    s4_col = s4_scores_meta[:, i]
    s4_top = np.argpartition(s4_col, -N_STAGE1)[-N_STAGE1:]
    s4_top = s4_top[np.argsort(s4_col[s4_top])[::-1]]
    s4l    = [genre_index_order[j] for j in s4_top if genre_index_order[j] not in seen][:N_STAGE1]

    cands = list(dict.fromkeys(s2l + s3l + s4l))

    # Positives: candidates that appear in the user's held-out ground truth.
    # Using ground_truth (test-set labels) rather than user_liked (training-set
    # liked items) ensures positives are present in the retrieval pool and
    # therefore carry meaningful retriever rank features (in_s2/s3/s4, rank_s*).
    true_set = ground_truth[uid] if uid in ground_truth.index else set()
    pos_c = [m for m in cands if m in true_set]

    if not pos_c:

        continue

    # Negatives: candidates not in the user's true set
    neg_c = [m for m in cands if m not in true_set]
    n_neg = min(len(pos_c) * META_NEG_RATIO, len(neg_c))
    neg_s = rng.choice(neg_c, size=n_neg, replace=False).tolist() if n_neg > 0 else []
    all_c = pos_c + neg_s

    if not all_c:

        continue

    s2r = {m: r for r, m in enumerate(s2l)}
    s3r = {m: r for r, m in enumerate(s3l)}
    s4r = {m: r for r, m in enumerate(s4l)}

    for m in all_c:

        pair_rows.append((uid,
                          m,
                          int(m in true_set),
                          int(m in s2r),
                          int(m in s3r),
                          int(m in s4r),
                          s2r.get(m, N_STAGE1),
                          s3r.get(m, N_STAGE1),
                          s4r.get(m, N_STAGE1),))

    if (i + 1) % 500 == 0:

        print(f'  {i+1}/{len(meta_users)}  {time.time()-t0:.0f}s')

del s2_scores_meta, s4_scores_meta; gc.collect()
print(f'Retrieval done: {len(pair_rows):,} pairs in {time.time()-t0:.0f}s')
Meta score matrices: (77369, 8000)  (4952 MB combined)
  500/8000  5s
  1000/8000  10s
  1500/8000  14s
  2000/8000  19s
  2500/8000  24s
  3000/8000  28s
  3500/8000  33s
  4000/8000  38s
  4500/8000  43s
  5000/8000  47s
  5500/8000  53s
  6000/8000  57s
  6500/8000  62s
  7000/8000  67s
  7500/8000  72s
  8000/8000  77s
Retrieval done: 188,689 pairs in 80s
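The top-N slicing inside the loop above relies on a standard NumPy idiom: `argpartition` isolates the N largest entries in O(n) without a full sort, then a small `argsort` over just those N puts them in descending order. A self-contained illustration:

```python
import numpy as np

scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
N = 3
top = np.argpartition(scores, -N)[-N:]    # indices of the 3 largest (unordered)
top = top[np.argsort(scores[top])[::-1]]  # order those 3 by descending score
# top -> array([1, 3, 2])  (scores 0.9, 0.7, 0.4)
```

For the 77K-movie score columns here, this avoids a full 77K-element sort per user per retriever.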
# Build the data from the results
pairs = pd.DataFrame(pair_rows, columns = ['userId','movieId','label', 'in_s2','in_s3','in_s4', 'rank_s2','rank_s3','rank_s4',])
pairs['n_retrievers'] = (pairs['in_s2'] + pairs['in_s3'] + pairs['in_s4']).astype('int8')
pairs[['in_s2','in_s3','in_s4']]         = pairs[['in_s2','in_s3','in_s4']].astype('int8')
pairs[['rank_s2','rank_s3','rank_s4']]   = pairs[['rank_s2','rank_s3','rank_s4']].astype('int16')
# Join user features
u_feat = user_stats[['rating_count','mean_rating','rating_freq','rating_std']].rename(columns = {'rating_count': 'u_rating_count',
                                                                                                 'mean_rating' : 'u_mean_rating',
                                                                                                 'rating_freq' : 'u_rating_freq',
                                                                                                 'rating_std'  : 'u_rating_std',})
pairs = pairs.merge(u_feat, left_on = 'userId', right_index = True, how = 'left')
# Join movie features
m_feat = movie_feat[['global_count','log_count','global_mean','bayesian_score', 'rating_variance','tag_count']].rename(columns = {'global_count'   : 'm_global_count',
                                                                                                                                  'log_count'      : 'm_log_count',
                                                                                                                                  'global_mean'    : 'm_global_mean',
                                                                                                                                  'bayesian_score' : 'm_bayesian',
                                                                                                                                  'rating_variance': 'm_rating_variance',
                                                                                                                                  'tag_count'      : 'm_tag_count',})

pairs = pairs.merge(m_feat, left_on = 'movieId', right_index = True, how = 'left')
# Join the vectorized genre affinity
taste_matrix    = np.vstack([user_taste_vec.get(uid, np.zeros(len(genre_cols), dtype = 'float32')) for uid in pairs['userId']]).astype('float32')
genre_matrix_raw = movie_genre.reindex(pairs['movieId']).fillna(0).values.astype('float32')
genre_norms      = np.linalg.norm(genre_matrix_raw, axis=1, keepdims=True).clip(min=1e-9)
genre_matrix     = genre_matrix_raw / genre_norms
pairs['genre_affinity'] = (taste_matrix * genre_matrix).sum(axis=1).astype('float32')
del taste_matrix, genre_matrix_raw, genre_norms, genre_matrix; gc.collect()
0
pairs.head()
(pairs.head(): 5 rows × 21 columns, showing userId, movieId, label, the in_s2/in_s3/in_s4 flags, rank_s2/rank_s3/rank_s4, n_retrievers, the joined user statistics (u_rating_count, u_mean_rating, u_rating_freq, u_rating_std), the joined movie statistics (m_global_count, m_log_count, m_global_mean, m_bayesian, m_rating_variance, m_tag_count), and genre_affinity)
pairs.isnull().sum()
userId               0
movieId              0
label                0
in_s2                0
in_s3                0
in_s4                0
rank_s2              0
rank_s3              0
rank_s4              0
n_retrievers         0
u_rating_count       0
u_mean_rating        0
u_rating_freq        0
u_rating_std         0
m_global_count       0
m_log_count          0
m_global_mean        0
m_bayesian           0
m_rating_variance    0
m_tag_count          0
genre_affinity       0
dtype: int64

# The null check above shows every joined column is fully populated, so the
# explicit fillna pass (per-column defaults of 0 / GLOBAL_MEAN) is unnecessary here.

# Compute S5 predicted probability for each candidate.
# This is the single strongest per-(user,movie) relevance signal;
# the meta-ranker can learn to weight it, correct it with retriever
# rank signals, or override it for users where CF is more reliable.
pairs['s5_score'] = lgb_model.predict(pairs[FEATURE_COLS].values).astype('float32')

meta_train_df = pairs[STAGE2_FEATURE_COLS + ['label']].reset_index(drop = True)
print(f'Meta set: {meta_train_df.shape}  pos rate: {meta_train_df["label"].mean():.2%}')
Meta set: (188689, 20)  pos rate: 16.68%
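A quick sanity check on that positive rate: with `META_NEG_RATIO = 5`, each positive is paired with at most 5 sampled negatives, so the positive rate is bounded below by 1/(1+5) ≈ 16.67%. The observed 16.68% sits right at this floor, meaning the candidate pools almost always contained enough negatives to fill the quota.

```python
# Floor on the positive rate implied by a 1:5 positive-to-negative sampling ratio
floor = 1 / (1 + 5)
print(f'{floor:.2%}')  # 16.67%, matching the observed 16.68% pos rate
```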
meta_train_df.head()
(meta_train_df.head(): 5 rows × 20 columns, the 19 Stage-2 feature columns followed by label)
# Train the meta-ranker models with a proper held-out validation split
xmeta = meta_train_df[STAGE2_FEATURE_COLS].values
ymeta = meta_train_df['label'].values

# Hold out 15% for validation to enable meaningful early stopping
xm_tr, xm_val, ym_tr, ym_val = train_test_split(
    xmeta, ymeta, test_size=0.15, random_state=42, stratify=ymeta)

meta_lgb_params = {'objective':'binary',
                   'metric':'auc',
#                   'num_leaves':127,
                   'learning_rate':0.03,
                   'feature_fraction':0.8,
                   'bagging_fraction':0.8,
                   'bagging_freq':5,
                   'verbose':-1,
                   'n_jobs':-1,}

meta_lgb = lgb.train(meta_lgb_params,
                     lgb.Dataset(xm_tr,
                                 label = ym_tr,
                                 feature_name = STAGE2_FEATURE_COLS),
                     num_boost_round = 500,
                     valid_sets = [lgb.Dataset(xm_val, label = ym_val)],
                     callbacks = [lgb.early_stopping(30, verbose = False), lgb.log_evaluation(100)],)

print(f'Meta-LightGBM trained  best_iter={meta_lgb.best_iteration}')

meta_xgb = XGBClassifier(learning_rate = 0.03,
                         #max_depth = 6,
                         #n_estimators = 400,
                         subsample = 0.8,
                         colsample_bytree = 0.8,
                         eval_metric = 'auc',
                         early_stopping_rounds = 30,
                         verbosity = 0,)

meta_xgb.fit(xm_tr, ym_tr, eval_set = [(xm_val, ym_val)], verbose=False)
print(f'Meta-XGBoost trained  best_iter={meta_xgb.best_iteration}')
[100]	valid_0's auc: 0.856981
[200]	valid_0's auc: 0.859101
[300]	valid_0's auc: 0.860041
[400]	valid_0's auc: 0.860901
Meta-LightGBM trained  best_iter=441
Meta-XGBoost trained  best_iter=99
del xmeta, ymeta, xm_tr, xm_val, ym_tr, ym_val, meta_train_df, pairs, pair_rows; gc.collect()
28
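`s7_ensemble` (benchmarked in the next section) draws on both trained meta-rankers. One plausible combination rule, sketched here with hypothetical names and a hypothetical 50/50 weighting, is a convex blend of the two predicted probabilities over the shared candidate list; this is an illustration, not necessarily the notebook's exact implementation:

```python
import numpy as np

# Convex blend of two meta-ranker probability vectors (hypothetical sketch)
def blend_meta_scores(p_lgb, p_xgb, w=0.5):
    return w * np.asarray(p_lgb) + (1 - w) * np.asarray(p_xgb)

blended = blend_meta_scores([0.8, 0.2], [0.6, 0.4])
# blended -> array([0.7, 0.3]); ranking by blended score then yields the top-K
```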



10. Full benchmark: S1 → S7

# Verify _seen_cache is populated (built after the user split; confirmed here)
print(f'_seen_cache: {len(_seen_cache):,} users  |  '
      f'{sum(len(v) for v in _seen_cache.values()):,} total seen entries')
_seen_cache: 200,880 users  |  28,583,599 total seen entries
%%time

# ── Batch-precompute S2 and S4 score vectors for all eval_warm users ─────
# evaluate_strategy calls s2_personal and s4_temporal per user, each firing
# a (77K×19)@(19,) matmul. For 2,000 users × 7 strategies × 2 Ks that is
# 56,000 individual BLAS calls, and S5/S6/S7 call S2+S4 internally, tripling
# the count further. Two batch matmuls replace all of them.
eval_user_ids  = eval_warm['userId'].values

uga_eval   = user_genre_affinity.reindex(eval_user_ids).fillna(0).values.astype('float32')
uga_norms  = np.linalg.norm(uga_eval, axis = 1, keepdims = True).clip(min = 1e-9)
s2_matrix  = movie_genre_norm @ (uga_eval / uga_norms).T
del uga_eval, uga_norms

taste_eval = np.vstack([user_taste_vec.get(uid, np.zeros(len(genre_cols), dtype='float32')) for uid in eval_user_ids]).astype('float32')
s4_matrix  = movie_genre_norm @ taste_eval.T
del taste_eval

# Populate the score caches so s2_personal / s4_temporal skip the matmul
_s2_score_cache = {uid: s2_matrix[:, i] for i, uid in enumerate(eval_user_ids)}
_s4_score_cache = {uid: s4_matrix[:, i] for i, uid in enumerate(eval_user_ids)}
del s2_matrix, s4_matrix; gc.collect()
print(f'Score caches populated for {len(_s2_score_cache):,} eval users')

# Perform the benchmark analysis
print('Initiating Benchmark Analysis..')

STRATEGIES = {'S1: Popularity'      : s1_popularity,
              'S2: Personal'        : s2_personal,
              'S3: Item-CF'         : s3_item_cf,
              'S4: Temporal'        : s4_temporal,
              'S5: LightGBM'        : s5_lgbm,
              'S6: XGBoost'         : s6_xgboost,
              'S7: Master Ensemble' : s7_ensemble,}

# All five metrics are reported; F1 is retained for display but excluded from
# the composite score. F1 = HM(Precision, Recall); including it alongside
# Precision and Recall gives the P/R axis 3x the weight of NDCG and Hit Rate
# in a simple average, making the composite uninterpretable as a balanced
# summary. The composite uses only the four non-redundant metrics.
metric_keys    = ['precision', 'recall', 'f1', 'ndcg', 'hit_rate']
composite_keys = ['precision', 'recall', 'ndcg', 'hit_rate']

all_results = {}

for K in [5, 10]:

    print(f'\n-- K = {K} --')

    for name, fn in STRATEGIES.items():

        t0  = time.time()
        res = evaluate_strategy(fn, eval_warm, K = K, n = 2000, seed = 100)
        res['composite'] = float(np.mean([res[m] for m in composite_keys]))
        all_results[(name, K)] = res

        print(f'  {name:<25}  composite = {res["composite"]:.4f}  '
              f'NDCG = {res["ndcg"]:.4f}  HR = {res["hit_rate"]:.4f}  ({time.time()-t0:.1f}s)')

# Clear caches after benchmark β€” they are specific to eval_warm users
_s2_score_cache.clear()
_s4_score_cache.clear()
print('Benchmark Analysis complete..')
Score caches populated for 28,509 eval users

-- K = 5 --
  S1: Popularity             composite = 0.0671  NDCG = 0.0457  HR = 0.1545  (5.1s)
  S2: Personal               composite = 0.0029  NDCG = 0.0016  HR = 0.0075  (8.9s)
  S3: Item-CF                composite = 0.1094  NDCG = 0.0808  HR = 0.2375  (2.4s)
  S4: Temporal               composite = 0.0021  NDCG = 0.0014  HR = 0.0055  (9.2s)
  S5: LightGBM               composite = 0.0231  NDCG = 0.0147  HR = 0.0560  (131.2s)
  S6: XGBoost                composite = 0.1186  NDCG = 0.0828  HR = 0.2630  (187.6s)
  S7: Master Ensemble        composite = 0.1230  NDCG = 0.0875  HR = 0.2710  (162.7s)

-- K = 10 --
  S1: Popularity             composite = 0.0893  NDCG = 0.0486  HR = 0.2270  (5.2s)
  S2: Personal               composite = 0.0041  NDCG = 0.0017  HR = 0.0120  (14.3s)
  S3: Item-CF                composite = 0.1452  NDCG = 0.0882  HR = 0.3475  (2.4s)
  S4: Temporal               composite = 0.0038  NDCG = 0.0016  HR = 0.0110  (13.7s)
  S5: LightGBM               composite = 0.0296  NDCG = 0.0151  HR = 0.0805  (134.0s)
  S6: XGBoost                composite = 0.1616  NDCG = 0.0939  HR = 0.3905  (189.3s)
  S7: Master Ensemble        composite = 0.1652  NDCG = 0.0992  HR = 0.3930  (161.3s)
CPU times: user 57min 8s, sys: 29 s, total: 57min 37s
Wall time: 17min 15s
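The exclusion of F1 from the composite (noted in the benchmark cell above) is easy to verify numerically: F1 is fully determined by precision and recall, so averaging all three would triple-weight the P/R axis relative to NDCG and hit rate. The per-user F1 values in the tables are averaged per user, so they need not equal the harmonic mean of the averaged P and R, but the redundancy argument is the same:

```python
# F1 carries no information beyond precision and recall (illustrative values)
p, r = 0.06, 0.10
f1 = 2 * p * r / (p + r)  # harmonic mean of p and r
# f1 == 0.075: entirely a function of p and r
```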
# Compile the leaderboard
records = []

for (name, K), res in all_results.items():

    records.append({'strategy': name, 'K': K, **res})

leaderboard = (pd.DataFrame(records).sort_values(['K','composite'], ascending = [True, False]).reset_index(drop=True))
leaderboard[['K','strategy'] + metric_keys + ['composite']]
     K             strategy  precision  recall      f1    ndcg  hit_rate  composite
0    5  S7: Master Ensemble     0.0727  0.0608  0.0552  0.0875    0.2710     0.1230
1    5          S6: XGBoost     0.0707  0.0580  0.0530  0.0828    0.2630     0.1186
2    5          S3: Item-CF     0.0657  0.0538  0.0492  0.0808    0.2375     0.1094
3    5       S1: Popularity     0.0384  0.0298  0.0275  0.0457    0.1545     0.0671
4    5         S5: LightGBM     0.0118  0.0100  0.0094  0.0147    0.0560     0.0231
5    5         S2: Personal     0.0016  0.0010  0.0011  0.0016    0.0075     0.0029
6    5         S4: Temporal     0.0011  0.0006  0.0006  0.0014    0.0055     0.0021
7   10  S7: Master Ensemble     0.0648  0.1038  0.0666  0.0992    0.3930     0.1652
8   10          S6: XGBoost     0.0622  0.1001  0.0641  0.0939    0.3905     0.1616
9   10          S3: Item-CF     0.0559  0.0893  0.0574  0.0882    0.3475     0.1452
10  10       S1: Popularity     0.0323  0.0492  0.0324  0.0486    0.2270     0.0893
11  10         S5: LightGBM     0.0088  0.0139  0.0091  0.0151    0.0805     0.0296
12  10         S2: Personal     0.0013  0.0016  0.0012  0.0017    0.0120     0.0041
13  10         S4: Temporal     0.0011  0.0016  0.0011  0.0016    0.0110     0.0038
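`evaluate_strategy` is defined earlier in the notebook; the five reported metrics follow the standard top-K definitions. A self-contained sketch for a single user, under the assumption that the notebook uses the conventional binary-relevance formulas (this is an illustration, not the notebook's exact implementation):

```python
import numpy as np

# Standard top-K metrics for one user with binary relevance
def metrics_at_k(recommended, relevant, k):
    topk = recommended[:k]
    hits = [m in relevant for m in topk]
    precision = sum(hits) / k
    recall    = sum(hits) / len(relevant) if relevant else 0.0
    dcg  = sum(h / np.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1 / np.log2(i + 2) for i in range(min(k, len(relevant))))
    ndcg = dcg / idcg if idcg else 0.0
    hit_rate = float(any(hits))  # did at least one relevant item appear?
    return precision, recall, ndcg, hit_rate

p, r, n, h = metrics_at_k([10, 20, 30], {20, 99}, k=3)
# p = 1/3, r = 1/2, hit_rate = 1.0
```

The per-user values are then averaged over the evaluation sample to produce the table rows above.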

11. Benchmark visualisations


# Composite scores at K = 10
fig, ax = plt.subplots(figsize=(10, 5))
lb10 = leaderboard[leaderboard['K'] == 10].sort_values('composite')
bars = ax.barh(lb10['strategy'], lb10['composite'], color = sns.color_palette('husl', len(lb10)))
ax.set_xlabel('Composite Score (mean of P, R, NDCG, HR)')
ax.set_title('Strategy Comparison - MovieLens 32M  |  K = 10')
ax.bar_label(bars, fmt='{:.4f}', padding=4)
plt.tight_layout()
plt.savefig('results_01_composite_k10.png', dpi=150, bbox_inches='tight')
plt.show()

(figure: results_01_composite_k10.png, composite scores by strategy at K = 10)

# ── Per-metric breakdown (K=10) ──────────────────────────────────────────────
lb10 = leaderboard[leaderboard['K'] == 10].set_index('strategy')[metric_keys]
lb10.T.plot(kind='bar', figsize=(12, 5), colormap='tab10', edgecolor='white')
plt.title('Per-Metric Breakdown - MovieLens 32M  |  K=10')
plt.ylabel('Score'); plt.xticks(rotation=0)
plt.legend(loc='upper right', fontsize=8)
plt.tight_layout()
plt.savefig('results_03_per_metric.png', dpi=150, bbox_inches='tight')
plt.show()

(figure: results_03_per_metric.png, per-metric breakdown by strategy at K = 10)

# ── NDCG vs Hit-Rate scatter ─────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(8, 5))
for _, row in leaderboard[leaderboard['K'] == 10].iterrows():
    ax.scatter(row['hit_rate'], row['ndcg'], s=140, zorder=3)
    ax.annotate(row['strategy'], (row['hit_rate'], row['ndcg']),
                textcoords='offset points', xytext=(6, 3), fontsize=8)
ax.set_xlabel('Hit Rate @ K=10'); ax.set_ylabel('NDCG @ K=10')
ax.set_title('Ranking Quality vs Coverage')
plt.tight_layout()
plt.savefig('results_04_ndcg_hr_scatter.png', dpi=150, bbox_inches='tight')
plt.show()

(figure: results_04_ndcg_hr_scatter.png, NDCG vs hit-rate scatter at K = 10)

# Get ft imp for the stage 2 ranker model
imp_df = pd.DataFrame({'feature'   : STAGE2_FEATURE_COLS,
                       'importance': meta_lgb.feature_importance(importance_type='gain'),}).sort_values('importance', ascending=True)

fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(imp_df['feature'], imp_df['importance'], color = sns.color_palette('husl', len(imp_df)))
ax.set_xlabel('Gain')
ax.set_title('LightGBM Feature Importance (gain) - Stage 2 Meta-Ranker')
plt.tight_layout()
plt.savefig('results_05_feature_importance.png', dpi=150, bbox_inches='tight')
plt.show()

(figure: results_05_feature_importance.png, Stage-2 meta-ranker LightGBM feature importance by gain)

%%time

# Repopulate score caches for eval_df users (caches were cleared after the warm benchmark; eval_df is a superset that includes cold-start users too).
cold_sample   = eval_df.sample(n = 2000, random_state = 100)
cold_user_ids = cold_sample['userId'].values

uga_cold  = user_genre_affinity.reindex(cold_user_ids).fillna(0).values.astype('float32')
uga_norms = np.linalg.norm(uga_cold, axis = 1, keepdims = True).clip(min = 1e-9)
s2_cold   = movie_genre_norm @ (uga_cold / uga_norms).T

del uga_cold, uga_norms

taste_cold = np.vstack([user_taste_vec.get(uid, np.zeros(len(genre_cols), dtype='float32')) for uid in cold_user_ids]).astype('float32')
s4_cold = movie_genre_norm @ taste_cold.T
del taste_cold

_s2_score_cache = {uid: s2_cold[:, i] for i, uid in enumerate(cold_user_ids)}
_s4_score_cache = {uid: s4_cold[:, i] for i, uid in enumerate(cold_user_ids)}
del s2_cold, s4_cold; gc.collect()

cold_results = {}

for name, fn in STRATEGIES.items():

    res = evaluate_strategy(fn, cold_sample, K = 10, n = None)
    res['composite'] = float(np.mean([res[m] for m in composite_keys]))
    cold_results[name] = res

_s2_score_cache.clear()
_s4_score_cache.clear()
CPU times: user 27min 38s, sys: 3.06 s, total: 27min 41s
Wall time: 8min 13s
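The cold-start evaluation above exercises the gate described in section 7. A hypothetical sketch of such a gate, falling back from the warm-user ranker to a cold-start strategy when a user's history is too short; the function names and threshold are illustrative, not the notebook's exact implementation:

```python
# Hypothetical cold-start gate: route short-history users to a fallback strategy
COLD_THRESHOLD = 5  # illustrative minimum history length

def gated_recommend(uid, history_len, warm_fn, cold_fn, K=10):
    fn = cold_fn if history_len < COLD_THRESHOLD else warm_fn
    return fn(uid, K)

recs = gated_recommend('u1', history_len=2,
                       warm_fn=lambda u, k: ['warm'] * k,
                       cold_fn=lambda u, k: ['popular'] * k, K=3)
# recs == ['popular', 'popular', 'popular']
```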



12. Leaderboard & Takeaways

# Master
for k_val in [5, 10]:

    print(f'{"="*60}')
    print(f'  LEADERBOARD @ K={k_val}')
    print('='*60)
    lb = leaderboard[leaderboard['K'] == k_val].reset_index(drop = True)
    lb.index += 1
    print(lb[['strategy','precision','recall','f1','ndcg','hit_rate','composite']].to_string())
    print(f'\n  Winner @ K={k_val}: {lb.iloc[0]["strategy"]}\n')
============================================================
  LEADERBOARD @ K=5
============================================================
              strategy  precision  recall     f1   ndcg  hit_rate  composite
1  S7: Master Ensemble     0.0727  0.0608 0.0552 0.0875    0.2710     0.1230
2          S6: XGBoost     0.0707  0.0580 0.0530 0.0828    0.2630     0.1186
3          S3: Item-CF     0.0657  0.0538 0.0492 0.0808    0.2375     0.1094
4       S1: Popularity     0.0384  0.0298 0.0275 0.0457    0.1545     0.0671
5         S5: LightGBM     0.0118  0.0100 0.0094 0.0147    0.0560     0.0231
6         S2: Personal     0.0016  0.0010 0.0011 0.0016    0.0075     0.0029
7         S4: Temporal     0.0011  0.0006 0.0006 0.0014    0.0055     0.0021

  Winner @ K=5: S7: Master Ensemble

============================================================
  LEADERBOARD @ K=10
============================================================
              strategy  precision  recall     f1   ndcg  hit_rate  composite
1  S7: Master Ensemble     0.0648  0.1038 0.0666 0.0992    0.3930     0.1652
2          S6: XGBoost     0.0622  0.1001 0.0641 0.0939    0.3905     0.1616
3          S3: Item-CF     0.0559  0.0893 0.0574 0.0882    0.3475     0.1452
4       S1: Popularity     0.0323  0.0492 0.0324 0.0486    0.2270     0.0893
5         S5: LightGBM     0.0088  0.0139 0.0091 0.0151    0.0805     0.0296
6         S2: Personal     0.0013  0.0016 0.0012 0.0017    0.0120     0.0041
7         S4: Temporal     0.0011  0.0016 0.0011 0.0016    0.0110     0.0038

  Winner @ K=10: S7: Master Ensemble
# Optional head-to-head: S7 vs S5 (kept disabled)
#print('='*60)
#print('S7 vs S5 -- head-to-head')
#print('='*60)

#for k_val in [5, 10]:

#    r5 = all_results[('S5: LightGBM',        k_val)]
#    r7 = all_results[('S7: Master Ensemble', k_val)]

#    print(f'\n  @ K={k_val}')
#    print(f'  {"Metric":<12}  {"S5":>8}  {"S7":>8}  {"Delta":>8}  {"% chg":>8}')
#    print(f'  {"-"*50}')

#    for mk in metric_keys:

#        v5, v7 = r5[mk], r7[mk]
#        d   = v7 - v5
#        pct = d / v5 * 100 if v5 > 0 else 0.0
#        arr = 'up' if d > 0 else ('dn' if d < 0 else '==')
#        print(f'  {mk:<12}  {v5:8.4f}  {v7:8.4f}  {d:+8.4f}  {arr} {pct:+.2f}%')



Download ml-32m.zip from https://grouplens.org/datasets/movielens/32m/. Cite: Harper & Konstan (2015), ACM TiiS 5(4):19.