Comprehensive Network Mapping of Netflix in Python - Part II
This post is a continuation of the previous one, in which we used a dataset posted on Kaggle as fodder for creating a vast network of communities of actors and directors spread around the world. In this post we will explore the network dynamics of that information.

🎬 Mapping the Netflix Cinematic Graph Part II
A Network Science Exploration of Film Relationships
📦 Dataset: Netflix Titles — Kaggle (Shivamb)
In the first section of this project we constructed and visualized a collaboration network derived from the Netflix catalog. Each node in these networks represents a person (either an actor or a director), and edges represent collaborations where individuals worked together on the same title.
While visualization provides an intuitive understanding of the network structure, graph theory allows us to go further. Network analysis enables us to quantify structural properties of the collaboration graph and identify important individuals, patterns of connectivity, and communities within the network.
In this section we will analyze the collaboration network using a variety of tools from network science.
Why Analyze the Network?
Large collaboration networks often contain hidden structural patterns that are difficult to identify through visualization alone. By applying network metrics, we can answer questions such as:
• Who are the most connected individuals in the Netflix collaboration network?
• Which actors or directors act as bridges between otherwise separate communities?
• How densely connected is the network overall?
• Are there clusters corresponding to different film industries or collaboration circles?
Network analysis provides quantitative answers to these questions and helps reveal the underlying structure of the streaming entertainment ecosystem.
Graph Representation
The collaboration network used in this analysis is constructed as a person–person graph.
Nodes represent individuals involved in the production of Netflix titles:
• actors
• directors
Edges represent collaborative relationships:
• two individuals are connected if they appeared together on the same Netflix title
Edge weights represent the number of shared titles between two individuals, allowing us to distinguish occasional collaborations from frequent partnerships.
This structure transforms the Netflix catalog into a social network of creative collaboration.
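Conceptually, the construction can be sketched with a toy catalog (the titles and names below are made up for illustration, not drawn from the real dataset):

```python
import networkx as nx
from itertools import combinations

# Toy catalog: each title lists the people credited on it (hypothetical names)
titles = {
    "Title A": ["Director X", "Actor 1", "Actor 2"],
    "Title B": ["Director X", "Actor 1", "Actor 3"],
}

G_toy = nx.Graph()
for people in titles.values():
    # Connect every pair of people who share a title; bump the
    # weight when the same pair collaborates on another title
    for u, v in combinations(people, 2):
        if G_toy.has_edge(u, v):
            G_toy[u][v]["weight"] += 1
        else:
            G_toy.add_edge(u, v, weight=1)

# Director X and Actor 1 share both titles, so their edge weight is 2
print(G_toy["Director X"]["Actor 1"]["weight"])  # 2
```

The weight accumulation is exactly what lets us separate one-off collaborations from recurring partnerships later on.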
Analysis Roadmap
The following analyses will be performed in this section:
• Global Graph Statistics: basic properties of the network, including size, density, and connected components.
• Centrality Metrics: identifying the most influential or well-connected individuals using measures such as degree centrality, betweenness centrality, and closeness centrality.
• Community Detection: discovering clusters of individuals who frequently collaborate with each other.
• Collaboration Patterns: examining how collaborations are distributed across the network.
• Ego Networks: exploring the collaboration neighborhood surrounding specific actors or directors.
Together, these analyses provide a deeper understanding of how creative collaborations are structured within the Netflix catalog.
!pip install --upgrade colorcet dash gensim datashader "holoviews[recommended]" jupyter_bokeh bokeh python-louvain -q
# Import libraries
import pandas as pd
import collections
import networkx as nx
import colorcet as cc
import plotly.graph_objects as go
from dash import Dash, dcc, html, Input, Output
import random
from IPython.display import IFrame, display, Image
from holoviews.operation.datashader import (datashade, aggregate, dynspread, bundle_graph, split_dataframe, regrid)
from holoviews.element.graphs import layout_nodes
from datashader.layout import forceatlas2_layout, random_layout, circular_layout
import holoviews as hv
import datashader as ds
import datashader.transfer_functions as tf
from datashader.bundling import connect_edges, hammer_bundle
hv.extension("bokeh")
import numpy as np
#from google.colab import drive
#drive.mount('/content/drive')
import time
from holoviews import opts
import math
from scipy.interpolate import splprep, splev
from itertools import combinations
import string
translator = str.maketrans('', '', string.punctuation)
import community as louvain_community
from bokeh.io import output_notebook
output_notebook()
from matplotlib.colors import ListedColormap
import panel as pn
pn.extension()
import colorsys
import matplotlib.colors as mcolors
from typing import List
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from PIL import Image as PILImage
import gc
import io
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
# Functions to work with datashader
cvsopts = dict(plot_height = 800, plot_width = 800)

def nodesplot(nodes, name = None, canvas = None, cat = None):
    if canvas is None:
        eps = 1e-6
        xr = (float(nodes.x.min()) - eps, float(nodes.x.max()) + eps)
        yr = (float(nodes.y.min()) - eps, float(nodes.y.max()) + eps)
        canvas = ds.Canvas(x_range = xr, y_range = yr, **cvsopts)
    # Strip unused categories before mapping
    if cat:
        nodes[cat] = nodes[cat].cat.remove_unused_categories()
    aggregator = None if cat is None else ds.count_cat(cat)
    agg = canvas.points(nodes, 'x', 'y', aggregator)
    if cat:
        cats = list(nodes[cat].cat.categories)
        color_key = dict(zip(cats, make_hex_palette(len(cats))))
        return tf.dynspread(tf.shade(agg, color_key = color_key, name = name))
    return tf.dynspread(tf.shade(agg, cmap = ["#FF3333"], name = name))

def edgesplot(edges, name = None, canvas = None):
    if canvas is None:
        eps = 1e-6
        xr = (float(edges.x.min()) - eps, float(edges.x.max()) + eps)
        yr = (float(edges.y.min()) - eps, float(edges.y.max()) + eps)
        canvas = ds.Canvas(x_range = xr, y_range = yr, **cvsopts)
    return tf.shade(canvas.line(edges, 'x', 'y', agg = ds.count()), name = name)

def graphplot(nodes, edges, name = "", canvas = None, cat = None, pad = 0.025):
    if canvas is None:
        xmin, xmax = nodes.x.min(), nodes.x.max()
        ymin, ymax = nodes.y.min(), nodes.y.max()
        xpad = (xmax - xmin) * pad
        ypad = (ymax - ymin) * pad
        eps = 1e-6
        xr = (float(xmin - xpad) - eps, float(xmax + xpad) + eps)
        yr = (float(ymin - ypad) - eps, float(ymax + ypad) + eps)
        canvas = ds.Canvas(x_range = xr, y_range = yr, **cvsopts)
    nodeplot = nodesplot(nodes, name + " nodes", canvas, cat)
    edgeplot = edgesplot(edges, name + " edges", canvas)
    return tf.stack(edgeplot, nodeplot, how = "over", name = name)
# Function to create a colormap
def make_hex_palette(n):
    """High-contrast palette optimised for dark backgrounds."""
    result = []
    for i in range(n):
        h = i / n
        # lightness=0.65, saturation=0.95 — bright, vivid, dark-bg friendly
        r, g, b = colorsys.hls_to_rgb(h, 0.65, 0.95)
        result.append('#{:02x}{:02x}{:02x}'.format(int(r * 255), int(g * 255), int(b * 255)))
    return result
# Function to visualize colormap
def plot_colortable(hex_colors: List[str]):
    """Creates a colorbar using custom hex colors."""
    cmap = mcolors.ListedColormap(hex_colors)
    plt.figure(figsize = (8, 2), dpi = 150)
    plt.imshow([list(range(len(hex_colors)))], cmap = cmap, aspect = 'auto')
    plt.axis('off')
    plt.show()
def show_ds_images(images, titles, ncols = 2, fig_width = 18, cell_px = 800, bg = 'black', title_color = 'white', title_size = 11):
    """
    Display a list of datashader images with styled titles and background.

    Parameters
    ----------
    images : list of datashader Image objects
    titles : list of str
    ncols : number of columns in the grid
    fig_width : total figure width in inches
    cell_px : resolution to render each image (square)
    bg : hex background colour applied to each image AND the figure
    """
    nrows = -(-len(images) // ncols)
    cell_in = fig_width / ncols
    fig, axes = plt.subplots(nrows, ncols, figsize = (fig_width, cell_in * nrows), facecolor = bg, dpi = 150)
    axes = list(np.atleast_1d(axes).flat)
    for ax, img, title in zip(axes, images, titles):
        # Apply background colour, then upscale with Lanczos for sharpness
        styled = tf.set_background(img, bg)
        pil_img = styled.to_pil()
        pil_img = pil_img.resize((cell_px, cell_px), PILImage.LANCZOS)
        ax.imshow(pil_img, interpolation = 'lanczos')
        ax.set_title(title, color = title_color, fontsize = title_size, fontweight = 'bold', pad = 12, loc = 'center')
        ax.set_facecolor(bg)
        ax.axis('off')
    # Hide any unused axes
    for ax in axes[len(images):]:
        ax.set_visible(False)
    plt.subplots_adjust(wspace = 0.04, hspace = 0.12)
    buf = io.BytesIO()
    plt.savefig(buf, format = 'png', bbox_inches = 'tight', facecolor = bg)
    plt.close()
    buf.seek(0)
    display(Image(data = buf.read()))
    buf.close()

def normalise_layout(df, margin = 0.05):
    """Rescale x, y to [margin, 1 - margin] regardless of FA2 coordinate explosion."""
    df['x'] = (df['x'] - df['x'].min()) / (df['x'].max() - df['x'].min())
    df['y'] = (df['y'] - df['y'].min()) / (df['y'].max() - df['y'].min())
    df['x'] = df['x'] * (1 - 2 * margin) + margin
    df['y'] = df['y'] * (1 - 2 * margin) + margin
    return df
# Functions to map ego networks
def get_ego_network(G, name, radius = 1):
    """Extract the ego network of `name` up to `radius` hops."""
    if name not in G:
        raise ValueError(f'{name!r} not found in graph')
    ego_nodes = nx.ego_graph(G, name, radius = radius).nodes()
    return G.subgraph(ego_nodes).copy()

def ego_stats(G, ego_name):
    """Print summary statistics for an ego network."""
    ego = get_ego_network(G, ego_name)
    n = ego.number_of_nodes()
    e = ego.number_of_edges()
    alters = n - 1
    # Alter density counts only edges among the alters, so exclude the
    # ego's own edges (the ego touches every alter by construction)
    alter_edges = e - alters
    max_edges = alters * (alters - 1) / 2
    alter_density = alter_edges / max_edges if max_edges > 0 else 0
    cc = nx.clustering(ego, ego_name)
    print(f' Ego: {ego_name}')
    print(f' Alters (degree): {alters}')
    print(f' Ego-net edges: {e}')
    print(f' Alter density: {alter_density:.4f}')
    print(f' Local clustering: {cc:.4f}')
    return ego

def plot_ego_network(G, ego_name, ax, title = None, node_size = 80):
    """Draw an ego network on a given matplotlib axis."""
    ego = get_ego_network(G, ego_name)
    pos = nx.spring_layout(ego, seed = 100, k = 0.7)
    node_colors = ['#E50914' if n == ego_name else '#888888' for n in ego.nodes()]
    node_sizes = [node_size * 4 if n == ego_name else node_size for n in ego.nodes()]
    edge_weights = [ego[u][v].get('weight', 1) for u, v in ego.edges()]
    max_w = max(edge_weights) if edge_weights else 1
    edge_widths = [0.5 + 2.0 * (w / max_w) for w in edge_weights]
    nx.draw_networkx_edges(ego, pos, ax = ax, edge_color = '#cccccc', width = edge_widths, alpha = 0.7)
    nx.draw_networkx_nodes(ego, pos, ax = ax, node_color = node_colors, node_size = node_sizes, linewidths = 0.5, edgecolors = 'white')
    nx.draw_networkx_labels(ego, pos, ax = ax, labels = {ego_name: ego_name}, font_size = 8, font_color = 'white', font_weight = 'bold')
    n_alters = ego.number_of_nodes() - 1
    ax.set_title(title or f'{ego_name}\n({n_alters} direct collaborators)', fontsize = 9, fontweight = 'bold')
    ax.axis('off')
# Holoviews
kwargs = dict(width = 1000, height = 1000, xaxis = None, yaxis = None)
opts.defaults(opts.Nodes(**kwargs), opts.Graph(**kwargs))
colors = ['#000000'] + hv.Cycle('Category20').values
# Read the file using pandas
df = pd.read_csv('/Users/anon/Downloads/netflix_titles.csv')
# Split the cells to extract data
df['directors'] = df['director'].apply(lambda l: [] if pd.isna(l) else [i.strip() for i in l.split(",")])
df['categories'] = df['listed_in'].apply(lambda l: [] if pd.isna(l) else [i.strip() for i in l.split(",")])
df['actors'] = df['cast'].apply(lambda l: [] if pd.isna(l) else [i.strip() for i in l.split(",")])
df['countries'] = df['country'].apply(lambda l: [] if pd.isna(l) else [i.strip() for i in l.split(",")])
df.head()
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | directors | categories | actors | countries |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... | [Kirsten Johnson] | [Documentaries] | [] | [United States] |
| 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... | [] | [International TV Shows, TV Dramas, TV Mysteries] | [Ama Qamata, Khosi Ngema, Gail Mabalane, Thaba... | [South Africa] |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... | [Julien Leclercq] | [Crime TV Shows, International TV Shows, TV Ac... | [Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nab... | [] |
| 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... | [] | [Docuseries, Reality TV] | [] | [] |
| 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... | [] | [International TV Shows, Romantic TV Shows, TV... | [Mayur More, Jitendra Kumar, Ranjan Raj, Alam ... | [India] |
# Keep titles that credit both directors and cast
df2 = df[df['directors'].map(len) > 0]
df2 = df2[df2['actors'].map(len) > 0][['title', 'directors', 'actors']]
# ── Dataset composition ──────────────────────────────────────────
print('=== Dataset Overview ===')
print(f"Total titles: {len(df):>7,}")
print(f" Movies: {(df.type=='Movie').sum():>7,}")
print(f" TV Shows: {(df.type=='TV Show').sum():>7,}")
print()
print(f"Titles with cast: {df['actors'].apply(len).gt(0).sum():>7,}")
print(f"Titles with dir.: {df['directors'].apply(len).gt(0).sum():>7,}")
print(f"Titles with both: {(df['actors'].apply(len).gt(0) & df['directors'].apply(len).gt(0)).sum():>7,}")
print()
# Cast size distribution
cast_sizes = df['actors'].apply(len)
dir_sizes = df['directors'].apply(len)
print(f"Avg cast size (titles with cast): {cast_sizes[cast_sizes > 0].mean():.2f}")
print(f"Max cast size: {cast_sizes.max()}")
print(f"Avg directors per title: {dir_sizes[dir_sizes > 0].mean():.2f}")
# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(cast_sizes[cast_sizes > 0], bins = 30, color = '#E50914', edgecolor = 'black', linewidth = 0.4)
axes[0].set_title('Cast Size Distribution', fontweight='bold')
axes[0].set_xlabel('Number of credited cast members')
axes[0].set_ylabel('Number of titles')
type_counts = df['type'].value_counts()
axes[1].bar(type_counts.index, type_counts.values, color=['#E50914', '#333333'], edgecolor='black', linewidth=0.4)
axes[1].set_title('Movies vs. TV Shows', fontweight='bold')
axes[1].set_ylabel('Number of titles')
for i, v in enumerate(type_counts.values):
    axes[1].text(i, v + 20, f'{v:,}', ha='center', fontweight='bold')
plt.tight_layout()
plt.savefig('dataset_overview.png', dpi=150, bbox_inches='tight')
plt.show()
=== Dataset Overview ===
Total titles: 8,807
Movies: 6,131
TV Shows: 2,676
Titles with cast: 7,982
Titles with dir.: 6,173
Titles with both: 5,700
Avg cast size (titles with cast): 8.03
Max cast size: 50
Avg directors per title: 1.13
# Redesign the edges as a people network
people_edges = []
for idx, row in df2.iterrows():
    # Director-actor pairs plus every actor-actor pair from the same title
    combos = [(item1, item2) for item1 in row['directors'] for item2 in row['actors'] if item1 is not None]
    pairs = list(combinations(row['actors'], 2))
    for combo in combos:
        people_edges.append(tuple(sorted(combo)))
    for pair in pairs:
        people_edges.append(tuple(sorted(pair)))
people_edges = pd.DataFrame(people_edges, columns = ['source', 'target'])
# Calculate weighted edges
weighted_people_edges = people_edges.groupby(['source', 'target']).size().reset_index(name = 'weight')
weighted_people_edges.sort_values('weight', ascending = False).head()
# Remove punctuation
weighted_people_edges['source'] = weighted_people_edges['source'].astype(str).apply(lambda x: x.translate(translator))
weighted_people_edges['target'] = weighted_people_edges['target'].astype(str).apply(lambda x: x.translate(translator))
people_nodes = pd.DataFrame(pd.unique(weighted_people_edges[['source', 'target']].values.ravel('K')), columns = ['name'])
# Convert the nodes to indices for mapping
#node_to_idx = {node: i for i, node in enumerate(people_nodes['name'])}
#weighted_people_edges["source"] = weighted_people_edges["source"].map(node_to_idx)
#weighted_people_edges["target"] = weighted_people_edges["target"].map(node_to_idx)
# Filter the graph for meaningful connections
filtered_people_edges = (weighted_people_edges[weighted_people_edges.weight >= 2]).copy()
# Keep nodes that appear in filtered edges
remaining_nodes = pd.unique(filtered_people_edges[['source','target']].values.ravel())
filtered_people_nodes = people_nodes[people_nodes.name.isin(remaining_nodes)].copy()
print('Filtering graph based on edge weights..\n')
print(f"Initial Configuration: \nNodes- {people_nodes.shape[0]}\nEdges- {weighted_people_edges.shape[0]}\n")
print(f"Filtered Result: \nNodes- {filtered_people_nodes.shape[0]}\nEdges- {filtered_people_edges.shape[0]}\n")
Filtering graph based on edge weights..
Initial Configuration:
Nodes- 30798
Edges- 238512
Filtered Result:
Nodes- 5313
Edges- 10714
Global Graph Statistics
Before diving into individual metrics, we assess the global topology of the network. These high-level statistics reveal whether the collaboration graph behaves like a typical social network — sparse, with one dominant connected component — or exhibits more unusual structural properties.
| Metric | Meaning |
|---|---|
| Nodes / Edges | Scale of the network |
| Density | Fraction of all possible edges that actually exist |
| Connected components | Number of isolated sub-graphs |
| Largest component | Size of the dominant cluster |
| Average clustering | Local cliquishness — do your collaborators also collaborate with each other? |
| Average degree | Mean number of direct collaborators per person |
# Build the most expansive graph
graph = nx.from_pandas_edgelist(weighted_people_edges, source = "source", target = "target", edge_attr = "weight")
filtered_graph = nx.from_pandas_edgelist(filtered_people_edges, source = "source", target = "target", edge_attr = "weight")
# Provide a summary
print('Outlook of Unabridged Network-\n')
print("Number of nodes:", graph.number_of_nodes())
print("Number of edges:", graph.number_of_edges())
print("Network density:", round(nx.density(graph), 5))
print("Connected components:", nx.number_connected_components(graph))
print("Largest component size:", len(max(nx.connected_components(graph), key = len)))
print("Average clustering:", round(nx.average_clustering(graph), 5))
print("Average degree:", round(sum(dict(graph.degree()).values()) / graph.number_of_nodes(), 5))
# Provide a summary
print('\nOutlook of Filtered Network-\n')
print("Number of nodes:", filtered_graph.number_of_nodes())
print("Number of edges:", filtered_graph.number_of_edges())
print("Network density:", round(nx.density(filtered_graph), 5))
print("Connected components:", nx.number_connected_components(filtered_graph))
print("Largest component size:", len(max(nx.connected_components(filtered_graph), key = len)))
print("Average clustering:", round(nx.average_clustering(filtered_graph), 5))
print("Average degree:", round(sum(dict(filtered_graph.degree()).values()) / filtered_graph.number_of_nodes(), 5))
Outlook of Unabridged Network-
Number of nodes: 30798
Number of edges: 238510
Network density: 0.0005
Connected components: 525
Largest component size: 27546
Average clustering: 0.82323
Average degree: 15.48867
Outlook of Filtered Network-
Number of nodes: 5313
Number of edges: 10714
Network density: 0.00076
Connected components: 434
Largest component size: 2319
Average clustering: 0.34572
Average degree: 4.03313
Interpreting the Numbers (According to ClaudeAI)
Full graph (weight ≥ 1): With ~30,800 nodes and ~238,500 edges, the raw network is large but remarkably sparse (density ≈ 0.0005, meaning only 0.05% of all possible connections exist). The high average clustering coefficient (~0.82) tells us that when two people share a collaborator, they are very likely to have also worked together — a hallmark of tightly knit ensembles working on the same productions.
Filtered graph (weight ≥ 2): Restricting to recurring collaborations cuts the graph to ~5,300 nodes and ~10,700 edges. The clustering drops to ~0.35, indicating that many one-production cliques have been removed and what remains are genuine repeating professional partnerships. The average degree of ~4 means each person in the filtered network has on average four recurring collaborators.
The 525 isolated components in the full graph (vs. 434 in the filtered) largely represent foreign-language cinema clusters that do not connect to the main Hollywood/international core — a natural reflection of geographically segmented production industries.
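As a quick sanity check, the density of an undirected graph is d = 2E / (N(N − 1)); plugging in the reported full-graph counts recovers the quoted figure:

```python
# Density: fraction of all possible undirected edges that actually exist
N, E = 30798, 238510            # node/edge counts reported above
density = 2 * E / (N * (N - 1))
print(f"{density:.5f}")         # 0.00050, matching the summary
```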
Centrality Metrics
print("Computing centrality metrics...")
start = time.time()
# Get items
nodes = list(filtered_graph.nodes())
# Compute
degree = dict(filtered_graph.degree())
degree_centrality = nx.degree_centrality(filtered_graph)
betweenness = nx.betweenness_centrality(filtered_graph, k = 500, seed = 100)
closeness = nx.closeness_centrality(filtered_graph)
# Compile
metrics = pd.DataFrame({"actor": [n for n in nodes],
"degree": [degree[n] for n in nodes],
"degree_centrality": [degree_centrality[n] for n in nodes],
"betweenness": [betweenness[n] for n in nodes],
"closeness": [closeness[n] for n in nodes]})
end = time.time()
print(f"Code Execution: {round(end - start)} seconds elapsed")
Computing centrality metrics...
Code Execution: 17 seconds elapsed
metrics.head()
| | actor | degree | degree_centrality | betweenness | closeness |
|---|---|---|---|---|---|
| 0 | 50 Cent | 2 | 0.000377 | 0.003513 | 0.054561 |
| 1 | Bruce Willis | 8 | 0.001506 | 0.005596 | 0.062208 |
| 2 | John Cusack | 2 | 0.000377 | 0.003420 | 0.048583 |
| 3 | AC Peterson | 1 | 0.000188 | 0.000000 | 0.000991 |
| 4 | Michael James Regan | 10 | 0.001883 | 0.000003 | 0.001883 |
Component Size Distribution
Most social networks exhibit a giant connected component that dwarfs all others. The plot below confirms this pattern in our filtered network — one component dominates, while the remaining hundreds of components are tiny isolated clusters, typically consisting of a handful of collaborators from a single production house or country.
# Component size distribution
component_sizes = sorted([len(c) for c in nx.connected_components(filtered_graph)], reverse = True)
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
# Left: top 20 components
top20 = component_sizes[:20]
axes[0].bar(range(1, len(top20)+1), top20, color='#E50914', edgecolor='black', linewidth=0.4)
axes[0].set_title('Top 20 Component Sizes', fontweight='bold')
axes[0].set_xlabel('Component rank')
axes[0].set_ylabel('Number of nodes')
axes[0].spines[['top', 'right']].set_visible(False)
for i, v in enumerate(top20[:5]):
    axes[0].text(i+1, v+10, str(v), ha='center', fontsize=8, fontweight='bold')
# Right: histogram of all component sizes (log scale)
axes[1].hist(component_sizes, bins=40, color='#333333', edgecolor='white', linewidth=0.3)
axes[1].set_yscale('log')
axes[1].set_title('Component Size Histogram (log scale)', fontweight='bold')
axes[1].set_xlabel('Component size (nodes)')
axes[1].set_ylabel('Count (log scale)')
axes[1].spines[['top', 'right']].set_visible(False)
plt.suptitle(f'Filtered graph: {len(component_sizes)} components • '
f'Largest = {component_sizes[0]:,} nodes • '
f'Median = {int(np.median(component_sizes))} nodes',
fontsize=10, y=1.02)
plt.tight_layout()
plt.savefig('component_sizes.png', dpi=150, bbox_inches='tight')
plt.show()
Degree Centrality — The Most Prolific Collaborators
Degree centrality is the simplest centrality measure: it counts how many unique people a node is directly connected to, normalised by the maximum possible number of connections. In the Netflix collaboration context, a high-degree actor is one who has repeatedly worked with many different co-stars and directors — the hallmark of a busy, genre-crossing career.
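Concretely, degree centrality is just raw degree divided by n − 1. Cross-checking one row of the metrics table above (n = 5,313 nodes in the filtered graph):

```python
# Degree centrality = degree / (n - 1), the share of possible neighbours
n = 5313        # nodes in the filtered graph
degree = 2      # e.g. the '50 Cent' row in the metrics table
centrality = degree / (n - 1)
print(round(centrality, 6))  # 0.000377, matching the table
```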
# ── Top 20 by degree centrality ──────────────────────────────────
top_degree = metrics.sort_values('degree_centrality', ascending = False).head(20)
fig, ax = plt.subplots(figsize=(11, 6))
colors_bar = ['#E50914' if i < 5 else '#B0B0B0' for i in range(len(top_degree))]
bars = ax.barh(top_degree['actor'][::-1], top_degree['degree_centrality'][::-1], color = colors_bar[::-1], edgecolor = 'white', linewidth = 0.4)
for bar, deg in zip(bars, top_degree['degree'][::-1]):
    ax.text(bar.get_width() + 0.00005, bar.get_y() + bar.get_height()/2, f' {int(deg)} connections', va = 'center', fontsize = 8, color = '#333333')
ax.set_xlabel('Degree Centrality', fontsize=10)
ax.set_title('Top 20 Nodes by Degree Centrality\n'
'(normalised: share of all possible connections)', fontsize = 12, fontweight = 'bold')
ax.spines[['top', 'right']].set_visible(False)
ax.set_xlim(0, top_degree['degree_centrality'].max() * 1.35)
plt.tight_layout()
plt.savefig('degree_centrality.png', dpi=150, bbox_inches='tight')
plt.show()
Betweenness Centrality — The Bridge Builders
A node with high betweenness sits on many shortest paths between other nodes. In a collaboration network this identifies individuals who act as connectors between otherwise disparate communities — for instance, an actor who bridges Bollywood productions with international co-productions. Removing such a node would dramatically fragment the network, making them strategically critical.
Note: High betweenness does not necessarily correlate with high degree. A lesser-known actor can have enormous betweenness simply by being the only link between two distinct clusters.
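This effect is easy to reproduce on a synthetic graph. In a barbell graph (two cliques joined through a single middle node), the bridge node has only two connections yet ends up with the highest betweenness:

```python
import networkx as nx

# Two 5-cliques (nodes 0-4 and 6-10) joined through bridge node 5
B = nx.barbell_graph(5, 1)
bc = nx.betweenness_centrality(B)

bridge = 5
print(B.degree(bridge))               # 2 -- just two connections
print(max(bc, key=bc.get) == bridge)  # True: highest betweenness anyway
```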
# ── Top 20 by betweenness centrality ────────────────────────────
top_between = metrics.sort_values('betweenness', ascending = False).head(20)
fig, ax = plt.subplots(figsize=(11, 6))
colors_bar = ['#E50914' if i < 5 else '#B0B0B0' for i in range(len(top_between))]
ax.barh(top_between['actor'][::-1], top_between['betweenness'][::-1], color = colors_bar[::-1], edgecolor='white', linewidth=0.4)
ax.set_xlabel('Betweenness Centrality', fontsize=10)
ax.set_title('Top 20 Nodes by Betweenness Centrality\n'
'(fraction of shortest paths passing through this node)', fontsize=12, fontweight='bold')
ax.spines[['top', 'right']].set_visible(False)
plt.tight_layout()
plt.savefig('betweenness_centrality.png', dpi=150, bbox_inches='tight')
plt.show()
Closeness Centrality — Well-Positioned Individuals
Closeness centrality measures the inverse average shortest path length from a node to all reachable nodes. A high-closeness individual can reach any other person in the network in fewer hops — they are centrally placed within the collaboration ecosystem. In practice, these are often versatile actors who participate in many different production clusters, making them well-connected globally rather than just locally.
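A path graph illustrates the idea: the centre node has the smallest average distance to everyone else, hence the highest closeness, computed as (n − 1) divided by the sum of its shortest-path distances:

```python
import networkx as nx

# 5-node path 0-1-2-3-4: node 2 sits in the middle
P = nx.path_graph(5)
cc = nx.closeness_centrality(P)

# Node 2's distances are 2+1+1+2 = 6, so closeness = (5-1)/6 ≈ 0.667
print(round(cc[2], 3))       # 0.667
print(max(cc, key=cc.get))   # 2 -- the centre of the path
```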
# ── Top 20 by closeness centrality ─────────────────────────────
top_close = metrics.sort_values('closeness', ascending=False).head(20)
fig, ax = plt.subplots(figsize=(11, 6))
colors_bar = ['#E50914' if i < 5 else '#B0B0B0' for i in range(len(top_close))]
ax.barh(top_close['actor'][::-1], top_close['closeness'][::-1], color = colors_bar[::-1], edgecolor = 'white', linewidth = 0.4)
ax.set_xlabel('Closeness Centrality', fontsize=10)
ax.set_title('Top 20 Nodes by Closeness Centrality\n'
'(inverse average distance to all other nodes)', fontsize = 12, fontweight = 'bold')
ax.spines[['top', 'right']].set_visible(False)
plt.tight_layout()
plt.savefig('closeness_centrality.png', dpi=150, bbox_inches='tight')
plt.show()
Centrality Correlation Analysis
The three centrality measures are related but distinct. Plotting degree against betweenness reveals an interesting structure:
- Most nodes cluster in the bottom-left (low degree, low betweenness) — these are peripheral actors with few collaborations.
- A few nodes extend along the degree axis — highly connected but not necessarily bridges.
- The most interesting outliers appear high on the betweenness axis relative to their degree — these are the true bridge-builders, connecting communities despite modest raw collaboration counts.
# ── Betweenness vs Degree scatter ───────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Scatter: betweenness vs degree_centrality
sc = axes[0].scatter(
metrics['degree_centrality'],
metrics['betweenness'],
c = metrics['closeness'], cmap = 'RdYlGn',
alpha = 0.4, s = 8, linewidths = 0)
plt.colorbar(sc, ax=axes[0], label='Closeness centrality')
# Label top-5 betweenness nodes
top5_b = metrics.nlargest(5, 'betweenness')
for _, row in top5_b.iterrows():
    axes[0].annotate(row['actor'],
                     (row['degree_centrality'], row['betweenness']),
                     fontsize=7, xytext=(4, 4), textcoords='offset points', color='#E50914')
axes[0].set_xlabel('Degree Centrality')
axes[0].set_ylabel('Betweenness Centrality')
axes[0].set_title('Betweenness vs. Degree\n(colour = closeness)', fontweight='bold')
axes[0].spines[['top', 'right']].set_visible(False)
# Scatter: closeness vs degree_centrality
sc2 = axes[1].scatter(
metrics['degree_centrality'],
metrics['closeness'],
c=metrics['betweenness'], cmap='plasma',
alpha=0.4, s=8, linewidths=0)
plt.colorbar(sc2, ax=axes[1], label='Betweenness centrality')
top5_c = metrics.nlargest(5, 'closeness')
for _, row in top5_c.iterrows():
    axes[1].annotate(row['actor'],
                     (row['degree_centrality'], row['closeness']),
                     fontsize=7, xytext=(4, 4), textcoords='offset points', color='#0070e0')
axes[1].set_xlabel('Degree Centrality')
axes[1].set_ylabel('Closeness Centrality')
axes[1].set_title('Closeness vs. Degree\n(colour = betweenness)', fontweight='bold')
axes[1].spines[['top', 'right']].set_visible(False)
plt.suptitle('Centrality Correlation Analysis — Filtered Graph', fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('centrality_scatter.png', dpi=150, bbox_inches='tight')
plt.show()
print('\nPearson correlations between centrality measures:')
print(metrics[['degree_centrality', 'betweenness', 'closeness']].corr().round(4))
Pearson correlations between centrality measures:
degree_centrality betweenness closeness
degree_centrality 1.0000 0.3606 0.1301
betweenness 0.3606 1.0000 0.2464
closeness 0.1301 0.2464 1.0000
Centrality Summary Table
The table below consolidates the top 15 actors/directors across all three centrality metrics and adds a composite rank computed as the mean of their three individual ranks. This gives a holistic view of who dominates the network across multiple measures simultaneously.
# Compute per-metric ranks and a composite rank
metrics['rank_degree'] = metrics['degree_centrality'].rank(ascending=False)
metrics['rank_between'] = metrics['betweenness'].rank(ascending=False)
metrics['rank_close'] = metrics['closeness'].rank(ascending=False)
metrics['composite_rank'] = (metrics['rank_degree'] + metrics['rank_between'] + metrics['rank_close']) / 3

# Create a leaderboard of composite scores to determine aggregate ranking
leaderboard = (metrics.sort_values('composite_rank')
               .head(15)[['actor', 'degree', 'degree_centrality', 'betweenness', 'closeness', 'composite_rank']]
               .reset_index(drop=True))
leaderboard.index += 1
leaderboard.columns = ['Name', 'Degree', 'Degree Centrality', 'Betweenness', 'Closeness', 'Composite Rank']
for col in ['Degree Centrality', 'Betweenness', 'Closeness']:
    leaderboard[col] = leaderboard[col].round(5)
leaderboard['Composite Rank'] = leaderboard['Composite Rank'].round(1)
print('Top 15 Individuals Based on Composite Rank')
leaderboard.sort_values('Composite Rank', ascending=False)
Top 15 Individuals Based on Composite Rank
| Rank | Name | Degree | Degree Centrality | Betweenness | Closeness | Composite Rank |
|---|---|---|---|---|---|---|
| 15 | Scott Aukerman | 40 | 0.00753 | 0.00309 | 0.07260 | 61.7 |
| 14 | John C Reilly | 17 | 0.00320 | 0.01250 | 0.07092 | 55.7 |
| 13 | Alan Cumming | 17 | 0.00320 | 0.01208 | 0.07139 | 54.7 |
| 12 | Samuel L Jackson | 19 | 0.00358 | 0.01132 | 0.06992 | 53.2 |
| 11 | Sarah Smith | 43 | 0.00809 | 0.01931 | 0.06619 | 44.8 |
| 10 | Amy Poehler | 24 | 0.00452 | 0.00860 | 0.06945 | 43.2 |
| 9 | Jada Pinkett Smith | 19 | 0.00358 | 0.02156 | 0.07146 | 38.8 |
| 8 | Cedric the Entertainer | 19 | 0.00358 | 0.03087 | 0.07693 | 30.2 |
| 7 | Lucy Liu | 21 | 0.00395 | 0.03098 | 0.07230 | 27.7 |
| 6 | Adam Sandler | 30 | 0.00565 | 0.00866 | 0.07453 | 23.5 |
| 5 | George Clooney | 30 | 0.00565 | 0.01338 | 0.07367 | 19.2 |
| 4 | Fred Armisen | 24 | 0.00452 | 0.02303 | 0.07646 | 17.5 |
| 3 | Akiva Schaffer | 51 | 0.00960 | 0.01327 | 0.07300 | 12.7 |
| 2 | Martin Lawrence | 31 | 0.00584 | 0.04851 | 0.07424 | 9.7 |
| 1 | Elizabeth Banks | 40 | 0.00753 | 0.03103 | 0.07637 | 5.3 |
Community Detection
Community detection attempts to partition a network into groups of nodes that are more densely connected internally than to the rest of the network. In the context of Netflix collaborations, communities may correspond to:
- National or regional film industries (Bollywood, Korean cinema, etc.)
- Genre-specific production circles (horror, animation, documentary)
- Long-running director–ensemble partnerships
The Louvain Algorithm
We use the Louvain method (python-louvain), a hierarchical modularity-maximisation algorithm that scales to millions of nodes. It works in two phases that are iterated until convergence:
- Local optimisation — each node is assigned to the community of its neighbour that yields the largest modularity gain.
- Network aggregation — communities become super-nodes and the process repeats on the compressed graph.
The output is a partition (a dictionary mapping each node to a community ID) and a modularity score in the range [-1/2, 1], where higher values indicate stronger community structure.
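To make the modularity score concrete before running it on the full graph, here is a toy check (illustrative only, not part of the pipeline): two four-node cliques joined by a single bridge edge score high under the "correct" two-community partition and negative under a partition that cuts across both cliques.

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Two 4-node cliques joined by a single bridge edge — a textbook
# two-community structure (13 edges total).
G = nx.Graph()
G.add_edges_from(nx.complete_graph(range(0, 4)).edges())
G.add_edges_from(nx.complete_graph(range(4, 8)).edges())
G.add_edge(3, 4)  # the only inter-community edge

good_split = [set(range(0, 4)), set(range(4, 8))]   # matches the cliques
bad_split = [{0, 1, 4, 5}, {2, 3, 6, 7}]            # cuts across both cliques

print(f'Good partition: Q = {modularity(G, good_split):.3f}')  # ≈ 0.423
print(f'Bad partition:  Q = {modularity(G, bad_split):.3f}')   # negative
```

The same `modularity` routine underlies the python-louvain score reported below, so a value above ~0.4 on the real graph signals genuinely strong community structure.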
# Run Louvain on the largest connected component (GCC); Louvain handles
# disconnected graphs, but a single component keeps the partition comparable
gcc_nodes = max(nx.connected_components(filtered_graph), key = len)
gcc = filtered_graph.subgraph(gcc_nodes).copy()
print(f'Largest Connected Component: {gcc.number_of_nodes():,} nodes, {gcc.number_of_edges():,} edges')
print('Running Louvain community detection...')
start = time.time()
partition = louvain_community.best_partition(gcc, weight = 'weight', random_state = 100)
modularity = louvain_community.modularity(partition, gcc, weight='weight')
n_communities = len(set(partition.values()))
end = time.time()
print(f'\nCode Execution Completed in {round(end-start)} seconds')
print(f'Number of communities detected: {n_communities}')
print(f'Modularity score: {modularity:.4f}')
Largest Connected Component: 2,319 nodes, 4,946 edges
Running Louvain community detection...
Code Execution Completed in 0 seconds
Number of communities detected: 47
Modularity score: 0.9246
Community Size Distribution
Like many real-world networks, collaboration communities follow a skewed distribution — a few large communities contain the bulk of the nodes (likely corresponding to major production ecosystems), while many small communities represent niche or localised collaboration circles.
# Get the size of each community
comm_sizes = collections.Counter(partition.values())
size_series = pd.Series(comm_sizes).sort_values(ascending=False)
print(f'Total communities: {n_communities}')
print(f'Largest community: {size_series.iloc[0]:,} nodes')
print(f'Median size: {int(size_series.median())} nodes')
print(f'Communities with 1 node: {(size_series == 1).sum()}')
print(f'Communities with >50 nodes: {(size_series > 50).sum()}')
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(13, 4))

# Top 30 community sizes
top_comms = size_series.head(30)
palette = make_hex_palette(len(top_comms))
axes[0].bar(range(1, len(top_comms)+1), top_comms.values,
            color=palette, edgecolor='white', linewidth=0.3)
axes[0].set_title('Top 30 Community Sizes', fontweight='bold')
axes[0].set_xlabel('Community rank')
axes[0].set_ylabel('Number of nodes')
axes[0].spines[['top', 'right']].set_visible(False)
for i, v in enumerate(top_comms.values[:5]):
    axes[0].text(i+1, v+5, str(v), ha='center', fontsize=8, fontweight='bold')
# Histogram of all sizes
axes[1].hist(size_series.values, bins=40, color='#E50914', edgecolor='white', linewidth=0.3)
axes[1].set_yscale('log')
axes[1].set_title('Community Size Histogram (log scale)', fontweight='bold')
axes[1].set_xlabel('Community size (nodes)')
axes[1].set_ylabel('Count (log)')
axes[1].spines[['top', 'right']].set_visible(False)
plt.tight_layout()
plt.savefig('community_sizes.png', dpi=150, bbox_inches='tight')
plt.show()
Total communities: 47
Largest community: 194 nodes
Median size: 38 nodes
Communities with 1 node: 0
Communities with >50 nodes: 18
Top Community Members
For each of the largest communities we list the top five members by degree. This helps identify who the central figures of each collaboration circle are.
# Add community labels to metrics DataFrame (GCC nodes only)
metrics['community'] = metrics['actor'].map(partition).fillna(-1).astype(int)
# Work only with GCC members
metrics_gcc = metrics[metrics['community'] >= 0].copy()
# Top community IDs by size
top_comm_ids = size_series.head(10).index.tolist()
print('Top 10 Communities — Central Members (by degree)\n')
for rank, cid in enumerate(top_comm_ids, start=1):
    members = metrics_gcc[metrics_gcc['community'] == cid].nlargest(5, 'degree')
    names = ', '.join(members['actor'].tolist())
    size = size_series[cid]
    avg_deg = metrics_gcc[metrics_gcc['community'] == cid]['degree'].mean()
    print(f'  Community {rank} (id={cid}) | size={size} | avg_degree={avg_deg:.2f}')
    print(f'  Top members: {names}\n')
Top 10 Communities — Central Members (by degree)
Community 1 (id=0) | size=194 | avg_degree=4.14
Top members: Omoni Oboli, Kunle Afolayan, Toyin Abraham, Funke Akindele, Blossom Chukwujekwu
Community 2 (id=17) | size=165 | avg_degree=4.43
Top members: Akiva Schaffer, Scott Aukerman, Adam Sandler, Fred Armisen, Alan Cumming
Community 3 (id=5) | size=108 | avg_degree=4.20
Top members: Yılmaz Erdoğan, Cem Yılmaz, Demet Akbağ, Sermiyan Midyat, Tarık Ünlüoğlu
Community 4 (id=16) | size=96 | avg_degree=3.71
Top members: John Musker, Fred Tatasciore, Paul Reubens, Audrey Wasilewski, Frank Welker
Community 5 (id=35) | size=96 | avg_degree=2.34
Top members: Clint Eastwood, Mike Birbiglia, Alejandro Agresti, Charles Martin Smith, Bruce Willis
Community 6 (id=18) | size=95 | avg_degree=4.98
Top members: Kevin Smith, Cedric the Entertainer, Jada Pinkett Smith, Keanu Reeves, Keenen Ivory Wayans
Community 7 (id=6) | size=93 | avg_degree=2.80
Top members: Taika Waititi, Andy Serkis, Jon Favreau, Alan Rickman, Jackie van Beek
Community 8 (id=8) | size=78 | avg_degree=3.47
Top members: Elizabeth Banks, Liev Schreiber, Jay Baruchel, Seann William Scott, Alison Pill
Community 9 (id=10) | size=77 | avg_degree=4.04
Top members: James Franco, William H Macy, John C Reilly, Will Ferrell, Molly Shannon
Community 10 (id=25) | size=71 | avg_degree=8.55
Top members: Erin Fitzgerald, Kate Higgins, Laura Bailey, Debi Derryberry, Audu Paden
# Visualize the top 15 communities
top_k_ids = set(size_series.head(15).index.tolist())
comm_label = {node: (cid if cid in top_k_ids else 15) for node, cid in partition.items()}
# Build node/edge DataFrames for the GCC
gcc_node_list = list(gcc.nodes())
gcc_edge_list = list(gcc.edges())
node_idx = {n: i for i, n in enumerate(gcc_node_list)}
gcc_nodes_df = pd.DataFrame({'name': gcc_node_list})
gcc_nodes_df['community_label'] = gcc_nodes_df['name'].map(comm_label).fillna(15).astype(int)
gcc_edges_df = pd.DataFrame(
    [(node_idx[u], node_idx[v]) for u, v in gcc_edge_list if u in node_idx and v in node_idx],
    columns=['source', 'target'])
print(f'Running ForceAtlas2 layout on {len(gcc_node_list):,} nodes...')
start = time.time()
layout_df = forceatlas2_layout(gcc_nodes_df, gcc_edges_df, iterations = 200, seed = 100)
layout_df = normalise_layout(layout_df)
gcc_nodes_df['x'] = layout_df['x'].values
gcc_nodes_df['y'] = layout_df['y'].values
end = time.time()
print(f'Layout computed in {round(end-start)} seconds')
Running ForceAtlas2 layout on 2,319 nodes...
Layout computed in 42 seconds
# Build edge line segments DataFrame for datashader
node_pos = gcc_nodes_df.set_index('name')[['x', 'y']].to_dict(orient = 'index')
edge_segs_x, edge_segs_y = [], []
# Iterate over edges, inserting NaN breaks between line segments
for u, v in gcc_edge_list:
    if u in node_pos and v in node_pos:
        edge_segs_x += [node_pos[u]['x'], node_pos[v]['x'], np.nan]
        edge_segs_y += [node_pos[u]['y'], node_pos[v]['y'], np.nan]
edge_lines = pd.DataFrame({'x': edge_segs_x, 'y': edge_segs_y})
# Colour palette: one colour per top-K community + grey for 'other'
palette = make_hex_palette(15) + ['#444444']
gcc_nodes_df['community_str'] = gcc_nodes_df['community_label'].astype(str).astype('category')
eps = 1e-6
xr = (float(gcc_nodes_df.x.min())-eps, float(gcc_nodes_df.x.max())+eps)
yr = (float(gcc_nodes_df.y.min())-eps, float(gcc_nodes_df.y.max())+eps)
# Compile data for visualization input
canvas = ds.Canvas(x_range=xr, y_range=yr, **cvsopts)
img_edges = tf.shade(canvas.line(edge_lines, 'x', 'y', agg=ds.count()), cmap = ['#1a1a2e', '#444477'], name = 'edges')
img_nodes = nodesplot(gcc_nodes_df, name = 'communities', canvas = canvas, cat = 'community_str')
img_final = tf.stack(img_edges, img_nodes, how = 'over', name = 'Community Graph')
# Display
show_ds_images([img_final], ['Netflix Collaboration Network — Top 15 Largest Communities Coloured'], ncols = 1, fig_width = 14, cell_px = 1000)
Intra- vs. Inter-Community Edge Analysis
A good community partition should have most edges within communities (intra) and few edges between communities (inter). The ratio of intra-community edges to total edges is a direct proxy for partition quality.
# ── Intra vs inter community edge count ─────────────────────────
intra, inter = 0, 0
for u, v in gcc.edges():
    if partition.get(u) == partition.get(v):
        intra += 1
    else:
        inter += 1
total = intra + inter
print(f'Intra-community edges: {intra:,} ({100*intra/total:.1f}%)')
print(f'Inter-community edges: {inter:,} ({100*inter/total:.1f}%)')
print(f'Modularity: {modularity:.4f}')

fig, ax = plt.subplots(figsize=(5, 5))
wedges, texts, autotexts = ax.pie(
    [intra, inter],
    labels=['Intra-community', 'Inter-community'],
    autopct='%1.1f%%',
    colors=['#E50914', '#888888'],
    startangle=90,
    wedgeprops={'edgecolor': 'white', 'linewidth': 2})
for t in autotexts:
    t.set_fontsize(12)
    t.set_fontweight('bold')
ax.set_title(f'Edge Distribution — Modularity = {modularity:.3f}', fontweight='bold', fontsize=12)
plt.tight_layout()
plt.savefig('community_edges_pie.png', dpi=150, bbox_inches='tight')
plt.show()
Intra-community edges: 4,726 (95.6%)
Inter-community edges: 220 (4.4%)
Modularity: 0.9246
Collaboration Patterns
Beyond individual metrics and community structure, we can examine the distributional patterns of collaboration across the network. Four key phenomena are worth investigating:
- Degree distribution — does the network follow a power law (scale-free) or a more Poisson-like distribution?
- Edge weight distribution — how often do the same pairs of people collaborate?
- Clustering coefficient distribution — how cliquish are different parts of the network?
- Most recurring collaborations — which actor/director pairs have the highest shared title count?
Degree Distribution — Is This a Scale-Free Network?
Many real-world social networks are scale-free: their degree distribution follows a power law P(k) ~ k^(-γ), where a small number of highly connected hubs co-exist with a long tail of low-degree nodes. Plotted on log-log axes, a power law appears as a straight line.
Scale-free networks arise naturally through preferential attachment — new entrants to the industry are more likely to collaborate with already-prominent figures, amplifying existing advantages.
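Preferential attachment can be sketched with networkx's built-in Barabási–Albert generator (an illustrative aside, separate from the Netflix analysis): compared with an Erdős–Rényi random graph of identical size, the BA graph grows far larger hubs.

```python
import networkx as nx
import numpy as np

# Preferential attachment in miniature: each new node attaches to m
# existing nodes with probability proportional to their current degree.
n, m = 2000, 2
ba = nx.barabasi_albert_graph(n, m, seed=100)
# Random graph with the same node and edge counts, but no preference
er = nx.gnm_random_graph(n, ba.number_of_edges(), seed=100)

for name, g in [('Barabási–Albert', ba), ('Erdős–Rényi', er)]:
    degs = np.array([d for _, d in g.degree()])
    print(f'{name:>16}: max degree = {degs.max()}, mean = {degs.mean():.2f}')
```

The BA graph's maximum degree dwarfs the random graph's even though both have the same mean degree, which is exactly the hub-and-tail signature we test for in the empirical plot below.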
# ── Degree distribution ─────────────────────────────────────────
degrees_all = sorted([d for _, d in graph.degree()], reverse=True)
degrees_filtered = sorted([d for _, d in filtered_graph.degree()], reverse=True)

fig, axes = plt.subplots(1, 2, figsize=(13, 5))
for ax, degs, label, color in [
        (axes[0], degrees_all, 'Full graph (≥1)', '#E50914'),
        (axes[1], degrees_filtered, 'Filtered graph (≥2)', '#333333')]:
    counts = collections.Counter(degs)
    ks = np.array(sorted(counts.keys()))
    pks = np.array([counts[k] for k in ks])
    ax.loglog(ks, pks, 'o', markersize=3, alpha=0.6, color=color, label='Empirical')
    # Fit line through log-log for power-law appearance check
    log_k = np.log10(ks[ks > 0])
    log_pk = np.log10(pks[ks > 0])
    coeffs = np.polyfit(log_k, log_pk, 1)
    fit_y = np.polyval(coeffs, log_k)
    ax.loglog(10**log_k, 10**fit_y, '--', color='black', linewidth=1.2,
              label=f'Power-law fit γ≈{-coeffs[0]:.2f}')
    ax.set_title(f'Degree Distribution — {label}', fontweight='bold')
    ax.set_xlabel('Degree k (log)')
    ax.set_ylabel('P(k) (log)')
    ax.legend(fontsize=9)
    ax.spines[['top', 'right']].set_visible(False)
plt.tight_layout()
plt.savefig('degree_distribution.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Full graph — max degree: {max(degrees_all)}, mean: {np.mean(degrees_all):.2f}, median: {int(np.median(degrees_all))}')
print(f'Filtered — max degree: {max(degrees_filtered)}, mean: {np.mean(degrees_filtered):.2f}, median: {int(np.median(degrees_filtered))}')
Full graph — max degree: 295, mean: 15.49, median: 10
Filtered — max degree: 70, mean: 4.03, median: 2
Edge Weight Distribution
The edge weight records how many Netflix titles two people co-appeared on. Most collaborations are one-off events, while a small number of pairs have worked together on many productions — long-standing director–actor partnerships or franchise ensembles.
# ── Edge weight distribution ─────────────────────────────────────
all_weights = weighted_people_edges['weight'].values
filtered_weights = filtered_people_edges['weight'].values
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
# All edges
axes[0].hist(all_weights, bins=range(1, 25), color='#E50914', edgecolor='white', linewidth=0.4, align='left')
axes[0].set_yscale('log')
axes[0].set_title('Edge Weight Distribution — All Edges', fontweight='bold')
axes[0].set_xlabel('Collaboration count (shared titles)')
axes[0].set_ylabel('Number of pairs (log)')
axes[0].spines[['top', 'right']].set_visible(False)
# Filtered edges (weight ≥ 2)
axes[1].hist(filtered_weights, bins=range(2, int(filtered_weights.max())+2),
             color='#333333', edgecolor='white', linewidth=0.4, align='left')
axes[1].set_yscale('log')
axes[1].set_title('Edge Weight Distribution — Filtered Edges (≥2)', fontweight='bold')
axes[1].set_xlabel('Collaboration count (shared titles)')
axes[1].set_ylabel('Number of pairs (log)')
axes[1].spines[['top', 'right']].set_visible(False)
plt.tight_layout()
plt.savefig('edge_weight_dist.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Weight stats (all edges): max={all_weights.max()}, mean={all_weights.mean():.3f}, median={np.median(all_weights)}')
print(f'Weight stats (filtered edges): max={filtered_weights.max()}, mean={filtered_weights.mean():.3f}, median={np.median(filtered_weights)}')
Weight stats (all edges): max=27, mean=1.062, median=1.0
Weight stats (filtered edges): max=27, mean=2.375, median=2.0
Top Recurring Collaborations
The pairs with the highest edge weights represent the most persistent professional partnerships in the Netflix catalogue — recurring director–actor relationships and long-running ensemble casts.
# ── Most frequent collaborating pairs ───────────────────────────
top_pairs = (weighted_people_edges
             .sort_values('weight', ascending=False)
             .head(25)
             .reset_index(drop=True))
top_pairs.index += 1
top_pairs['pair'] = top_pairs['source'] + ' ↔ ' + top_pairs['target']

fig, ax = plt.subplots(figsize=(11, 7))
bars = ax.barh(top_pairs['pair'][::-1], top_pairs['weight'][::-1],
               color=['#E50914' if w >= 10 else '#B0B0B0' for w in top_pairs['weight'][::-1]],
               edgecolor='white', linewidth=0.4)
for bar, w in zip(bars, top_pairs['weight'][::-1]):
    ax.text(bar.get_width() + 0.05, bar.get_y() + bar.get_height()/2,
            f' {w} titles', va='center', fontsize=8)
ax.set_xlabel('Number of shared titles')
ax.set_title('Top 25 Most Frequent Collaborations', fontsize=12, fontweight='bold')
ax.spines[['top', 'right']].set_visible(False)
ax.set_xlim(0, top_pairs['weight'].max() * 1.25)
plt.tight_layout()
plt.savefig('top_collaborations.png', dpi=150, bbox_inches='tight')
plt.show()
Clustering Coefficient Distribution
The local clustering coefficient of a node measures how many of its neighbours are also connected to each other, expressed as a fraction of the maximum possible connections among them. A value of 1.0 means the node’s neighbourhood forms a perfect clique.
In a film production network, high clustering indicates that an actor tends to work in tight-knit ensembles where cast members repeatedly work together; low clustering indicates someone who collaborates broadly across many different productions without bringing those collaborators together.
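As a quick sanity check of the definition (a toy example, separate from the Netflix graph): a hub with four neighbours, only two of whose six possible neighbour pairs are connected, should have a local clustering coefficient of 2/6 ≈ 0.33.

```python
import networkx as nx

# A hub 'ego' with four neighbours; only two of the C(4,2) = 6 possible
# neighbour–neighbour edges exist, so the local clustering coefficient
# should be 2/6 = 1/3.
G = nx.Graph()
G.add_edges_from([('ego', 'a'), ('ego', 'b'), ('ego', 'c'), ('ego', 'd'),
                  ('a', 'b'), ('c', 'd')])  # two closed triangles
print(nx.clustering(G, 'ego'))  # → 0.3333333333333333
```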
# ── Clustering coefficient distribution ─────────────────────────
clustering_coeffs = nx.clustering(filtered_graph)
cc_values = list(clustering_coeffs.values())
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
# Histogram
axes[0].hist(cc_values, bins=30, color='#E50914', edgecolor='white', linewidth=0.3)
axes[0].set_title('Clustering Coefficient Distribution', fontweight='bold')
axes[0].set_xlabel('Local clustering coefficient')
axes[0].set_ylabel('Number of nodes')
axes[0].axvline(np.mean(cc_values), color='black', linestyle='--', linewidth=1.5,
                label=f'Mean = {np.mean(cc_values):.3f}')
axes[0].legend()
axes[0].spines[['top', 'right']].set_visible(False)

# Clustering vs Degree scatter
degrees_dict = dict(filtered_graph.degree())
cc_df = pd.DataFrame({'degree': [degrees_dict[n] for n in clustering_coeffs],
                      'cc': [clustering_coeffs[n] for n in clustering_coeffs]})
# Bin by degree for a cleaner trend line
cc_df['deg_bin'] = pd.cut(cc_df['degree'], bins=20)
binned = cc_df.groupby('deg_bin', observed=False)['cc'].mean().reset_index()
bin_mids = [iv.mid for iv in binned['deg_bin']]
axes[1].scatter(cc_df['degree'], cc_df['cc'], alpha=0.15, s=6, color='#888888')
axes[1].plot(bin_mids, binned['cc'], color='#E50914', linewidth=2, label='Binned mean')
axes[1].set_title('Clustering vs. Degree', fontweight='bold')
axes[1].set_xlabel('Node degree')
axes[1].set_ylabel('Clustering coefficient')
axes[1].legend()
axes[1].spines[['top', 'right']].set_visible(False)
plt.tight_layout()
plt.savefig('clustering_dist.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Mean clustering coefficient: {np.mean(cc_values):.4f}')
print(f'Fraction with cc=0 (no closed triangles): {sum(1 for v in cc_values if v == 0) / len(cc_values):.3f}')
print(f'Fraction with cc=1.0: {sum(1 for v in cc_values if v == 1.0) / len(cc_values):.3f}')
Mean clustering coefficient: 0.3457
Fraction with cc=0 (no closed triangles): 0.548
Fraction with cc=1.0: 0.243
Ego Networks
An ego network (or 1-hop neighbourhood) is the subgraph consisting of a focal node (the ego), all its direct neighbours (alters), and all edges among those alters. It provides a localised view of a single individual’s collaboration environment.
Ego network analysis reveals:
- How many direct collaborators does this person have?
- How interconnected are those collaborators (do they also work with each other)?
- Are there structural holes — gaps in the ego’s network where they are the only bridge between two otherwise unconnected groups?
We will examine ego networks for several notable individuals selected on the basis of their centrality scores.
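The `get_ego_network` helper used below was defined in Part I; a minimal equivalent (an assumption on my part — the original may differ in details) simply wraps networkx's `ego_graph`:

```python
import networkx as nx

def get_ego_network_sketch(graph, ego, radius=1):
    """Subgraph induced by the ego, its neighbours within `radius` hops,
    and all edges among them. Hypothetical stand-in for Part I's helper."""
    return nx.ego_graph(graph, ego, radius=radius)

# Toy demonstration on a 5-node path 0-1-2-3-4
G = nx.path_graph(5)
ego = get_ego_network_sketch(G, 2)
print(sorted(ego.nodes()))  # → [1, 2, 3]
```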
Selecting Ego Network Subjects
We choose subjects from the top of each centrality ranking to showcase the different structural profiles an ego network can take:
- Top degree — will show a dense, well-connected neighbourhood
- Top betweenness — may show a sparser neighbourhood that bridges two otherwise separate clusters
- Top closeness — will be centrally embedded in the large component
# Select one subject from the top of each centrality ranking
ego_subjects = {
    'Top Degree': metrics.sort_values('degree', ascending=False).iloc[0]['actor'],
    'Top Betweenness': metrics.sort_values('betweenness', ascending=False).iloc[0]['actor'],
    'Top Closeness': metrics.sort_values('closeness', ascending=False).iloc[0]['actor'],
    '2nd Betweenness': metrics.sort_values('betweenness', ascending=False).iloc[1]['actor'],
}

# Show subjects for analysis
print('Selected ego subjects:')
for role, name in ego_subjects.items():
    row = metrics[metrics['actor'] == name].iloc[0]
    print(f'  {role:<20} {name:<30} degree = {int(row.degree)}',
          f'betweenness = {row.betweenness:.5f}, closeness = {row.closeness:.5f}')
Selected ego subjects:
Top Degree Anupam Kher degree = 70 betweenness = 0.00439, closeness = 0.04533
Top Betweenness Martin Lawrence degree = 31 betweenness = 0.04851, closeness = 0.07424
Top Closeness Cedric the Entertainer degree = 19 betweenness = 0.03087, closeness = 0.07693
2nd Betweenness Miguel A Núñez Jr degree = 3 betweenness = 0.03806, closeness = 0.06536
Ego Network Profiles
Each panel below shows one ego network. The central red node is the ego; all other nodes are direct collaborators. Edge thickness scales with collaboration weight.
# ── Plot ego networks side by side ──────────────────────────────
fig, axes = plt.subplots(2, 2, figsize=(16, 14), facecolor='#1a1a2e')
axes = axes.flatten()
for ax in axes:
    ax.set_facecolor('#1a1a2e')
for i, (role, name) in enumerate(ego_subjects.items()):
    if name in filtered_graph:
        ego_g = get_ego_network(filtered_graph, name)
        n_alters = ego_g.number_of_nodes() - 1
        plot_ego_network(filtered_graph, name, axes[i],
                         title=f'{role}: {name}\n({n_alters} direct collaborators)')
    else:
        axes[i].text(0.5, 0.5, f'{name} not in\nfiltered graph',
                     ha='center', va='center', fontsize=10, color='white')
    axes[i].axis('off')
plt.suptitle('Ego Networks — Netflix Collaboration Graph',
             fontsize=15, fontweight='bold', color='white', y=1.01)
plt.tight_layout()
plt.savefig('ego_networks.png', dpi=150, bbox_inches='tight', facecolor='#1a1a2e')
plt.show()
# ── Summarise ego stats for all four subjects ───────────────────
print('Summary Ego Network Statistics\n' + '='*60)
ego_stats_records = []
for role, name in ego_subjects.items():
    if name not in filtered_graph:
        continue
    ego = get_ego_network(filtered_graph, name)
    n_alters = ego.number_of_nodes() - 1
    n_edges = ego.number_of_edges()  # includes the ego–alter edges
    max_edges = n_alters*(n_alters-1)//2 if n_alters > 1 else 1
    # Because ego–alter edges are counted, this ratio can exceed 1
    # for very small ego networks
    alt_density = n_edges / max_edges
    cc = nx.clustering(ego, name)
    total_w = sum(d.get('weight', 1) for _, _, d in ego.edges(data=True))
    ego_stats_records.append({
        'Role': role, 'Name': name,
        'Alters': n_alters,
        'Ego-net edges': n_edges,
        'Alter density': round(alt_density, 4),
        'Local clustering': round(cc, 4),
        'Total collab. weight': total_w,
    })
ego_summary = pd.DataFrame(ego_stats_records).set_index('Role')
ego_summary
Summary Ego Network Statistics
============================================================
| Role | Name | Alters | Ego-net edges | Alter density | Local clustering | Total collab. weight |
|---|---|---|---|---|---|---|
| Top Degree | Anupam Kher | 70 | 222 | 0.0919 | 0.0629 | 545 |
| Top Betweenness | Martin Lawrence | 31 | 53 | 0.1140 | 0.0473 | 111 |
| Top Closeness | Cedric the Entertainer | 19 | 59 | 0.3450 | 0.2339 | 129 |
| 2nd Betweenness | Miguel A Núñez Jr | 3 | 4 | 1.3333 | 0.3333 | 8 |
# Allow the user to type any name and visualise their ego network on the fly
default_names = list(ego_subjects.values())

@interact
def explore_ego(name=widgets.Combobox(value=default_names[0],
                                      options=list(metrics.sort_values('degree', ascending=False)['actor'].head(100)),
                                      description='Person:',
                                      ensure_option=False,
                                      layout=widgets.Layout(width='350px'))):
    if name not in filtered_graph:
        print(f'⚠ {name!r} not found in the filtered graph.')
        return
    ego = get_ego_network(filtered_graph, name)
    fig, ax = plt.subplots(1, 1, figsize=(9, 9), facecolor='#1a1a2e')
    ax.set_facecolor('#1a1a2e')
    plot_ego_network(filtered_graph, name, ax)
    # Overlay the names of the top-8 alters by degree
    pos = nx.spring_layout(ego, seed=42, k=0.7)
    top_alters = sorted([n for n in ego.nodes() if n != name],
                        key=lambda n: ego.degree(n), reverse=True)[:8]
    nx.draw_networkx_labels(ego, pos, ax=ax, labels={n: n for n in top_alters},
                            font_size=7, font_color='#dddddd')
    plt.tight_layout()
    plt.show()
    row = metrics[metrics['actor'] == name]
    if not row.empty:
        r = row.iloc[0]
        print(f'  Degree centrality: {r.degree_centrality:.5f} | Betweenness: {r.betweenness:.5f} | Closeness: {r.closeness:.5f}')
Summary & Key Findings
This notebook applied a comprehensive suite of network analysis tools to the Netflix collaboration graph.
The main findings are:
Global Structure
The raw collaboration network (~30,800 nodes, ~238,500 edges) is sparse (density ≈ 0.0005) with a very high clustering coefficient (~0.82), consistent with the cast of a single title forming a dense local clique. One giant component dominates, connecting ~90% of all nodes through chains of shared productions.
Centrality
The most connected individuals (by degree) are prolific genre-crossing actors who appear in many different productions. The highest betweenness nodes are often directors or actors who bridge distinct regional or genre-based clusters — acting as cross-industry connectors.
Community Structure
Louvain community detection reveals an exceptionally clear modular structure (modularity ≈ 0.92 on the giant component), with communities likely corresponding to national film industries, recurring franchise ensembles, and genre-specific production circles.
Collaboration Patterns
Degree distributions show scale-free behaviour (power-law-like), consistent with preferential attachment: already prominent actors attract more new collaborations. The vast majority of collaborations are one-off events (the median edge weight is 1); the top recurring pairs tend to be director–muse relationships or ensemble franchise casts.
Ego Networks
High-degree egos show dense, well-connected neighbourhoods (high alter density), while high-betweenness egos tend to have sparse neighbourhoods that span different parts of the graph — textbook structural holes in the social capital sense.
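Structural holes can also be quantified directly via Burt's measures, which networkx implements as `effective_size` and `constraint`. A toy sketch (illustrative, not computed on the Netflix graph): a broker whose neighbourhood splits into two disconnected pairs keeps most of its degree as non-redundant contacts.

```python
import networkx as nx

# A broker 'b' bridging two tight pairs that never interact with each other.
G = nx.Graph([('b', 'x1'), ('b', 'x2'), ('x1', 'x2'),   # cluster one
              ('b', 'y1'), ('b', 'y2'), ('y1', 'y2')])  # cluster two

# Effective size: degree minus redundancy among neighbours
# (here 4 - 2*2/4 = 3 by Borgatti's simplified formula).
eff = nx.effective_size(G)['b']
con = nx.constraint(G)['b']
print(f'effective size = {eff:.2f}, constraint = {con:.2f}')
```

Applying the same two functions to the high-betweenness egos above would make the "structural hole" claim measurable rather than purely visual.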
Dataset: Netflix Titles — Kaggle (Shivamb)
Libraries: NetworkX · python-louvain · Datashader · HoloViews · Matplotlib · Plotly




