Modeling Heart Disease with AI

8 minute read


In this experiment, we use a Python library for hyperparameter optimization, Optuna, to tune several machine learning algorithms and maximize their out-of-sample scores when detecting heart disease from metabolic indicators.

Machine Learning: An Approach to Choosing Model Parameters

Introduction: In this notebook, we use the Optuna library to run hyperparameter-tuning experiments and apply the resulting settings to several machine learning algorithms.

πŸ«€ Predicting Heart Disease- A Machine Learning Approach to Health


In this notebook, we apply optimization techniques and feature engineering to a dataset of patients' metabolic information, aiming to improve model scores and to assess whether these measurements can indicate that an individual has, or could develop, heart disease.


The dataset features are as follows:

  • 🧓 Age
  • 🚹 Sex
  • 💔 Chest pain type
  • 💉 BP
  • 🧈 Cholesterol
  • 🍬 FBS over 120
  • 📈 EKG results
  • ❤️ Max HR
  • 🏃 Exercise angina
  • 📉 ST depression
  • ⛰️ Slope of ST
  • 🩸 Number of vessels fluro
  • 🧬 Thallium
  • 🎯 Heart Disease

Each feature is derived from well-established cardiological diagnostics. Together they capture multiple physiological dimensions of coronary artery disease:

πŸ§“ Demographics

  • Age
  • Sex

🧈 Metabolic Risk

  • Cholesterol
  • Fasting Blood Sugar
  • Blood Pressure

❀️ Symptom Profiles

  • Chest Pain Type
  • Exercise-induced Angina

πŸ“ˆ Functional Testing

  • EKG Results
  • ST Depression
  • ST Slope
  • Max Heart Rate

🩻 Imaging Diagnostics

  • Fluoroscopy Vessels
  • Thallium Stress Test
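For later column slicing, the grouping above can be captured in a plain dict (the variable name `feature_groups` is hypothetical; the column names match the dataset):

```python
# Hypothetical grouping of the 13 predictors, mirroring the breakdown above
feature_groups = {
    "demographics": ["Age", "Sex"],
    "metabolic": ["Cholesterol", "FBS over 120", "BP"],
    "symptoms": ["Chest pain type", "Exercise angina"],
    "functional": ["EKG results", "ST depression", "Slope of ST", "Max HR"],
    "imaging": ["Number of vessels fluro", "Thallium"],
}

# Flatten to confirm the groups cover all 13 predictors exactly once
all_features = [col for cols in feature_groups.values() for col in cols]
```

A dict like this makes it easy to select one physiological dimension at a time, e.g. `df[feature_groups["metabolic"]]`.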



πŸ’» Setup and Environment

# Install packages
!pip install kaggle kagglehub matplotlib seaborn lazypredict ipywidgets jupyter_contrib_nbextensions optuna catboost -q
# Import libraries
import numpy as np
import pandas as pd
from IPython.display import display, HTML
import os
from sklearn.model_selection import train_test_split
import lazypredict
from lazypredict.Supervised import LazyClassifier
import subprocess
import json
import zipfile
from io import BytesIO
import xgboost as xgb
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import catboost as cb
import io
from contextlib import redirect_stdout
# Download the data from source
!kaggle competitions download -c playground-series-s6e2 --force
Downloading playground-series-s6e2.zip to /Users/anon/Downloads
  0%|                                               | 0.00/10.2M [00:00<?, ?B/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10.2M/10.2M [00:00<00:00, 1.29GB/s]
!pwd
/Users/anon/Downloads
print('Loading data..')

# Read the data
zf = zipfile.ZipFile('/Users/anon/Downloads/playground-series-s6e2.zip') 
df = pd.read_csv(zf.open('train.csv')).drop('id', axis = 1)
test = pd.read_csv(zf.open('test.csv')) #.drop('id', axis = 1)
Loading data..
df.head()
|   | Age | Sex | Chest pain type | BP  | Cholesterol | FBS over 120 | EKG results | Max HR | Exercise angina | ST depression | Slope of ST | Number of vessels fluro | Thallium | Heart Disease |
|---|-----|-----|-----------------|-----|-------------|--------------|-------------|--------|-----------------|---------------|-------------|-------------------------|----------|---------------|
| 0 | 58  | 1   | 4               | 152 | 239         | 0            | 0           | 158    | 1               | 3.6           | 2           | 2                       | 7        | Presence      |
| 1 | 52  | 1   | 1               | 125 | 325         | 0            | 2           | 171    | 0               | 0.0           | 1           | 0                       | 3        | Absence       |
| 2 | 56  | 0   | 2               | 160 | 188         | 0            | 2           | 151    | 0               | 0.0           | 1           | 0                       | 3        | Absence       |
| 3 | 44  | 0   | 3               | 134 | 229         | 0            | 2           | 150    | 0               | 1.0           | 2           | 0                       | 3        | Absence       |
| 4 | 58  | 1   | 4               | 140 | 234         | 0            | 2           | 125    | 1               | 3.8           | 2           | 3                       | 3        | Presence      |
# Map the outcome to numeric
def convert_outcomes(outcome):
    if outcome == 'Absence':
        return 0
    elif outcome == 'Presence':
        return 1

df['Heart Disease'] = df['Heart Disease'].map(convert_outcomes)
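As an aside, the same encoding can be expressed by passing a dict to `.map` instead of a function; a small sketch on an invented frame:

```python
import pandas as pd

# Toy frame standing in for the competition data (values invented)
toy = pd.DataFrame({'Heart Disease': ['Absence', 'Presence', 'Absence']})

# Equivalent one-liner: a dict lookup replaces the if/elif function
toy['Heart Disease'] = toy['Heart Disease'].map({'Absence': 0, 'Presence': 1})
```

Both forms leave unmatched values as NaN, so either works identically here where only the two labels occur.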
# Split the data for training
x = df[[i for i in df.columns if i != 'Heart Disease']]
y = df['Heart Disease']

xtrain, xval, ytrain, yval = train_test_split(x, y, test_size = .2, random_state = 100)
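The 80/20 split can be sanity-checked on toy data. Note that for imbalanced labels, passing `stratify=y` to `train_test_split` preserves class ratios across the split; that is a possible refinement, not what the notebook does:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, mirroring the 20% holdout above
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.2, random_state=100)
# Xtr holds 80 rows for training, Xva the 20 held-out rows for validation
```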



Experimentation πŸ”§

# Define the objective function for Optuna with LightGBM
def objective(trial):

    dtrain = lgb.Dataset(xtrain, label = ytrain)
    
    params = {"objective": "binary",
              "metric": "auc",
              "verbosity": -1,
              "boosting_type": "gbdt",
              "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
              "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
              "num_leaves": trial.suggest_int("num_leaves", 2, 256),
              "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
              "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
              "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
              "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),}

        
    gbm = lgb.train(params, dtrain)
    preds = gbm.predict(xval)
    pred_labels = np.rint(preds)
    auc = roc_auc_score(yval, pred_labels)

    return auc
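One subtlety in the objective above: `np.rint` collapses the predicted probabilities to hard 0/1 labels before scoring, so the study maximizes AUC over thresholded predictions rather than over the raw probability ranking. The difference is visible on a toy example (values invented):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.6, 0.4, 0.9])  # invented predicted probabilities

auc_continuous = roc_auc_score(y_true, p)          # ranks across all thresholds -> 0.75
auc_threshold = roc_auc_score(y_true, np.rint(p))  # hard 0/1 labels only -> 0.5
```

Passing `preds` directly to `roc_auc_score` would score the full ranking; the rounded variant is closer to accuracy at a 0.5 cutoff.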
# Create a study and optimize
print('Initiating Optimization Sequence..')

# Silence optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

# Conduct study
study = optuna.create_study(direction = 'maximize') 
study.optimize(objective, n_trials = 50, n_jobs = 4) 

# Confirm completion
print("Modeling complete..")

# Print the best hyperparameters and AUC
print('Optimized Light GBM Performance- ')
print(f"Best trial value (AUC): {study.best_value}")
print(f"Best hyperparameters: {study.best_params}")
Initiating Optimization Sequence..
Modeling complete..
Optimized Light GBM Performance- 
Best trial value (AUC): 0.8865625304640333
Best hyperparameters: {'lambda_l1': 0.005711951061229914, 'lambda_l2': 0.02701383662599308, 'num_leaves': 162, 'feature_fraction': 0.4779878924499187, 'bagging_fraction': 0.9104828961933018, 'bagging_freq': 4, 'min_child_samples': 79}
# Convert the data for modeling
dtrain = lgb.Dataset(x, label = y)

# Fit
gbm = lgb.train(study.best_params, dtrain)

# Predict
gbmtest = gbm.predict(test.drop(['id'], axis = 1))

# Generate submissions
test['Heart Disease'] = gbmtest

# Write to file
test[['id', 'Heart Disease']].to_csv('heart_disease_test_gbm.csv', index = False)
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.012748 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 419
[LightGBM] [Info] Number of data points in the train set: 630000, number of used features: 13
[LightGBM] [Info] Start training from score 0.448340
print('Uploading scores to Kaggle..')
!kaggle competitions submit -c playground-series-s6e2 -f heart_disease_test_gbm.csv -m "LightGBM heart disease predictions test set"
Uploading scores to Kaggle..
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 6.89M/6.89M [00:04<00:00, 1.62MB/s]
Successfully submitted to Predicting Heart Disease
!kaggle competitions submissions -c playground-series-s6e2
fileName                               date                        description                                                                 status                     publicScore  privateScore  
-------------------------------------  --------------------------  --------------------------------------------------------------------------  -------------------------  -----------  ------------  
heart_disease_test_gbm.csv             2026-03-06 16:43:05.187000  LightGBM heart disease predictions test set                                 SubmissionStatus.COMPLETE  0.95240      0.95401       
submission.csv                         2026-02-24 19:38:04                                                                                     SubmissionStatus.COMPLETE  0.87823      0.87954       
submission.csv                         2026-02-23 20:31:03.743000  ensemblemodeling-heart-disease-predictions | LightGBM * XGBoost *CatBoost   SubmissionStatus.COMPLETE  0.87823      0.87954       
submission.csv                         2026-02-23 20:30:39         StackedEnsemble heart disease predictions test set                          SubmissionStatus.COMPLETE  0.87823      0.87954       
submission.csv                         2026-02-23 20:01:25         StackedEnsemble heart disease predictions test set                          SubmissionStatus.COMPLETE  0.87641      0.87986       
heart_disease_test_gbm.csv             2026-02-22 22:20:26         LightGBM heart disease predictions test set                                 SubmissionStatus.COMPLETE  0.95227      0.95398       
heart_disease_test_gbm.csv             2026-02-22 22:17:42.617000  LightGBM heart disease predictions test set                                 SubmissionStatus.COMPLETE  0.95217      0.95392       
submission.csv                         2026-02-22 22:04:35.347000  LightGBM Featurized heart disease predictions test set                      SubmissionStatus.COMPLETE  0.95230      0.95379       
submission.csv                         2026-02-22 21:51:05.337000  heart-disease-predictions:feature-engineering | .95+ auc lightgbm           SubmissionStatus.COMPLETE  0.95179      0.95351       
submission.csv                         2026-02-22 21:50:42.543000  LightGBM Featurized heart disease predictions test set                      SubmissionStatus.COMPLETE  0.95179      0.95351       
heart_disease_test_gbm.csv             2026-02-22 21:50:25         LightGBM heart disease predictions test set                                 SubmissionStatus.COMPLETE  0.95233      0.95392       
submission.csv                         2026-02-22 21:31:52         LightGBM Featurized heart disease predictions test set                      SubmissionStatus.COMPLETE  0.95212      0.95385       
heart_disease_test_gbm_featurized.csv  2026-02-22 21:11:45         LightGBM Featurized heart disease predictions test set                      SubmissionStatus.COMPLETE  0.95212      0.95385       
heart_disease_test_gbm.csv             2026-02-22 20:08:58.903000  LightGBM heart disease predictions test set                                 SubmissionStatus.COMPLETE  0.95234      0.95403       
submission.csv                         2026-02-22 18:59:36         h2o heart disease predictions test set                                      SubmissionStatus.COMPLETE  0.88452      0.88571       
submission.csv                         2026-02-21 14:04:38.603000  Notebook heart-disease-prediction: comprehensive-modeling                   SubmissionStatus.COMPLETE  0.88460      0.88559       
submission.csv                         2026-02-21 14:04:13.253000  h2o heart disease predictions test set                                      SubmissionStatus.COMPLETE  0.88460      0.88559       
submission.csv                         2026-02-20 17:36:38.107000  Notebook Heart Disease Prediction: Comprehensive Model Eva | Version 5      SubmissionStatus.COMPLETE  0.88472      0.88562       
submission.csv                         2026-02-20 17:36:12.513000  h2o heart disease predictions test set                                      SubmissionStatus.COMPLETE  0.88472      0.88562       
submission.csv                         2026-02-20 16:03:12.007000  Notebook Heart Disease Prediction: Comprehensive Model Eva | Version 4      SubmissionStatus.COMPLETE  0.88468      0.88564       
submission.csv                         2026-02-20 16:02:44.473000  h2o heart disease predictions test set                                      SubmissionStatus.COMPLETE  0.88468      0.88564       
heart_disease_test_h2o.csv             2026-02-20 15:45:56         h2o heart disease predictions test set                                      SubmissionStatus.COMPLETE  0.88463      0.88568       
heart_disease_test_h2o.csv             2026-02-20 15:30:25         h2o heart disease predictions test set                                      SubmissionStatus.COMPLETE  0.88460      0.88569       
heart_disease_test_h2o.csv             2026-02-19 16:58:11.307000  h2o heart disease predictions test set                                      SubmissionStatus.COMPLETE  0.88461      0.88563       
heart_disease_test_h2o.csv             2026-02-19 16:37:17.767000  h2o heart disease predictions test set                                      SubmissionStatus.COMPLETE  0.88407      0.88527       
heart_disease_test_h2o.csv             2026-02-19 15:41:44.227000  h2o heart disease predictions test set                                      SubmissionStatus.COMPLETE  0.88458      0.88542       
heart_disease_test_gbm.csv             2026-02-18 17:32:39         LightGBM heart disease predictions test set                                 SubmissionStatus.COMPLETE  0.95238      0.95395       
heart_disease_test_catboost.csv        2026-02-18 17:26:48.153000  catboost heart disease predictions test set                                 SubmissionStatus.COMPLETE  0.88391      0.88620       
heart_disease_test_xgb.csv             2026-02-18 17:10:46         xgb heart disease predictions test set                                      SubmissionStatus.COMPLETE  0.88257      0.88532       
heart_disease_test_rf.csv              2026-02-18 17:10:40         rf heart disease predictions test set                                       SubmissionStatus.COMPLETE  0.87568      0.87834       
heart_disease_test_catboost.csv        2026-02-18 14:23:04         catboost heart disease predictions test set                                 SubmissionStatus.COMPLETE  0.88410      0.88643       
heart_disease_test_xgb.csv             2026-02-18 14:18:58         xgb heart disease predictions test set                                      SubmissionStatus.COMPLETE  0.88257      0.88532       
heart_disease_test_rf.csv              2026-02-18 14:14:17.973000  rf heart disease predictions test set                                       SubmissionStatus.COMPLETE  0.87568      0.87834       
# Plot feature importances from the fitted model
lgb.plot_importance(gbm)
<Axes: title={'center': 'Feature importance'}, xlabel='Feature importance', ylabel='Features'>

[Feature importance plot]
