Predictive Modeling of Coronary Artery Disease: Data Preprocessing
Keywords:
Coronary Artery Disease, Machine Learning, Predictive Modeling, Risk Diagnosis, CAD Prediction
Abstract
Coronary Artery Disease (CAD) is a global cardiovascular health problem with ever-growing rates, which underscores the need for convenient, reliable, and efficient diagnostic methods. This study explores the use of machine learning (ML) methods for the early prediction of CAD using the Heart Disease dataset from the UCI Repository. The study addresses the main challenges posed by the dataset: missing or noisy data and an imbalanced target variable. A comprehensive ML pipeline was implemented in which multiple data imputation methods were tested, and the dataset was cleaned and then balanced using SMOTEENN. The balanced dataset was divided using a stratified split and used to train nine ML models, including Logistic Regression (LR), Support Vector Machine (SVM), Naïve Bayes, tree-based models, and custom ensemble models. Hyperparameter tuning was performed using GridSearchCV. The custom Voting Ensemble model achieved the highest accuracy, 96.53%, with an AUC of 0.98, followed by the custom Stacking Ensemble model with 95.95% accuracy. The remaining models all achieved an accuracy of at least 93.18%, which suggests that the preprocessed data were of high quality. The results demonstrate the importance of data quality and the effectiveness of ensemble models in capturing underlying patterns within patient data for clinical applications.
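The pipeline summarized above (imputation, SMOTEENN balancing, stratified split, GridSearchCV tuning, voting ensemble) can be illustrated with a minimal sketch. It is not the authors' implementation. The synthetic data, the median imputer, the three base models, the soft voting rule, and the parameter grid are all assumptions chosen for brevity, and the balancing step runs only if the third-party imbalanced-learn package is installed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

try:  # imbalanced-learn is an optional third-party dependency
    from imblearn.combine import SMOTEENN
except ImportError:
    SMOTEENN = None

# Synthetic stand-in for the patient table (NOT the UCI data):
# an imbalanced binary target with about 5% of values knocked out.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.7, 0.3], random_state=0
)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# 1. Impute missing values (median is one assumed choice among many).
X = SimpleImputer(strategy="median").fit_transform(X)

# 2. Balance the classes with SMOTEENN when the package is available.
if SMOTEENN is not None:
    X, y = SMOTEENN(random_state=0).fit_resample(X, y)

# 3. Stratified train/test split, following the order given in the abstract.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 4. Tune one base model with GridSearchCV (hypothetical grid).
search = GridSearchCV(
    LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=5
)
search.fit(X_train, y_train)

# 5. Combine the tuned model with two others in a soft-voting ensemble.
ensemble = VotingClassifier(
    estimators=[
        ("lr", search.best_estimator_),
        ("nb", GaussianNB()),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

accuracy = accuracy_score(y_test, ensemble.predict(X_test))
auc = roc_auc_score(y_test, ensemble.predict_proba(X_test)[:, 1])
print(f"accuracy={accuracy:.4f}  auc={auc:.4f}")
```

Scores from this sketch come from synthetic data and are not comparable to the figures reported in the abstract.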
License
Copyright (c) 2026 Journal of One Health Advances

This work is licensed under a Creative Commons Attribution 4.0 International License.