TY  - JOUR
AU  - Aubaidan, Bashar Hamad
AU  - Kadir, Rabiah Abdul
AU  - Ijab, Mohamad Taha
PY  - 2024
TI  - A Comparative Analysis of Smote and CSSF Techniques for Diabetes Classification Using Imbalanced Data
JF  - Journal of Computer Science
VL  - 20
IS  - 9
DO  - 10.3844/jcssp.2024.1146.1165
UR  - https://thescipub.com/abstract/jcssp.2024.1146.1165
AB  - Diabetes, a prevalent chronic metabolic disorder, poses a significant burden on healthcare systems worldwide. Accurate and timely diagnosis is crucial for effective management and complication prevention. Machine learning presents a promising solution but often faces challenges due to class imbalance within datasets, particularly the underrepresentation of diabetic cases. To address this issue, we introduce Cluster-based Synthetic Sample Filtering (CSSF), a method that enhances synthetic sample quality through advanced clustering and filtering techniques. Building upon the Synthetic Minority Over-sampling Technique (SMOTE), CSSF strategically generates synthetic samples within clusters while eliminating noisy instances, thereby improving classification accuracy and reliability. Comparative analysis demonstrates CSSF's effectiveness in mitigating class imbalance. Initial models achieved a 67% accuracy rate, which improved to 82% after SMOTE preprocessing. CSSF further elevated accuracy to 90%. Notably, Support Vector Machines (SVM), neural networks (deep learning), and random forest achieved 92% accuracy after CSSF preprocessing. Decision tree and K-Nearest Neighbors (KNN) also demonstrated commendable accuracy after CSSF preprocessing. Crucially, CSSF consistently outperformed SMOTE in precision, recall, and F1-score, highlighting its superiority. Recognizing the importance of ethical AI practices, this study addresses ethical considerations and potential biases in machine learning within healthcare data analysis, promoting fairness, transparency, and responsible AI utilization.
This research underscores the necessity of ethical and effective approaches to address class imbalance in diabetes classification.