
Solving the multicollinearity problem to improve the stability of machine learning algorithms applied to a fully annotated breast cancer dataset

2022 · 54 citations · 4 authors · Informatics in Medicine Unlocked · Open Access

Abstract

Among the different types of cancer, breast cancer is the most common cancer affecting females in Jordan. Recurrent breast cancer after treatment is a significant concern for patients and oncologists. Developing countries such as Jordan suffer from a lack of quality data for computational medicine (CM). This paper discusses the design, construction, and evaluation of an extensive, fully annotated breast cancer dataset extracted from the registry database of the King Hussein Cancer Center (KHCC) in Amman, Jordan. The Jordan Breast Cancer dataset (JBRCA) has 20 attributes and 7562 instances of breast cancer patients and can be considered a valuable resource to motivate future CM research in the country. The study also describes the problems encountered while compiling the dataset. A thorough analysis revealed several issues that had to be remedied before the dataset could be used in machine learning (ML) applications: missing values and outliers, unnormalized and imbalanced data, and multicollinearity between the attributes. Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, which can affect the model's stability; in particular, it makes it difficult to separate the effects of the individual independent variables on the dependent variable. To handle these issues, we deleted missing values and outliers, applied min-max normalization to control the attributes' different scales, and used SMOTE to address the severe class imbalance. We also used the variance inflation factor (VIF) to address the multicollinearity problem. Domain experts from KHCC helped identify the best subset of attributes to remove from the dataset to enhance the stability and performance of the ML algorithms.
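The preprocessing steps described above (min-max normalization and VIF screening) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the JBRCA data is not public here, and the VIF threshold and column names are assumptions. VIF for attribute j is 1 / (1 − R²_j), where R²_j comes from regressing attribute j on all the others.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column of X to the [0, 1] range (min-max normalization)."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

def vif(X):
    """Variance inflation factor per column: VIF_j = 1 / (1 - R^2_j),
    where R^2_j is from regressing column j on all other columns
    (with an intercept)."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic stand-in: two nearly collinear attributes plus one independent one.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)  # highly correlated with x1
x3 = rng.normal(size=200)                          # independent
X = min_max_normalize(np.column_stack([x1, x2, x3]))
print(vif(X))  # columns 0 and 1 show inflated VIFs; column 2 stays near 1
```

A common rule of thumb (an assumption here, not stated in the abstract) is to flag attributes with VIF above 5 or 10 as candidates for removal, which is where the KHCC domain experts' judgment would come in.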
We used classification models such as logistic regression, decision tree, k-nearest neighbors, Gaussian Naive Bayes, multilayer perceptron, and a stacking classifier combining all five classifiers to evaluate the compiled dataset. The stacking classifier outperformed the other base learners based on accuracy, sensitivity, and F1-score rates.
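A stacking ensemble of the five named base learners can be sketched with scikit-learn as below. This is a hedged illustration on synthetic data: the paper's hyperparameters, meta-learner, and SMOTE/normalization steps are not specified in the abstract, so defaults and a logistic-regression meta-learner are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the JBRCA dataset (20 attributes, binary outcome).
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The five base learners named in the abstract.
base_learners = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("knn", KNeighborsClassifier()),
    ("gnb", GaussianNB()),
    ("mlp", MLPClassifier(max_iter=500, random_state=42)),
]

# Stacking: base-learner predictions feed a final (meta) estimator.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
pred = stack.predict(X_test)
print(f"accuracy={accuracy_score(y_test, pred):.3f} "
      f"f1={f1_score(y_test, pred):.3f}")
```

By default, `StackingClassifier` trains the meta-learner on out-of-fold predictions from internal cross-validation, which is what lets the ensemble outperform individual base learners without overfitting to their training-set outputs.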
