An Enhanced Standardization and Qualification Mechanism for Heterogeneous Healthcare Data

SOURCE: 2023 International Conference on Applied Mathematics & Computer Science (ICAMCS); Publisher: IEEE, Published: 20 February 2024

Eleftheria Kouremenou, George Manias, Shabbir Syed-Abdul, Dimosthenis Kyriazis


This study proposes an optimized machine learning (ML) methodology and workflow for examining pancreatic cancer risk factors, using real-world data collected from three different hospitals. The proposed processing and analysis pipeline incorporates data transformation, cleaning, and mapping techniques, such as translating specific values into a common language and calculating average blood test results per patient. The ML models used in this work are supervised learning techniques, such as Random Forest, LightGBM, XGBoost, SVM, and Gradient Boosting, and the analysis considers various risk factors such as demographic characteristics, drug use, surgeries, organ removal, blood values, and the patient's disease history. The models were evaluated and compared in terms of performance, treating characteristics such as age, marital status, gender, and pre-existing diseases as risk factors for pancreatic cancer. The results indicate that ML models offer a robust and comprehensive solution for pancreatic cancer risk prediction across a broad range of variables and risk factors. These models improve the understanding and identification of the key risk factors associated with the development and progression of this rare type of cancer and can serve as powerful tools for healthcare professionals in the fight against pancreatic cancer.
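
As a rough illustration of the two preprocessing steps named above (mapping values to a common vocabulary and averaging blood test results per patient), the following minimal sketch may help. It is not the authors' implementation: the record layout, the `GENDER_MAP` table, and the function names `harmonize_gender` and `average_blood_results` are all hypothetical and chosen only for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical lookup table: each hospital may encode the same value differently.
GENDER_MAP = {
    "m": "male",
    "male": "male",
    "f": "female",
    "female": "female",
}


def harmonize_gender(raw_value):
    """Map a hospital-specific gender code to a common label (assumed vocabulary)."""
    return GENDER_MAP.get(str(raw_value).strip().lower(), "unknown")


def average_blood_results(records):
    """Return the mean value per (patient_id, test) pair.

    `records` is an iterable of dicts with the assumed keys
    'patient_id', 'test', and 'value'.
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[(record["patient_id"], record["test"])].append(record["value"])
    return {key: mean(values) for key, values in grouped.items()}
```

In a real pipeline, the harmonized and averaged values would then be assembled into one feature row per patient before being passed to the supervised models.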

