An Evaluation of Machine Learning Models coupled with Powerful Big Data Techniques in the Case of Pancreatic Cancer

Our research paper, entitled “An Evaluation of Machine Learning Models coupled with Powerful Big Data Techniques in the Case of Pancreatic Cancer”, has been published in the Proceedings of the 2023 International Conference on Applied Mathematics & Computer Science (ICAMCS). The paper was presented on 9 August 2023 in Lefkada, Greece.

This study proposes an optimized machine learning (ML) methodology and workflow for examining pancreatic cancer risk factors, using real-world data collected from three different hospitals. The proposed processing and analysis pipeline incorporates data transformation, cleaning, and mapping techniques, such as translating hospital-specific values into a common vocabulary and calculating average blood test results per patient. The ML models used in this work are supervised learning techniques, namely Random Forest, LightGBM, XGBoost, SVM, and Gradient Boosting. They analyze a range of risk factors, including demographic characteristics, drug use, surgeries, organ removal, blood values, and the patient's disease history. The models were evaluated and compared in terms of performance, considering characteristics such as age, marital status, gender, and pre-existing diseases as risk factors for pancreatic cancer. The results indicate that ML models offer a robust and comprehensive solution for pancreatic cancer risk prediction across a broad range of variables and risk factors. These models improve the understanding and identification of the key risk factors associated with the development and progression of this rare type of cancer, and can act as powerful tools for healthcare professionals in the fight against pancreatic cancer.
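The two preprocessing steps named above, translating values into a common vocabulary and averaging blood test results per patient, can be illustrated with a minimal Python sketch. This is not the code from the paper: the mapping table `SEX_MAP`, the function names, and the record layout `(patient_id, test_name, value)` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from hospital-specific codes to one shared vocabulary.
# Keys are stored lowercased; the real dictionaries used in the study are not shown here.
SEX_MAP = {
    "m": "male", "male": "male", "1": "male",
    "f": "female", "female": "female", "2": "female",
}


def normalize_sex(raw):
    """Translate a raw value into the common vocabulary, or None if unknown."""
    if raw is None:
        return None
    return SEX_MAP.get(raw.strip().lower())


def average_blood_tests(rows):
    """Average each blood test per patient.

    rows: iterable of (patient_id, test_name, value) tuples; value may be None.
    Returns {patient_id: {test_name: mean_value}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for patient_id, test_name, value in rows:
        if value is None:
            continue  # skip missing measurements instead of counting them as zero
        grouped[patient_id][test_name].append(float(value))
    return {
        pid: {test: mean(values) for test, values in tests.items()}
        for pid, tests in grouped.items()
    }
```

Skipping missing measurements, rather than treating them as zero, keeps a single absent lab result from dragging a patient's average down. Whether the study handled missing values this way is an assumption of the sketch.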

The results of our study offer a robust and comprehensive methodology for pancreatic cancer prediction, considering a broad range of variables and employing advanced machine learning models. The key features identified by our models (age, marital status, sex, and health conditions such as hyperlipidemia, type 2 diabetes, peptic ulcer, chronic pancreatitis, cachexia, and gallstones) align with the current scientific understanding of significant risk factors for pancreatic cancer. The importance of big data techniques for health data should also be emphasized: the more variables an analysis includes, the more complex it becomes. The methodology proposed in this paper therefore includes techniques that can handle a large dataset, using Apache Sedona and PySpark to parallelize preprocessing, cleaning, and mapping with custom-made dictionaries and functions.
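The idea behind parallel cleaning and mapping is to split the records into partitions, clean each partition independently with the same dictionary, and merge the partial results. The sketch below shows that pattern using only the Python standard library as a stand-in for PySpark; it is not the paper's implementation. The dictionary `CONDITION_MAP`, its entries, and both function names are hypothetical. Threads in CPython do not speed up CPU-bound work, so the example shows the structure of the computation rather than the performance of a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical custom dictionary: raw condition labels -> common labels.
CONDITION_MAP = {
    "dm2": "type_2_diabetes",
    "t2dm": "type_2_diabetes",
    "cholelithiasis": "gallstones",
    "gallstones": "gallstones",
}


def clean_partition(records):
    """Map raw labels in one partition to common labels and count them."""
    counts = Counter()
    for raw in records:
        label = CONDITION_MAP.get(raw.strip().lower())
        if label is not None:  # unmapped values are dropped in this sketch
            counts[label] += 1
    return counts


def count_conditions(partitions, max_workers=4):
    """Clean every partition independently, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(clean_partition, partitions):
            total.update(partial)
    return total
```

Because each partition is cleaned without reference to the others, the same two functions could in principle be distributed across many workers, which is the property a framework such as PySpark relies on.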
