An Optimized Pipeline for the Processing of Healthcare Data towards the Creation of Holistic Health Records

SOURCE: 2023 International Conference on Applied Mathematics & Computer Science (ICAMCS); Publisher: IEEE, Published: 20 February 2024

George Manias, Eleftheria Kouremenou, Ainhoa Azqueta Alzúaz, Pavlos Kranas, Fabio Melillo, Dimosthenis Kyriazis


The tremendous increase in the generation, distribution, and utilization of healthcare data over the past decade highlights the need for all stakeholders in the modern healthcare domain to integrate advanced analytical techniques that extract valuable knowledge and insights from these data. By harnessing Big Data, Artificial Intelligence (AI), and Machine Learning (ML), and by conducting in-depth analyses, healthcare organizations can enable personalized healthcare and enhance risk assessment. Consequently, healthcare professionals can identify patients eligible for tailored treatments, leading to time and cost savings. Because healthcare data are collected from different sources and made available in divergent formats, there is a growing demand for approaches and tools that realize the potential of an optimized qualification and standardization of the raw collected data. To this end, this paper introduces a novel pipeline for the cleaning, qualification, and standardization of the collected primary and secondary data types, evaluated in the context of the EU-funded project iHELP and its five real-world use cases. The pipeline integrates three components: the Data Cleaner, the Data Qualifier, and the Data Harmonizer. It is validated by carrying out the data cleaning, qualification, and harmonization tasks on real-world data related to pancreatic cancer. The results show that the collected data are processed consistently irrespective of their source and come out fully cleaned, highly reliable, and interoperable with each other.
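The three-stage flow described above (cleaning, then qualification, then harmonization) can be pictured with a minimal sketch. This is not the iHELP implementation: the function names, the field names (`pid`, `PatientID`, `dx`), the alias list, and the rules applied in each stage are all hypothetical stand-ins, chosen only to show how records from differently formatted sources might pass through three chained components and leave in one common shape.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

# Hypothetical source-specific names for the patient identifier.
ID_ALIASES = ("patient_id", "pid", "PatientID")

# Hypothetical mapping from source-specific field names to a common schema.
FIELD_MAP = {"pid": "patient_id", "PatientID": "patient_id", "dx": "diagnosis"}


def clean(records: List[Record]) -> List[Record]:
    """Stand-in for the Data Cleaner: trim strings, map empty strings
    to None, and drop exact duplicates (assumed rules)."""
    seen = set()
    cleaned: List[Record] = []
    for record in records:
        normalized = {
            key: ((value.strip() or None) if isinstance(value, str) else value)
            for key, value in record.items()
        }
        fingerprint = tuple(sorted(normalized.items(), key=lambda kv: kv[0]))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(normalized)
    return cleaned


def qualify(records: List[Record]) -> List[Record]:
    """Stand-in for the Data Qualifier: keep only records that carry a
    non-empty patient identifier under any known alias (assumed rule)."""
    return [
        record for record in records
        if any(record.get(alias) is not None for alias in ID_ALIASES)
    ]


def harmonize(records: List[Record]) -> List[Record]:
    """Stand-in for the Data Harmonizer: rename source-specific fields
    to the common schema so records from all sources line up."""
    return [
        {FIELD_MAP.get(key, key): value for key, value in record.items()}
        for record in records
    ]


def run_pipeline(records: List[Record]) -> List[Record]:
    """Chain the three stages in the order named in the abstract."""
    return harmonize(qualify(clean(records)))


if __name__ == "__main__":
    raw = [
        {"pid": " 001 ", "dx": "pancreatic cancer"},
        {"pid": " 001 ", "dx": "pancreatic cancer"},  # exact duplicate
        {"PatientID": "002", "dx": " "},               # other source format
        {"pid": "", "dx": "unknown"},                  # no usable identifier
    ]
    for row in run_pipeline(raw):
        print(row)
```

Keeping each stage as a separate function with the same input and output type means the stages can be tested, replaced, or reordered independently, which is one plausible reading of why the paper presents them as three distinct components.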
