EverAnalyzer: A Self-Adjustable Big Data Management Platform Exploiting the Hadoop Ecosystem

Our research paper, “EverAnalyzer: A Self-Adjustable Big Data Management Platform Exploiting the Hadoop Ecosystem”, has been published in the journal Information (Impact Factor: 3.1).

The purpose of this research is to develop and deploy EverAnalyzer, a flexible Big Data management platform capable of automatically gathering, pre-processing, processing, and analyzing both real-time (i.e., streaming) and stored (i.e., batch) data. Most existing Big Data management platforms already support such a pipeline, but they do so by exploiting off-the-shelf technologies and tools. Moreover, the tools these platforms integrate perform standalone tasks, such as individual data processing or individual data analysis. As a result, users of those platforms are tied to specific frameworks, each with its own set of benefits, shortcomings, and limitations. What is needed instead is a system that understands the advantages and disadvantages of the various tools available for managing a given dataset and identifies the optimum tool per case, so that processing and analytical activities complete faster and more efficiently. EverAnalyzer bridges exactly this gap: it automatically recognizes which of the underlying data processing (i.e., MapReduce or Spark) and data analysis (i.e., Mahout or MLlib) tools is most suitable for successfully and efficiently processing and analyzing the ingested data. The system’s choice is influenced not only by the amount of data, but also by the execution speed of prior processing and analysis tasks applied to similar data scenarios. As a result, EverAnalyzer can be applied to a wide range of scenarios, better assisting users in both processing and analytical activities and thereby decreasing their overall workload.
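The selection idea described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual implementation: it assumes a simple history of past runs and picks the framework with the lowest average execution time on datasets of comparable size.

```python
# Hypothetical sketch of EverAnalyzer's selection idea: recommend the
# processing tool (e.g., MapReduce vs. Spark) whose prior runs on
# similar-sized datasets were fastest. All names here are illustrative,
# not the platform's real API.

from statistics import mean

def suggest_framework(history, data_size, candidates=("MapReduce", "Spark")):
    """history: list of (framework, dataset_size, execution_seconds) records."""
    scores = {}
    for fw in candidates:
        # Only consider prior runs on datasets of comparable size (within 2x),
        # reflecting that the choice depends on data volume, not just speed.
        similar = [t for f, size, t in history
                   if f == fw and 0.5 * data_size <= size <= 2 * data_size]
        if similar:
            scores[fw] = mean(similar)
    if not scores:
        return candidates[0]  # no relevant history yet: fall back to a default
    return min(scores, key=scores.get)  # fastest on comparable data wins

history = [
    ("MapReduce", 100, 42.0), ("Spark", 110, 18.5),
    ("MapReduce", 500, 200.0), ("Spark", 480, 75.0),
]
print(suggest_framework(history, 120))  # Spark
```

The same scheme applies unchanged to the analysis layer, with Mahout and MLlib as the candidate frameworks.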

To verify all of the above, the platform was evaluated through an experiment assessing EverAnalyzer’s capability to provide empirical suggestions to its users about the best framework for the operations they wish to perform. Data were collected from thirty (30) distinct datasets related to various diseases and conditions in the healthcare sector. The data were pre-processed, processed, and analyzed, while EverAnalyzer suggested the most suitable framework (MapReduce or Spark for processing tasks, and Mahout or MLlib for analysis tasks, respectively) based on the shortest execution time for the requested processing/analysis task. All of the platform’s suggestions were then compared against the framework that actually achieved the best execution time of the two candidate tools, revealing that EverAnalyzer made a correct recommendation 80% of the time. Moreover, as the number of datasets increased, this percentage climbed monotonically: each performed processing/analysis task further trains EverAnalyzer to produce better and more representative results. Hence, with a larger number of datasets, the proportion of correct recommendations is expected to rise, raising the overall platform’s accuracy above 80%.
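The accuracy measure used in this evaluation can be sketched as below. This is an illustrative reconstruction with made-up timings, assuming each trial records the platform's suggestion together with both frameworks' measured execution times; the 80% figure comes from the paper's own thirty-dataset experiment, not from this toy data.

```python
# Hedged sketch of the evaluation above: for each task, compare the
# suggested framework with the one that actually ran fastest, and report
# the fraction of correct recommendations. Timings are invented examples.

def recommendation_accuracy(runs):
    """runs: list of (suggested, mapreduce_seconds, spark_seconds) tuples."""
    correct = 0
    for suggested, t_mr, t_sp in runs:
        actual_best = "MapReduce" if t_mr < t_sp else "Spark"
        correct += suggested == actual_best
    return correct / len(runs)

runs = [
    ("Spark", 40.0, 25.0),      # correct: Spark was faster
    ("Spark", 30.0, 35.0),      # wrong: MapReduce was faster
    ("MapReduce", 12.0, 20.0),  # correct
    ("MapReduce", 15.0, 11.0),  # wrong
    ("Spark", 50.0, 22.0),      # correct
]
print(recommendation_accuracy(runs))  # 0.6
```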
