Whether traditional relational database management systems are suitable for Big Data applications has long been debated, as such applications need to ingest huge amounts of IoT data while their query processing spans very large datasets. The main drawback of traditional relational databases is that their transaction processing does not scale. As a result, new data management technologies have emerged, commonly categorized as NoSQL datastores, which drop support for ACID transactions (delegating consistency checks and any complex transaction logic to the applications) in exchange for scalability. However, they also lack rich query processing mechanisms: even though they can sustain a high-rate data ingestion flow by relaxing consistency requirements, they are incapable of performing analytics on their own. For this reason, popular analytical frameworks are often layered on top of them, retrieving vast amounts of data from the NoSQL datastores and performing the analysis in memory, which requires vast computational and memory resources. In other words, these frameworks cannot use the database to push such operations down as close to the storage as possible, which is essential for efficient analytical processing.
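The effect of pushdown can be illustrated with a small sketch. The table name and schema below are hypothetical, and SQLite merely stands in for any store with a query engine; the point is the contrast between shipping every row to the client (as an analytics framework must do against a plain key-value store) and letting the engine execute the aggregate next to the data.

```python
import sqlite3

# Toy dataset standing in for ingested IoT readings (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (patient_id INTEGER, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [(i % 10, float(i)) for i in range(1000)])

# Without pushdown: fetch all 1000 rows and aggregate client-side in memory.
rows = conn.execute("SELECT patient_id, value FROM readings").fetchall()
totals = {}
for pid, value in rows:
    totals[pid] = totals.get(pid, 0.0) + value

# With pushdown: the aggregation runs inside the engine and only
# ten result rows are transferred to the client.
pushed = dict(conn.execute(
    "SELECT patient_id, SUM(value) FROM readings GROUP BY patient_id"))
```

Both paths compute the same result, but the pushed-down query moves ten rows instead of a thousand; at Big Data scale this difference dominates the cost of analytics.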
The most popular approach to this problem is to add a data warehouse. The NoSQL datastores then handle the primary ingestion of the data, as their key-value nature and lack of transactional guarantees allow them to scale out horizontally as much as needed to serve the incoming workload; they act as the primary storage for the raw data arriving from various sources at any rate. Since they cannot perform sophisticated query processing, system integrators and architects rely on a data warehouse to do this instead. In such architectures, data is continuously migrated from the NoSQL store to the data warehouse via expensive ETL procedures for data extraction, transformation, and loading. To make things worse, these ETL procedures run periodically in batches (usually overnight, when the system is offloaded), and only afterwards can the data warehouse be used for analytical purposes. The drawback of these architectures is that the data kept in the warehouse is always outdated and does not reflect the current situation. In simpler words, the data analysts perform their analysis over data collected the previous day and cannot extract knowledge from the current view of the newly inserted data. This is a significant obstacle for real-time Business Intelligence (BI), especially in the healthcare sector.
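A minimal sketch of one such batch ETL step is shown below. Both sides are hypothetical stand-ins: a dictionary plays the role of the key-value store holding raw ingested records, and an in-memory SQLite table plays the role of the warehouse; a real deployment would use the actual clients of both systems and a scheduler for the nightly run.

```python
import sqlite3

# Hypothetical stand-ins: a key-value store with raw ingested records
# (values are loosely structured, fields stored as strings) ...
nosql_store = {
    "evt:1": {"patient": "p1", "hr": "72", "ts": "2021-05-01T10:00:00"},
    "evt:2": {"patient": "p2", "hr": "88", "ts": "2021-05-01T10:01:00"},
}
# ... and a relational warehouse table with a fixed, typed schema.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE heart_rate (patient TEXT, bpm INTEGER, observed_at TEXT)")

def run_etl_batch():
    """One periodic batch: extract raw records, transform them, load them."""
    rows = []
    for record in nosql_store.values():                    # extract
        rows.append((record["patient"],
                     int(record["hr"]),                    # transform: cast
                     record["ts"]))
    warehouse.executemany(
        "INSERT INTO heart_rate VALUES (?, ?, ?)", rows)   # load
    warehouse.commit()

run_etl_batch()
```

Until `run_etl_batch` fires again, any record ingested into the key-value store is invisible to warehouse queries, which is exactly the staleness problem described above.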
To deal with this inherent problem, hybrid solutions providing Hybrid Transactional and Analytical Processing (HTAP) have emerged in recent years. These systems must offer both a scalable transactional processing mechanism, to serve incoming data loads at very high rates, and analytical processing, to remove the need for a separate data warehouse. Crucially, they are able to perform such analytics directly over the operational data.
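The key property, analytics over the operational data, can be conveyed with a toy sketch. SQLite is of course not an HTAP engine (a real HTAP system adds scalable distributed transactions and an analytical executor), but it shows the architectural idea: the write path and the analytical read path target the same table, so no ETL step separates them and the aggregate immediately reflects the latest writes.

```python
import sqlite3

# One table serves both workloads (toy illustration only).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE vitals (patient TEXT, hr INTEGER)")

# Transactional path: small, consistent writes as measurements arrive.
with db:  # the context manager wraps the inserts in one transaction
    db.execute("INSERT INTO vitals VALUES ('p1', 70)")
    db.execute("INSERT INTO vitals VALUES ('p1', 90)")

# Analytical path: an aggregate over the very same operational rows,
# seeing the writes above with no intermediate migration step.
avg_hr = db.execute("SELECT AVG(hr) FROM vitals").fetchone()[0]
```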
The main goal of the Big Data Platform and Knowledge Management System of iHelp is to serve as the main data repository of the platform. To this end, it needs to support data ingestion from external sources at very high rates while ensuring data consistency in terms of database transactions. Moreover, as it will be used for data retrieval by the analytical tools, it must offer a rich query processing mechanism so that data processing can be pushed down to the storage level, thus making the execution of the analytical algorithms more efficient. The main background technology for the development of the Big Data Platform and Knowledge Management System will be the LeanXcale ultra-scalable datastore, with its support for Hybrid Transactional and Analytical Processing and its dual SQL/NoSQL interface, which allows combining the benefits of the relational and non-relational worlds.
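The dual-interface idea can be sketched generically. The class below does not reproduce LeanXcale's actual client APIs (which are not described here); it is a hypothetical illustration of a single dataset exposed through both a key-value style write path and a SQL read path, built on stdlib SQLite.

```python
import json
import sqlite3

class DualInterfaceStore:
    """Hypothetical sketch of a dual SQL/NoSQL interface over one dataset.

    NOT LeanXcale's API; it only illustrates that keyed writes and rich
    SQL queries can operate on the same underlying storage.
    """

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, doc TEXT)")

    def put(self, key, doc):
        # NoSQL-style path: a fast keyed upsert, no query planning needed.
        self._db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)",
                         (key, json.dumps(doc)))
        self._db.commit()

    def sql(self, query):
        # SQL-style path: rich query processing over the same rows.
        return self._db.execute(query).fetchall()

store = DualInterfaceStore()
store.put("p1", {"hr": 72})   # high-rate ingestion side
store.put("p2", {"hr": 91})
count = store.sql("SELECT COUNT(*) FROM kv")[0][0]  # analytical side
```

Ingestion pipelines can use the keyed path for throughput while analytical tools query the same data through SQL, which is the combination of benefits the paragraph above refers to.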