The continual evolution of corporate systems has contributed, also through Digital Transformation processes, to the introduction and optimization of new technologies dedicated to business analytics.
The main objective is to govern change within a market experiencing continuous and rapid growth, through the adoption of new technologies including Big Data, Artificial Intelligence and Machine Learning processes, and the increasingly popular corporate Data Lakes.
What is a Data Lake?
At first sight, the Data Lake and the Data Warehouse are similar: both systems have been conceived to make it possible to archive a large amount of data. To better understand the differences between Data Lake and Data Warehouse, we have to analyse their main purposes in greater detail.
The Data Lake is a shared repository which allows large quantities of data to be acquired and archived from heterogeneous systems in their native format, i.e. structured, semi-structured and unstructured raw data. The data can be acquired from legacy systems, such as CRM and ERP, as well as from external sources such as feeds, IoT and social data. The purpose of the Data Lake, however, is to provide a view of data that is not necessarily refined, in support of Data Discovery activities; this makes it best suited to experienced users.
By contrast, the objective of the Data Warehouse is to provide, using tools for Business and Big Data Analytics, a single corporate view: a controlled and certified view built through appropriate ingestion processes, which store only data processed for a specific purpose and/or business process.
One of the key strengths of Data Lakes is the ability to store almost any type of data. This is even more evident when the data are acquired on an hourly or daily basis and stored in a tree structure: think of a file system of “folders” and “subfolders” organised by year, month, day and, if necessary, hour. In a Data Lake, the history can be kept and the data retrieved later without performance degradation, which is not always the case in a Data Warehouse holding a large amount of data.
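The folder layout described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a local file system stands in for the lake storage; the function names (`partition_path`, `land_raw`, `files_for_day`) and the `source/year/month/day/hour` ordering are hypothetical choices made for the example, not a prescribed layout.

```python
from datetime import date, datetime, timezone
from pathlib import Path
import tempfile


def partition_path(root: Path, source: str, ts: datetime) -> Path:
    """Return the hourly folder for data from `source` acquired at `ts`."""
    return (root / source / f"{ts.year:04d}" / f"{ts.month:02d}"
            / f"{ts.day:02d}" / f"{ts.hour:02d}")


def land_raw(root: Path, source: str, ts: datetime,
             name: str, payload: bytes) -> Path:
    """Write the payload unchanged (native format) into its time partition."""
    folder = partition_path(root, source, ts)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / name
    target.write_bytes(payload)  # no parsing, no schema: the bytes are stored as received
    return target


def files_for_day(root: Path, source: str, day: date) -> list:
    """Recover the history of one day by reading only that day's folder."""
    day_dir = (root / source / f"{day.year:04d}"
               / f"{day.month:02d}" / f"{day.day:02d}")
    return sorted(p for p in day_dir.rglob("*") if p.is_file())
```

Because the date is encoded in the path, retrieving one day of history only touches that day's folder, however much data the rest of the lake holds.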
Main differences between Data Lake and Data Warehouse
There are a number of specific features which distinguish a Data Lake from a traditional Data Warehouse system, starting with the type of data acquired and its structure. Let’s sum up the main differences and analyse the most important ones.
| |DATA LAKE|DATA WAREHOUSE|
|---|---|---|
|Structure of the data|Raw (structured, semi-structured and unstructured)|Structured, processed|
|Purpose of the data|Defined or still to be defined (NB: some data may be stored for future use before their purpose has been defined)|Defined|
|Operating method|Schema on read|Schema on write|
|Users|Data Scientists|Business Users|
|Accessibility|High level of accessibility and simple to update|Access and updates more complicated and expensive|
|Storage|Limited costs and distributed storage (potentially expandable on a cloud)|Storage and any review of the ingestion processes are expensive|
Structure of the Data: Raw vs. Processed
As mentioned already, one of the main features of the Data Lake is the ability to obtain raw data (data from individual sources in their native format) without worrying about defining a structure during the acquisition stage: the Data Lakes mainly store raw, non-processed data.
To feed a Data Warehouse, however, a preliminary analysis is needed in order to optimize the acquisition of the data through the classic ETL (Extract, Transform & Load) processes, during which data quality checks can be applied alongside the transformation logic.
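As a rough illustration of such an ETL flow, the sketch below parses raw JSON lines, casts them to a fixed schema, discards records that fail a basic quality check, and loads the rest into a table. Everything specific here is an assumption made for the example: the `orders` table, its columns, the sample records, and the use of an in-memory SQLite database as a stand-in for a real Data Warehouse.

```python
import json
import sqlite3

# Hypothetical raw records, as they might sit untouched in a Data Lake.
RAW_LINES = [
    '{"order_id": "1", "amount": "19.90", "country": "it"}',
    '{"order_id": "2", "amount": "n/a", "country": "DE"}',   # amount is not a number
    '{"order_id": "3", "amount": "5.00"}',                   # country is missing
]


def extract(lines):
    """Extract: parse each raw JSON line into a dict."""
    return [json.loads(line) for line in lines]


def transform(record):
    """Transform: cast to the target schema; return None if a quality check fails."""
    try:
        return (int(record["order_id"]),
                float(record["amount"]),
                record["country"].upper())
    except (KeyError, ValueError):
        return None


def load(conn, rows):
    """Load: insert the cleaned rows into a table whose structure is fixed up front."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, amount REAL NOT NULL, country TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, lines):
    """Run the three steps and return (rows loaded, rows rejected)."""
    transformed = [transform(r) for r in extract(lines)]
    good = [row for row in transformed if row is not None]
    load(conn, good)
    return len(good), len(transformed) - len(good)
```

With the sample data, only the first record satisfies the schema and reaches the table; a Data Lake, by contrast, would keep all three lines exactly as received.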
All this raw data, however, presents a risk: in the absence of adequate data quality, governance and retention policies, Data Lakes may become fragmented silos that undermine the analyses of users and the associated processes. This is more likely than it may appear and must be carefully considered, both at the set-up stage and during maintenance.
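One small piece of such a retention policy can be sketched as follows, assuming the data for each source are stored in year/month/day folders as described earlier. The function name `expired_partitions` and the `keep_days` parameter are hypothetical; the sketch only lists candidate folders and leaves the decision to archive or delete them to the governance process.

```python
from datetime import date, timedelta
from pathlib import Path
import tempfile


def expired_partitions(source_root: Path, today: date, keep_days: int) -> list:
    """List year/month/day folders older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for day_dir in sorted(source_root.glob("[0-9]*/[0-9]*/[0-9]*")):
        if not day_dir.is_dir():
            continue
        try:
            year, month, day = (int(part)
                                for part in day_dir.relative_to(source_root).parts)
            partition_date = date(year, month, day)
        except ValueError:
            continue  # not a valid year/month/day folder: leave it for manual review
        if partition_date < cutoff:
            expired.append(day_dir)
    return expired
```

Folders that do not follow the expected naming are skipped rather than flagged as expired, so a stray directory is never swept up by mistake.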
Users: Data Scientist vs. Business User
Let’s make this point clear immediately: performing analyses on the Data Lake is not for everyone.
It is a fact that the main vendors of Business Intelligence & Analytics tools, such as Microsoft, Oracle and Tableau, to name a few, are working at a fast pace to provide more and more connectors that make the data sources transparent (Data Lake, relational database, Data Warehouse and streaming flows).
Despite this, the fact remains that the Data Lake requires more skills and is therefore aimed at an advanced user.
By contrast, dashboards and reports provided by Data Warehouses and Data Marts can be used by a wider audience, whose main objective is to analyse information on specific business processes and metrics.
Data Lake and Data Warehouse: constraints or opportunities?
In this brief article, we have listed the main features of Data Lakes and Data Warehouses. Having arrived at this point, it is worth asking ourselves: “What is the most suitable solution for my company? What should I choose?”
The answer is: "You do not need to choose!"
Contrary to what you may think, the two technologies are not in competition with each other; instead, they complement each other. In recent years, thanks largely to the consolidation of cloud services (AWS and Azure in particular), the paradigm linked to reporting systems has continued to evolve, introducing new concepts and architectures which are the basis of technologies linked to data lakes, big data and data warehouses. This convergence gave rise to the “Modern Data Warehouse” and the “Real Time Data Warehouse”, which involve the Data Lakes and the Big Data modules in the first level of integration.