Recently, we discussed how to frame a problem in data science projects (which is drastically different from traditional BI requirement gathering) and the process and tools related to raw data collection. Now that we know how to ask the right questions to frame the business problem and we have been able to collect the raw data we can count on, it is time to start a crucial activity: Data Processing.
In order to examine data at a high level, the Data Engineer must differentiate the stages of manipulation or processing of the data, so as to propose the architecture and components that best fit the needs of each stage.
What stages are we talking about? To keep the story short and simple, we consider two:
- Data Quality and Cleansing
- Data Storage
The data processing steps we focus on in this post are preparatory to the following steps in the Data Science Process (see the first post of the series), namely Data Exploration and In-Depth Analysis.
Data Quality and Cleansing
Data have a whole set of properties that define them: type, format, access, availability, volume, nature... All these properties will influence the definition of the components of the architecture and the processes to capture and manipulate data.
Starting from raw data, which we want to keep as they are to save a “digital picture” of the data we started with, Data Engineers begin a high-level analysis. They typically search for formal errors such as wrong data types (e.g. I was expecting a date but found a number), inconsistent codes and classifications, or missing values, and then apply techniques to clean the data and enhance its quality (e.g. interpolation to estimate missing values).
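To make the "formal" checks a bit more concrete, here is a minimal sketch in plain Python. It assumes rows arrive as dictionaries and that dates should look like YYYY-MM-DD; the function names (`find_bad_dates`, `interpolate_missing`) and the `order_date` field are made up for illustration, and linear interpolation is just one of several ways to estimate missing values.

```python
from datetime import datetime
from typing import Optional


def parse_date(value) -> Optional[datetime]:
    """Return a datetime if value is a YYYY-MM-DD string, otherwise None."""
    try:
        return datetime.strptime(value, "%Y-%m-%d")
    except (TypeError, ValueError):
        # TypeError: not a string at all; ValueError: wrong format or impossible date
        return None


def find_bad_dates(rows, field):
    """Return the indexes of rows whose `field` is missing or not a valid date."""
    return [i for i, row in enumerate(rows) if parse_date(row.get(field)) is None]


def interpolate_missing(values):
    """Fill None gaps by linear interpolation between the nearest known values.

    Gaps at the very start or end have no neighbour on one side and stay None.
    The input list is not modified: the raw data are kept as they are.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            fraction = (i - left) / (right - left)
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled


# Hypothetical sample rows: one valid date, one number-like string, one missing value
rows = [{"order_date": "2024-01-31"}, {"order_date": "20240131"}, {"order_date": None}]
bad_rows = find_bad_dates(rows, "order_date")
```

Note that both functions return new objects instead of overwriting the input, in line with the idea of preserving the raw "digital picture".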
Here is where theory and the real world can clash: is there a sharp frontier between the tasks of a Data Engineer and those of a Data Scientist? What I mean is that data cleansing activities can be carried out from a purely “formal” perspective (as in the examples above) or with an eye on the business problem we want to solve. For instance, smoothing peaks and valleys in a data series is generally good practice in forecasting processes, but I could be facing a business case in which peaks are interesting events I want to investigate rather than cut out.
Assuming you have both competences in the team (otherwise you have improper staffing), this is not a critical issue and can be solved with old-fashioned teamwork. Our suggestion is to have Scientists review and support the Data Quality tasks performed by Engineers: at the end of the day, four eyes are better than two.
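The peaks example can be sketched in a few lines. The same series can be smoothed (the forecasting view) or scanned for unusual points (the investigation view); which one is "right" depends on the business question. This is only an illustration: the moving-average window and the two-standard-deviation threshold are arbitrary assumptions, not recommended values.

```python
from statistics import mean, pstdev


def moving_average(series, window=3):
    """Centered moving average; points near the edges use the neighbours available."""
    half = window // 2
    return [mean(series[max(0, i - half): i + half + 1]) for i in range(len(series))]


def flag_peaks(series, threshold=2.0):
    """Return indexes of values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return []  # a flat series has no peaks or valleys
    return [i for i, value in enumerate(series) if abs(value - mu) / sigma > threshold]


# Hypothetical daily sales with one unusual day
sales = [10, 11, 9, 10, 50, 10, 11, 9, 10]
smoothed = moving_average(sales)   # the peak is flattened
peak_days = flag_peaks(sales)      # the peak is kept and reported for investigation
```

An Engineer working alone might apply only the first function; a quick conversation with the Scientist tells the team whether the second one is what the business case actually needs.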
In terms of tools and skills, there are many data integration platforms that include strong quality features (Oracle, Informatica, IBM, Talend - just to name the most commonly used on the market) and shorten the time required to perform specific tasks, especially those aimed at data standardization (e.g. normalization of names, addresses and phone numbers). Apart from these, quite often we see languages and frameworks such as Python, R and Scala used when data cleansing requirements are more “exotic” and condition based. Obviously, Cloud Platforms provide services to support data management activities, so we can implement any kind of architecture: on-premises, full cloud or hybrid.
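As an idea of what a condition-based standardization rule looks like when written by hand, here is a small sketch for phone numbers. The rules are deliberately simplistic, and the default country code "34" and the 7 to 15 digit length limits are assumptions chosen for the example; a real project would rely on a dedicated library or on the quality features of the platforms mentioned above.

```python
import re


def normalize_phone(raw, default_country_code="34"):
    """Return the number as +<country code><digits>, or None if it cannot be parsed."""
    if raw is None:
        return None
    text = raw.strip()
    has_country_code = text.startswith("+")
    digits = re.sub(r"\D", "", text)        # drop spaces, dashes, brackets, letters
    if digits.startswith("00"):             # international prefix written as 00
        digits, has_country_code = digits[2:], True
    if not 7 <= len(digits) <= 15:          # assumed plausibility limits
        return None
    if not has_country_code:
        digits = default_country_code + digits
    return "+" + digits


# Three spellings of the same hypothetical number end up identical
variants = ["+34 600 123 456", "0034-600-123-456", "600 123 456"]
normalized = {normalize_phone(v) for v in variants}
```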
Data Storage
Another important decision that the Data Engineer has to make concerns the repository, that is, where to store the extracted data. There is a variety of options to choose from: files, relational databases, NoSQL databases (column-oriented, document, graph, key-value) and XML stores - just to mention the most common.
One typical dilemma comes when we have to choose between Relational and NoSQL databases.
NoSQL (non-relational) databases are designed to support distributed structures, which generally makes them easier to scale out. They are often faster for the workloads they were designed for, and they tend to be more adaptable and flexible than Relational databases because they do not impose a fixed schema. Moreover, many of them are document-oriented and store information in a folder-like hierarchy, which makes non-relational databases a good option for storing unstructured data such as documents, images, tweets and any kind of data that is difficult to fit into a “column by row” structure.
On the other hand, NoSQL databases are not 100% compatible with standard SQL and, more notably, generally can’t guarantee the same level of data integrity as a Relational database (absolute integrity being the data warehouse nirvana). On top of this, if your analytical requirements are satisfied by structured data and you don’t expect your business to grow exponentially, you could easily find support in a Relational DB.
Not all machine learning solutions need big data (wow, I said it) - it actually depends on the nature of the business problem and consequently on the kind of data you have to work on. The good news is that you can have a data architecture where the two kinds of databases coexist, so that each kind of workload and storage type gets the best support.
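The integrity trade-off is easier to see side by side. The sketch below uses SQLite (bundled with Python) as the relational side, and a plain dictionary of JSON strings as a stand-in for a document store; the stand-in is an assumption made to keep the example self-contained, not a model of any specific NoSQL product. Table and field names are invented.

```python
import json
import sqlite3


def build_relational_db():
    """In-memory relational database with constraints that protect integrity."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customers(id),"
        " amount REAL NOT NULL CHECK (amount >= 0))"
    )
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Acme')")
    conn.commit()
    return conn


def insert_order(conn, customer_id, amount):
    """Return True if the row was stored, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
                (customer_id, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Stand-in for a document store: any JSON-serializable shape is accepted as is
document_store = {}


def save_document(doc_id, document):
    document_store[doc_id] = json.dumps(document)


conn = build_relational_db()
accepted = insert_order(conn, 1, 99.5)    # known customer, valid amount
rejected = insert_order(conn, 42, 10.0)   # customer 42 does not exist

save_document("tweet-1", {"user": "someone", "tags": ["data"], "geo": {"lat": 41.4}})
save_document("order-x", {"customer_id": 42, "amount": 10.0})  # stored, nobody checks
```

The relational side refuses the order for a non-existent customer, while the document side happily stores both the nested tweet and the dangling order. That flexibility is exactly what you want for unstructured data, and exactly what you must compensate for in your own code when integrity matters.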
Current technology allows us to leverage distributed (replicated) storage, which basically consists of keeping copies of the data on different servers. This improves flexibility, performance and scalability, and keeps the data available in the event of system crashes. All this does not require much knowledge on the part of Data Engineers, since it is mainly done transparently by the database itself, especially if we are using Cloud Services like Amazon Aurora, Google Spanner or Microsoft Cosmos DB.
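Just to fix the idea of "copies on different servers", here is a toy, in-memory sketch. It is in no way how the cloud services mentioned above are implemented: the `Replica` and `ReplicatedStore` classes are invented for the example, and real systems also deal with consistency, re-synchronization and network failures, all of which are ignored here.

```python
class Replica:
    """One copy of the data, living on a (pretend) separate server."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


class ReplicatedStore:
    """Writes go to every live replica; reads are served by the first live one."""

    def __init__(self, replica_names):
        self.replicas = [Replica(name) for name in replica_names]

    def put(self, key, value):
        for replica in self.replicas:
            if replica.alive:
                replica.data[key] = value

    def get(self, key):
        for replica in self.replicas:
            if replica.alive:
                return replica.data[key]
        raise RuntimeError("no live replica available")


store = ReplicatedStore(["server-a", "server-b", "server-c"])
store.put("customer:1", {"name": "Acme"})
store.replicas[0].alive = False          # simulate a crash of the first server
survivor_copy = store.get("customer:1")  # still answered, by server-b
```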
Choose the most appropriate architecture for your solution
Once we have the data, the Data Scientist can begin his/her analysis to reach the proposed objectives. To implement these processes, we can use the same programming languages and frameworks used to source the data, or even comprehensive tools such as Knime, RapidMiner, DataRobot, or SAS Visual Data Mining and Machine Learning. But let’s stick to our original purpose and talk about data and processing architecture.
The way in which analytical processes are arranged and executed introduces concepts such as distributed systems and parallel systems. In general terms, the difference is that a parallel system divides a process into tasks that are executed at the same time, while a distributed system divides a process into tasks that are executed in different locations using different resources. Knowing how the processes are executed tells us how much they can benefit from scaling resources.
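A minimal sketch of the "divide a process into tasks" idea, using Python's standard `concurrent.futures` module. The scoring function is a placeholder invented for the example. Here the chunks are handled by a pool of threads on one machine (the parallel flavour); in a distributed system the same chunks would be shipped to workers in different locations, which this sketch does not attempt.

```python
from concurrent.futures import ThreadPoolExecutor


def score_chunk(chunk):
    """Placeholder for real analytical work on one slice of the data."""
    return sum(x * x for x in chunk)


def split(data, parts):
    """Cut data into at most `parts` consecutive chunks of similar size."""
    size = max(1, -(-len(data) // parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_score(data, workers=4):
    """Score each chunk as a separate task and combine the partial results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(score_chunk, chunks))


total = parallel_score(list(range(1000)))
```

For CPU-heavy work in Python, `ProcessPoolExecutor` from the same module offers the same interface with separate processes instead of threads. The useful point for architecture decisions is that the work must be splittable into independent chunks before adding resources can help.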
The results of our analysis and of the execution of our algorithms can be integrated and stored back into the same systems we sourced the input data from, stored in our new repository (potentially a data lake), or even stored in a new, dedicated analytical repository that allows a more agile and quick exploitation model.
It is recommended that Data Engineers build and maintain a data dictionary with the definitions of the sources, loading processes, transformations, KPIs and, in general, all the information we need to collect, since it can be very useful when drawing on this information for subsequent analysis.
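A data dictionary does not need a heavyweight tool to get started. The sketch below shows one possible shape using Python dataclasses; the fields chosen (`source`, `data_type`, `transformations`, and so on) and the sample entry are assumptions for illustration, to be adapted to whatever your team actually needs to record.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DictionaryEntry:
    name: str                  # field or KPI name as used in analysis
    source: str                # system or file the data comes from
    data_type: str
    description: str
    transformations: list = field(default_factory=list)  # steps applied while loading


class DataDictionary:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Add an entry; duplicate names are rejected to avoid conflicting definitions."""
        if entry.name in self._entries:
            raise ValueError(f"'{entry.name}' is already defined")
        self._entries[entry.name] = entry

    def lookup(self, name):
        return self._entries[name]

    def to_json(self):
        """Export all entries so the dictionary can be versioned and shared."""
        return json.dumps([asdict(e) for e in self._entries.values()], indent=2)


dictionary = DataDictionary()
dictionary.register(DictionaryEntry(
    name="monthly_revenue",
    source="orders table (hypothetical)",
    data_type="decimal",
    description="Sum of order amounts per calendar month",
    transformations=["drop cancelled orders", "convert to EUR"],
))
exported = dictionary.to_json()
```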
Now, once Data Scientists have identified the best algorithm to address the business needs and obtained high quality results, it is time to ask ourselves questions regarding visibility, accessibility and reliability of the information we produced. As we stated above, sometimes it’s a matter of “where to store the data” but more and more it’s a matter of “how the users will access the data” and we could face a variety of scenarios.
It could be that the new knowledge inferred by applying Data Science techniques will be used to support the analysis performed by specialized internal teams, but it could also be that the information has to be consumed by third-party applications, even transactional ones (think about an algorithm that suggests the Next Best Offering in a sales process). In these cases, we will need to implement mechanisms and protocols to access the information through communication interfaces like APIs, or develop publication mechanisms using dedicated technologies (Apache Beam, to name one).
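To give a flavour of the API option, here is a minimal sketch of a read-only endpoint that exposes model output to another application, written as a plain WSGI callable with the standard library only. Everything specific in it is hypothetical: the `/offers` path, the `customer_id` parameter and the hard-coded `NEXT_BEST_OFFER` table, which stands in for wherever the real results are stored.

```python
import json
from urllib.parse import parse_qs

# Hypothetical model output: customer id -> suggested offer
NEXT_BEST_OFFER = {"c001": "premium-upgrade", "c002": "loyalty-discount"}


def app(environ, start_response):
    """WSGI application: GET /offers?customer_id=c001 returns the offer as JSON."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    customer_id = params.get("customer_id", [""])[0]

    if environ.get("PATH_INFO") != "/offers":
        status, body = "404 Not Found", {"error": "unknown path"}
    elif customer_id not in NEXT_BEST_OFFER:
        status, body = "404 Not Found", {"error": "unknown customer"}
    else:
        status = "200 OK"
        body = {"customer_id": customer_id, "offer": NEXT_BEST_OFFER[customer_id]}

    payload = json.dumps(body).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]
```

For a local try-out, the callable can be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. A production setup would add authentication, logging and a proper application server, which brings us back to the question of how many skills one person can cover.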
To wrap up, even in a Data Science Process step like Data Processing, which could look less complex than others at first sight, the quantity of technical crossroads, decisions to take, options to select and technologies involved is so huge that it’s very rare to find all the required skills in a single person. So we get another confirmation that Data Science is definitely a team sport!