Today, Data Science is nearly everywhere.
Quite often we visit a web page, select a product, and are shown products that other users have bought. And how many times have you typed a search into Google and had it finish the phrase for you?
But do we really know what Data Science and Big Data are about? Do we understand what they mean? And above all, do we know how to approach a Machine Learning project?
In our previous blog post, Data Science: A New Approach For Problem Solving and Business Strategy, we introduced the key terms and the process for launching data science in the business. Now we will describe the key issues of the Data Science process and how to approach each phase. In this post we explain how to set the stage for a Data Science project.
Common Challenges of new Data Science Projects
Business processes are analytical objects of continuously growing complexity. The information to analyze comes from different data sources, in different formats, and must be analyzed as quickly as possible.
What are the challenges we face when putting a Data Science project into production in our companies? There is no concrete answer since each case is different (and should be treated as such) - but we can highlight some of the most common:
A lack of knowledge and specialized profiles; existing organizations and technological architectures designed mostly for traditional BI projects; data volumes and variety beyond what companies are used to managing; and real-time data streaming, which is a brand new topic for many.
These challenges are mainly technical, but probably the most important one to address is identifying the business driver that directly impacts the company's income statement by increasing revenues and reducing costs. The most effective way to sell a Data Science project to the business is to demonstrate which business problems it will solve and what impact it will have on company results.
In this scenario, it is clear that the approach to Data Science projects cannot be the one we are used to in traditional Data Warehousing or Business Intelligence projects.
From our point of view, the most important thing when approaching this type of project is to be imaginative. We face new problems that cannot be solved with traditional approaches, so we must take on these projects with a mindset free of bias.
The most common methodologies used for Advanced Analytics projects start with a step called Problem Statement or Problem Shaping. This is a process of identifying the problem we want to solve and the business benefits we want to obtain. This is quite different from classic BI projects where the business problem is well known in advance.
How do we do it? We must be able to ask ourselves a lot of questions and, more importantly, the right questions.
The Golden Rule for defining a project goal is to ask and refine "sharp" questions that are relevant, specific, and unambiguous. “How can I increase my profit?” is not a good question for any machine learning solution; “Which kind of car in my fleet is going to fail first?” or “How much energy will my production plant consume in the next quarter?” are stronger examples of sharp questions.
If we want data to work for us, we must be able to ask the right questions. Once those questions are formulated, the data can offer valuable perspectives, support good predictions and disclose a lot of knowledge.
Moreover, problem shaping is a typically “self-generating” process: as in brainstorming, good innovative questions easily lead the team to produce additional smart questions. Lateral thinking is a valuable soft skill in this project phase.
Another significant aspect is the ability to communicate the results given by the data. People naturally have biased opinions that affect how they perceive results, so we must find the most effective way to “tell the story” about the data; this is a highly relevant step in a project's success.
Are there tools that can help define the Problem Statement?
The problem statement is the step of the Data Science process that depends most on soft skills (as opposed to technological or hard skills). Nevertheless, since it is based on questions and data, sometimes a lot of data, it is beneficial to have a data analysis tool at hand (sorry, big data analysis cannot and should not be done with Excel!).
In this project phase a key factor is the collaboration between data scientists and business users, who, at the end of the day, have the widest business knowledge and are therefore the ones who will set the path to success. In our experience, this collaboration is greatly facilitated by data visualization tools.
Data visualization tools like Qlik or Tableau can typically access several kinds of structured and unstructured data sources directly, so they can be applied on top of raw data. They are extremely effective at identifying trends, anomalies and outliers in the analyzed data, and far more productive than a classical tabular approach.
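To make the idea of an "outlier" concrete, here is a minimal sketch in Python of the kind of check a box plot in a visualization tool performs visually. It is only an illustration under stated assumptions: the function name iqr_outliers, the 1.5 multiplier and the sample readings are hypothetical, and the interquartile-range rule is one common convention, not the method used by any particular tool.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1                       # interquartile range
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical daily energy readings (kWh) for a production plant.
readings = [102, 98, 105, 101, 99, 103, 100, 250, 97, 104]
print(iqr_outliers(readings))  # prints [250]
```

A reading flagged this way is not an answer in itself; it is a prompt for the next sharp question to discuss with business users (was the 250 kWh day a sensor fault or a real production peak?).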
As we said before, we must keep in mind that a Data Science project is definitely a Business project, so it must always be oriented toward business results and have a global vision aligned with the business strategy.
Traditional BI projects were typically set on long-term objectives, so the client often did not see results until the project was fully complete; in many cases this produced deviations in both cost and scope. Machine Learning projects must set short-term objectives and be managed with an agile approach. The loop between business questions, hypotheses and data evidence must be continuous: new findings must drive and improve subsequent project waves, and results, even partial ones, need to be shared with business people to keep their commitment high.
In our experience at Techedge, Notebooks (Jupyter is the best known, but many others are available) are an effective tool for explaining to business users what the technical people are doing, what the data are telling us and which results were obtained by applying models and algorithms. They essentially create a sort of “common ground” where we can mix technical information and business concepts in order to maintain vital project alignment.
To conclude and summarize the main topic of this post: for a good problem statement step you need to be curious, sharp and ALWAYS CREATIVE!