Data exploration is the most human-centric step of the Data Science process: as such, it is the simplest to understand, but also the simplest to misunderstand. Behind straightforward numbers and eye-catching, colourful charts, several traps are hidden.
But let’s start from the beginning.
Data Exploration - aka, Data Science for Humans
According to Wikipedia, Data Exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data.
Let’s dig into this definition.
Data Exploration is an approach similar to initial data analysis: actually, it is the initial data analysis. Exploration comes before any statistical analysis or machine learning model. This is critical to avoid an insidious danger: relying blindly on summary indicators, such as mean and standard deviation. Simpson’s paradox is a well-known example of how global indicators may be superficial and misleading: a trend that holds within every subgroup can weaken or even reverse once the groups are pooled. It is of course an academic example, but something similar may also happen in the real world, as you will see in a minute.
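The reversal behind Simpson’s paradox can be reproduced with a few lines of dependency-free Python. This is only a sketch on made-up numbers: the two groups and the `pearson` and `split` helpers are assumptions introduced for illustration, not part of any dataset discussed in this article.

```python
import math


def pearson(xs, ys):
    """Pearson's linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split(points):
    """Turn a list of (x, y) pairs into two parallel lists."""
    return [x for x, _ in points], [y for _, y in points]


# Two hypothetical groups: inside each one, y decreases as x increases.
group_a = [(x, 10 - x) for x in range(1, 6)]
group_b = [(x, 20 - x) for x in range(6, 11)]

r_a = pearson(*split(group_a))
r_b = pearson(*split(group_b))
# Pooling the groups hides the fact that they sit at different levels.
r_all = pearson(*split(group_a + group_b))

print(f"group A: r = {r_a:+.2f}")
print(f"group B: r = {r_b:+.2f}")
print(f"pooled:  r = {r_all:+.2f}")
```

Within each group the correlation is negative, yet the pooled coefficient comes out positive, because the second group has both larger x and larger y. A single global number would suggest the opposite of what happens inside every group.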
Data Exploration happens when a data analyst uses visual exploration to understand what is in a dataset: of course, it is more complex than this. Imagine reading a huge table, with thousands of rows and tens of columns, full of numbers. You are visually exploring the data, but there is no way you will get any insight out of it. That’s because we are not designed to crunch huge tables of numbers. We are great at reading the world in terms of shapes, dimensions and colours. And that’s what Data Visualization enables: once translated into lines, points, and angles, numbers are way easier to read.
Unfortunately, here comes a second danger: poorly designed or misleading charts. Sometimes, the wrong visualization prevents Data Scientists from catching the correct insight or from sharing the correct information. A collection of great examples was published some weeks ago by Sarah Leo, from The Economist.
Data Exploration aims to investigate the characteristics of the data. To be more precise, it has two main goals:
- Highlight traits of single variables
- Uncover patterns and relationships between variables
Both goals are of paramount importance, as they guide the subsequent In-Depth Analysis. More than words, a real case study may help in proving this claim and showcasing the traps of Data Exploration.
A case study: temperature and power load
We will use a public dataset of Greek power load and air temperature. The available data covers 4 years with hourly granularity; for the sake of simplicity, we will consider 2007 only. Let us suppose that we are developing power forecasting algorithms and want to understand whether temperature may be a useful input.
After proper preprocessing, the data looks like this:
A first attempt may be to compute Pearson's linear correlation:
We get a sad 0.42. We may be tempted to neglect temperature and move on, but we are well aware of the danger hidden in summary indicators. Thus, we perform a proper visual analysis:
Now we can see that a clear relation is there, but it’s not linear, so linear correlation cannot highlight the pattern. A proper predictive model, however, can capture it. The chart saved us from drawing a very wrong conclusion and gave us a great hint for improving our models. However, the same chart is hiding something. If you look closely, you may notice something strange in the left part, as if there were two different clouds of points. Let us change the plot a little bit:
The relation between power load and temperature changes with the hour of the day. This is another useful clue for designing effective models, but it was hidden behind a poor chart. Just adding the time of the day in the form of a colour scale made the pattern evident.
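Both effects can be mimicked on synthetic data. The sketch below does not use the Greek dataset: the U-shaped curve, the turning point `COMFORT_TEMP`, the two periods and the `synthetic_load` function are all assumptions chosen purely for illustration, so the coefficients it prints will not match the 0.42 reported above.

```python
import math


def pearson(xs, ys):
    """Pearson's linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


COMFORT_TEMP = 18  # hypothetical turning point of the assumed U-shaped curve


def synthetic_load(temp, period):
    """Made-up load: grows with the distance from COMFORT_TEMP, more steeply by day."""
    base, steepness = (100, 1.0) if period == "day" else (50, 0.2)
    return base + steepness * (temp - COMFORT_TEMP) ** 2


# One row per (temperature, period) pair, temperatures from 0 to 36 degrees.
rows = [(t, p, synthetic_load(t, p)) for p in ("night", "day") for t in range(0, 37)]

temps = [t for t, _, _ in rows]
loads = [load for _, _, load in rows]
# A simple nonlinear feature: squared distance from the turning point.
curvature = [(t - COMFORT_TEMP) ** 2 for t in temps]

r_linear = pearson(temps, loads)      # raw temperature against load
r_curved = pearson(curvature, loads)  # nonlinear feature against load

# Same nonlinear feature, but evaluated separately for each period.
r_by_period = {}
for period in ("night", "day"):
    subset = [(t, load) for t, p, load in rows if p == period]
    feature = [(t - COMFORT_TEMP) ** 2 for t, _ in subset]
    r_by_period[period] = pearson(feature, [load for _, load in subset])

print(f"temperature vs load:       r = {r_linear:+.2f}")
print(f"curvature vs load:         r = {r_curved:+.2f}")
for period, r in r_by_period.items():
    print(f"curvature vs load ({period}): r = {r:+.2f}")
```

On this toy data the raw temperature shows essentially no linear correlation with load even though load is fully determined by temperature and period. The nonlinear feature recovers part of the relation, and splitting by period recovers the rest, which is the numeric counterpart of adding the colour scale to the chart.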
We’ve shown how, in the real world, Data Exploration is critical to any Data Science project. As easy as it may seem, it hides insidious pitfalls which may prevent Data Scientists from unveiling the correct insights. In particular, the case study provided us with a few tips:
- Do not draw conclusions based on summary indicators
- Take care of your charts: the wrong one may fool you, while the right one may give you major hints
- Be human: listen to your intuition and investigate every time you feel something is strange