Data Science Process: Raw Data Collection

Stefano Oddone | Apr 17, 2019

Since data are the cornerstone of every analytics activity, we need to invest a significant part of our time in understanding them. In this post I’ll focus specifically on where to find them, which techniques and tools are most useful and, more importantly, which skills you need to be successful in this key step of the Data Science Process.

Data types and data sources

There are many attributes that could be used to classify data; in my humble opinion, one of the most relevant is internal versus external. Companies tend to put a lot of emphasis on their internal data because they are available, apparently cheap and, moreover, considered “the truth”; to keep this post decently short, I’ll refrain from elaborating on that last point. Having said that, I would like to draw your attention to the fact that the vast majority of data about the business you’re running are produced outside your firewall: market data, competitor data, customer data, prospect data, analyst reports, consumer blogs, end-user forums, tweets. Used together, they inevitably produce a clearer, better-defined and more useful picture of the overall scenario than any internal source ever could.

Now, if you grant me that external data can help you understand your own business, you have to consider the difference between public and private data. If data are public, all your competitors have the same opportunity as you to leverage them in their analysis; what makes the difference is how you decide to use them. If you give me and a master chef the very same ingredients and recipes, I can assure you that the final results will be very different: the competitive advantage clearly lies in the ability to exploit the same ingredients better.

On the other hand, if you know things that your competitors don’t, this is a potential competitive advantage - I say potential in light of what I just stated above - but generally speaking “knowledge is power”, so we can assume that the more you know, the better you decide.

Private data (e.g. personal shopping behaviors, position tracking, service subscriptions, ...) have some small side effects: you have to pay for them (I assume there are no hackers among my readers), they’re more complex to update and maintain, and their future availability is not always certain, so they can be useful for point-in-time analysis but risky to build into a long-term data strategy. Ah, I almost forgot: private data tend to raise data privacy problems, so please keep your Legal Department on board from the very beginning.

Data collection tools

Data collection is a key step of every Data Analytics journey. Fortunately there are many helpful tools to manage this task efficiently, ranging from typical ETL or ELT tools like Oracle Data Integrator, IBM DataStage and Microsoft SQL Server Integration Services (the successor to DTS) to cloud-oriented data integration tools like Talend or Azure Data Factory. At Techedge we’re proficient and experienced in querying, filtering, cleaning, transforming and finally storing data - both for small data marts and huge data lakes.

When it comes to real-time data streams, things are pretty different - there’s no time to transform them. Unlike in batch data flows, data quality tasks are better described as “noise reduction” activities, and the technologies used are very specific: Kafka, Azure Event Hubs, AWS Kinesis and Google Cloud Dataflow are your best friends for getting this kind of job done.
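To make the idea of “noise reduction” a bit more concrete, here is a minimal Python sketch. It is not tied to Kafka, Event Hubs or Kinesis: the stream is simulated with a plain iterator, and the function name `denoise`, the window size and the valid range are all assumptions chosen for illustration. The point is only that each event is handled as it arrives, without waiting for a complete batch.

```python
from collections import deque
from statistics import median
from typing import Iterable, Iterator


def denoise(readings: Iterable[float], window: int = 3,
            low: float = -50.0, high: float = 150.0) -> Iterator[float]:
    """Yield one smoothed value per valid reading, event by event.

    Readings outside [low, high] are dropped as noise; the others are
    smoothed with a rolling median over the last `window` valid values.
    The thresholds are hypothetical, not taken from any real sensor.
    """
    recent = deque(maxlen=window)
    for value in readings:
        if not (low <= value <= high):
            continue  # discard an obviously corrupt event, keep the flow going
        recent.append(value)
        yield median(recent)


# Simulated sensor stream: 999.0 is a glitch, 80.0 is a plausible but odd spike.
stream = iter([20.0, 21.0, 999.0, 22.0, 80.0, 23.0])
cleaned = list(denoise(stream))
print(cleaned)
```

The glitch is filtered out by the range check, while the rolling median absorbs the spike without any look-ahead, which is the kind of lightweight treatment a stream allows.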

If you need to source data from websites and blogs (web scraping), you may find web tools like Mozenda or Octoparse beneficial, and if you have very specific needs (like email address, image or phone number extraction) it’s easy to find dedicated tools for those tasks.
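For a sense of what such tools do under the hood, here is a small sketch that extracts links from a page using only the Python standard library. The class name `LinkExtractor` and the sample HTML are invented for the example, and the page is embedded as a string so the code runs offline; a real scraper would also have to fetch the page and respect the site's terms of use.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag found in the document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page content, inlined so that no network access is needed.
SAMPLE_HTML = """
<html><body>
  <h1>Consumer blog</h1>
  <a href="https://example.com/post-1">First post</a>
  <a href="/forum/thread-42">Forum thread</a>
  <a name="anchor-only">No link here</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(SAMPLE_HTML)
links = parser.links
print(links)
```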

Now that I have mentioned the most successful data integration solutions, I can reveal a little secret: all of these technologies are great and very helpful for increasing productivity, reliability and traceability… but if you happen to be “in a hurry” (and sometimes it happens), please consider that an experienced coder in “Tasmanian Devil mode” can be surprisingly fast and efficient at producing Python, Java or Scala code that smoothly ingests your data sources, any sources.
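As an illustration of how little code a quick hand-written ingestion can take, here is a sketch in Python that parses a CSV export, filters and cleans the rows, and stores them in SQLite. Everything specific in it is an assumption: the sample data, the `orders` table, the `ingest` function and the cleaning rules are made up for the example and do not describe any particular project.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would be a file or an API response.
RAW_CSV = """customer_id,country,amount
1,IT,100.50
2,it, 20
3,DE,not_a_number
4,,15.00
"""


def ingest(raw: str, conn: sqlite3.Connection) -> int:
    """Parse, clean and load the rows; return how many were stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(customer_id INTEGER, country TEXT, amount REAL)"
    )
    clean_rows = []
    for row in csv.DictReader(io.StringIO(raw)):
        country = row["country"].strip().upper()  # normalize country codes
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # filter: skip rows whose amount cannot be parsed
        if not country:
            continue  # filter: skip rows with a missing country
        clean_rows.append((int(row["customer_id"]), country, amount))
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean_rows)
    conn.commit()
    return len(clean_rows)


conn = sqlite3.connect(":memory:")
loaded = ingest(RAW_CSV, conn)
stored = conn.execute(
    "SELECT customer_id, country, amount FROM orders ORDER BY customer_id"
).fetchall()
print(loaded, stored)
```

What a script like this gives up compared with a platform is exactly what was listed above: scheduling, lineage and traceability all have to be added by hand.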

Let’s summarize: “Long live market-leading platforms and out-of-the-box services... in the realm of Coding Knights”

Required skills

Yes, we are in the Cloud era, where most data are unstructured and reside in text, images, videos and clickstreams (we have been ordering facts into rows and columns for a relatively short time, if you consider that the Sumerians are credited with inventing written language back in the 4th millennium BC), but to me SQL and relational database theory are still a must - fundamental knowledge that every Data Engineer needs to have. You may invest in the most innovative and shiny data integration technologies, but it will be the old and wise Structured Query Language that is there looking at you with its good-natured and reassuring smile.
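As a small reminder of why SQL keeps earning its place, here is a sketch that joins an internal table with an external one to compute a market share per region. It uses Python's built-in `sqlite3` module; the table names `sales` and `market_size` and all the figures are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, revenue REAL);        -- internal data
    CREATE TABLE market_size (region TEXT, total REAL);    -- external data
    INSERT INTO sales VALUES ('North', 120.0), ('North', 80.0), ('South', 50.0);
    INSERT INTO market_size VALUES ('North', 1000.0), ('South', 200.0);
""")

# Aggregate internal revenue per region and compare it with the external total.
query = """
    SELECT s.region,
           SUM(s.revenue)                   AS revenue,
           SUM(s.revenue) * 100.0 / m.total AS market_share_pct
    FROM sales AS s
    JOIN market_size AS m ON m.region = s.region
    GROUP BY s.region, m.total
    ORDER BY s.region
"""
rows = conn.execute(query).fetchall()
for region, revenue, share in rows:
    print(f"{region}: revenue={revenue}, market share={share}%")
```

A join, a grouping and an aggregate are enough to put an internal source and an external one side by side, whatever tool delivered the data.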

For more on this topic, don't forget to check out the previous blog post, How to Define a Problem Statement in the Data Science Process.

