If you have been following our Data Science Process series of blog posts, you have read some of the most knowledgeable voices in the industry writing about the Data Science process and how to approach the steps of this journey successfully. The initial steps are sometimes overlooked, but they are critical to the overall result and form the basis for the analysis and modeling that follow; without correct problem framing, reliable and meaningful data, and the ability to listen carefully to what the data has to say, algorithms and machine learning models are almost worthless.
As Moustapha Cisse, Head of the Google AI Center in Accra, Ghana, says: “as we are what we eat, sometimes we feed our models junk food”.
Assuming we are healthy and have addressed the challenges posed by the previous steps, our journey reaches another critical point. Machine Learning is one of the most powerful computing tools available and is helping all kinds of businesses analyze their performance, reduce costs, maximize revenue and generate new business models.
Applying Machine Learning
Machine Learning is a set of techniques that allow computer systems to predict, classify, sort, make decisions and, in general, extract knowledge from data without the rules needed to perform those tasks being explicitly defined. It is a subfield of Artificial Intelligence that aims to teach computers to perform tasks based on examples, without explicit programming.
Due to the continuous growth in business complexity and in the amount of data to be analyzed, we are moving from rule-based to data-driven intelligence, as depicted in the schema below.
Data are used to train models and teach them how to extract knowledge. These models are trained using different algorithms, chosen according to the kind of task we want our solution to perform.
Types of Machine Learning
Model selection is determined by the business problem. The problem framing phase helps us define the kind of answers we are looking for: do we want to explain the current situation? Predict future outcomes? Classify our business items?
Depending on those answers we will choose one or several models:
Here is a single example for each model class:
- Spend Classification is a critical step of any Procurement analysis: spend transactions are classified into a finite number of classes on the basis of their characteristics
- Clustering is widely used in Marketing and Customer Relationship Management; it makes it possible to group customers, on the basis of their shopping behavior, into clusters that are not defined a priori
- Chatbots and Personal Digital Assistants leverage Natural Language Processing (NLP) models to interpret and understand what end users type or say
- Demand Forecasting allows energy providers to predict the quantity of energy they have to supply, on the basis of previous years' consumption and other regressors such as forecasted temperature or a solar irradiation index
How does a machine learn?
Machine learning processes are often compared to the ways in which children learn; since the analogy is quite effective, we will use it here.
A teacher explains to pupils how to write by providing examples of correct words.
Training datasets contain labels stating whether each entry is “correct” or not (i.e. the expected output value or class is already attached to it); models are trained to match the examples and learn how to recognize “correct situations” on new data as well. Automatic translation systems are an example of supervised learning: in 2016 Google Translate switched from its previous statistical model to a Neural Network model, reducing the code from 500,000 lines to 500.
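As a minimal illustration of the supervised setting (a toy, invented dataset, nothing as ambitious as a translation system), a scikit-learn classifier can be trained on labeled examples and then queried on new, unseen inputs:

```python
# Supervised learning: the training data carries labels (the "correct" answers).
from sklearn.linear_model import LogisticRegression

# Toy labeled dataset: hours studied -> passed the exam (1) or not (0)
X = [[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)  # learn from the labeled examples

# The trained model generalizes to inputs it has never seen
print(model.predict([[1.5], [9.5]]))
```

The key point is that the rule separating "pass" from "fail" is never written explicitly; it is inferred from the labeled examples.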
Babies learn to pair forms and holes by playing with them and gaining experience; they learn that the object with four equal corners and four equal sides fits the square hole.
In unsupervised learning the model is trained to extract characteristics common to several of the examples (the training data) and then re-apply this knowledge to classify new datasets or detect anomalies in them.
Algorithms that group similar faces in pictures so that they can be tagged, like those available in Google Photos, iOS Photos and some social network applications, are examples of unsupervised learning.
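A minimal unsupervised sketch: k-means discovers the two groups hidden in a toy dataset without ever seeing a label (the point coordinates below are invented for illustration):

```python
# Unsupervised learning: no labels, the model finds structure on its own.
from sklearn.cluster import KMeans

# Toy data: two visually obvious groups of points, but no labels attached
points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # each point is assigned to one of the two discovered clusters
```

Note that we still had to tell the algorithm how many clusters to look for; choosing that number is itself part of the modeling work.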
Puppy training is based on this concept: when the puppy responds correctly to a command, it is rewarded with a treat, and over time it becomes able to repeat the expected behavior even in an environment different from the one where it was trained.
In reinforcement learning, models are trained to take actions within a defined environment so as to maximize a reward (or objective function), and to maintain this capability even when the environment changes.
Self-driving systems like Tesla Autopilot use reinforcement learning to move the steering wheel, accelerate and brake, using information from the car's sensors to detect the environment and maximizing a function that keeps the car safe and in its lane.
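A toy reinforcement learning sketch in plain Python (a hypothetical one-dimensional "corridor" world, invented for illustration): the agent receives a reward only at the goal state, yet tabular Q-learning lets it discover a policy that always moves toward that reward, much like the puppy and its treat:

```python
import random

# Tiny 1-D world: states 0..4; the reward (the "treat") sits only in state 4.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # move left or right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.1  # learning rate, discount, exploration rate
random.seed(0)

for _ in range(500):  # training episodes
    s = 0
    while s != GOAL:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)   # walls at both ends
        r = 1.0 if s2 == GOAL else 0.0          # reward only at the goal
        # Q-learning update: nudge Q toward reward + discounted future value
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# After training, the greedy policy moves right toward the reward in every state.
policy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(GOAL)]
print(policy)
```

No one ever tells the agent "go right"; the behavior emerges purely from the reward signal.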
The in-depth analysis is usually a cyclic process in which we define a model for our problem, test its results and either refine the model (adding parameters, changing weights, adapting the architecture) or discard it and try a new one.
In many cases, we may find that the best results (which, more or less, means the smallest errors) are not achieved by a single model but by an ensemble of models working together. In this configuration, one model may predict or classify one subset of the input data better while another model performs well on a different subset, so that the joint results are better than those of any individual model.
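A sketch of the ensemble idea using scikit-learn's VotingClassifier, which lets several heterogeneous models vote on each prediction (the choice of base models and the bundled Iris dataset here are purely illustrative):

```python
# Ensemble: several models vote, often beating any single one of them.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Three different model families combined by majority (hard) voting
ensemble = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
])
ensemble.fit(X, y)
print(ensemble.score(X, y))  # accuracy of the combined vote on the training data
```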
Machine Learning Toolbox
To make this iterative exploratory process more agile, we use Jupyter Notebook or similar cloud alternatives (Google Colab, Azure Notebooks, Amazon EMR or SageMaker notebooks). In these notebooks we can combine text, graphics and Python or R code to run models and algorithms. Notebooks are beneficial because they create a “common ground” for technical and business people; at Techedge we use them in the agile waves to check our progress and explain to customers what we have discovered in their data, to compare the results of various models and to identify which algorithms work best with certain datasets.
In recent years, tools that automate the training, validation and selection of algorithms have been released: brands like DataRobot or C3.ai promise to automate many Data Science Process steps and make them accessible to the “non Data Scientist” community. Generally speaking, these tools can add good value on specific tasks (e.g. model scoring and comparison), but they present some obvious limitations due to their specific design and the automation they introduce; in summary, we do not yet see the opportunity to apply them wall-to-wall in a complex machine learning implementation.
To build models and execute algorithms we can leverage several ML libraries that simplify the Data Scientist's job in many different tasks; all of them are free and open source, and together they provide a very powerful toolkit for building and running machine learning models.
Scikit-learn
Scikit-learn is a machine learning library for Python. We mainly use it for the various classification, regression and clustering algorithms that can be considered “classical” machine learning algorithms (support vector machines or random forests, just to name a few). It is a well-maintained, reliable and easy-to-use library that integrates perfectly with the Python numerical and scientific libraries NumPy and SciPy.
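The typical scikit-learn workflow (split the data, fit a model, score it on held-out data) can be sketched in a few lines, here with a support vector machine on the bundled Iris dataset, chosen only for illustration:

```python
# The classic scikit-learn pattern: split, fit, predict, score.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf")           # a support vector machine classifier
clf.fit(X_train, y_train)         # train on the labeled training split
print(clf.score(X_test, y_test))  # accuracy on data the model has never seen
```

The same fit/predict/score interface applies to virtually every estimator in the library, which is what makes swapping models during the exploratory cycle so cheap.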
TensorFlow + Keras
TensorFlow is a symbolic math library built for dataflow and differentiable programming across a range of tasks. It is probably the most widely used library for building and training neural networks. TensorFlow was first developed by the Google Brain team for internal Google use and was later released as an open-source resource.
TensorFlow gives great performance, both in training and in inference, but its syntax can be quite complicated. Keras is a neural-network library written in Python that runs on top of TensorFlow and abstracts away most of the complexity of building neural networks directly on TensorFlow.
TensorFlow 2.0, released in alpha state in April, includes Keras as the default high-level API for building and training machine learning models.
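With Keras as the high-level API, defining a small network takes only a few lines; the layer sizes below are arbitrary, chosen purely for illustration:

```python
# Keras on top of TensorFlow: a small neural network in a few lines.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),                        # 4 input features
    tf.keras.layers.Dense(16, activation="relu"),      # hidden layer
    tf.keras.layers.Dense(3, activation="softmax"),    # e.g. 3 output classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()  # prints the layer structure and parameter counts
```

Compare this with writing the same network in raw TensorFlow operations: Keras hides the graph plumbing while keeping TensorFlow's training and inference performance underneath.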
PyTorch + Fastai
As with TensorFlow and Keras, PyTorch and Fastai are two Python libraries that complement each other quite nicely.
PyTorch is a Python machine learning library, originally developed by Facebook's artificial intelligence research group and mainly used for applications such as natural language processing and computer vision. Fastai is a library that runs on top of PyTorch and radically simplifies the process of building, training and fine-tuning machine learning models, allowing us to prototype and build models very quickly.
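In plain PyTorch (without the Fastai layer on top), a small network is defined as a module and called on a batch of inputs; the class name and layer sizes here are illustrative only:

```python
# PyTorch: define a small network and run one forward pass.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # 4 input features -> 16 hidden units -> 3 outputs
        self.net = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 3),
        )

    def forward(self, x):
        return self.net(x)

model = TinyNet()
out = model(torch.randn(2, 4))  # a batch of 2 samples, 4 features each
print(out.shape)                # one 3-value output per sample
```

Fastai wraps exactly this kind of module with high-level training loops, data loading and fine-tuning helpers, which is where the prototyping speed comes from.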
To wrap up, the in-depth analysis (a.k.a. modeling) applies the information obtained in the previous phases to define, train, select and execute the models that provide the business answers we set as our target at the beginning of the Data Science process.
Always keep in mind that the objective is not to build a gorgeous machine learning model, but to solve a concrete, real business problem by exploiting the potential that data and digital innovation provide.