Project Botticelli

Data Science Concepts: Machine Learning and Models

16 December 2014 · 2412 views

Fundamentals, Part 2

Rafal discusses the confusion (classification) matrix and prediction thresholds

This 1-hour module, by Rafal, introduces the essence of data science: machine learning and its algorithms, modelling and model validation. Data science differs from the traditional, statistics-driven approach to data analysis in that it makes extensive use of those algorithms to detect patterns that help us build predictive models. In itself, this is not new: data mining is an older name for it, and indeed a discipline focused on the use of machine learning for data analysis! The difference between the two is subtle: machine learning, in principle, does not need to be applied to typical, tabular data, and its algorithms can be used without tables of data, for example as part of an artificial intelligence subsystem of a game. However, in the context of data science and analytics, we usually analyse cases stored in tables (see the previous module for a full explanation of this concept). Once we apply machine learning algorithms to look for patterns in such data, we are, effectively, doing data mining.

There are three major steps when building models with machine learning: model definition, training, and validation. An important part of your job will involve selecting the most appropriate algorithms as you carry out these three steps. Almost all machine learning algorithms can be grouped into a few high-level classes, including classifiers, clustering, regressions, and recommenders. You will find several examples of algorithms belonging to each of these classes in SQL Server Analysis Services, the open-source R software, cloud-based Azure Machine Learning, and the Mahout library that runs on Hadoop. This video briefly shows examples of each, starting by showing you where to find them in SSAS, which is explained in much more detail in our separate course on data mining with SSAS. You will also see how to use Rattle in R to build a simple decision tree, and how to use a sample outlier detection model in Azure ML, although this is the subject of the next few modules of this course. Mahout and Hadoop use yet another approach to modelling, which is briefly mentioned whilst discussing a simple yet very useful data format known as a triple. Triples are particularly useful for preference and recommendation modelling, for example with collaborative filtering and matchbox recommenders, both explained later in this course.
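To make the idea of a triple more concrete, here is a minimal sketch in Python. It assumes a triple holds (user, item, preference value), which is a common layout for recommender input; the module text does not spell out the fields, so treat the field order, the sample data, and the helper name `to_preference_table` as illustrative assumptions rather than the format used in the video.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed layout of a triple: (user, item, preference value).
Triple = Tuple[str, str, float]


def to_preference_table(triples: Iterable[Triple]) -> Dict[str, Dict[str, float]]:
    """Group triples into a sparse user -> item -> preference lookup."""
    table: Dict[str, Dict[str, float]] = defaultdict(dict)
    for user, item, value in triples:
        table[user][item] = value
    return {user: dict(items) for user, items in table.items()}


# Made-up sample data: one row per known preference, nothing stored for unknowns.
triples = [
    ("alice", "film_a", 5.0),
    ("alice", "film_b", 3.0),
    ("bob", "film_a", 4.0),
]

table = to_preference_table(triples)
print(table["alice"])            # preferences recorded for alice
print("film_b" in table["bob"])  # bob has no recorded preference for film_b
```

The appeal of the format is that only known preferences are stored, so a very sparse user-by-item matrix stays small, and the missing entries are exactly what a recommender tries to predict.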

Successfully building a model is only the beginning: you must validate your model for accuracy, reliability, and usefulness before proceeding. Lift charts, precision-recall plots, and ROC (Receiver Operating Characteristic) plots help you test a model's accuracy. However, at the heart of accuracy validation lies a very basic, simple, yet immensely powerful concept: a confusion matrix, also known as a classification matrix, from which almost every metric of model behaviour can be derived, including precision and recall. It is also useful for determining the most appropriate threshold, or cut-off value, of the predictable outcome's probability, so as to best balance the likely number of false positives against false negatives, a balance that is often driven by business decisions.
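The relationship between the threshold, the confusion matrix, and the metrics derived from it can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method shown in the video: the labels and predicted probabilities are made up, and the names `ConfusionMatrix`, `confusion_matrix`, `precision`, and `recall` are hypothetical helpers written for this example.

```python
from typing import NamedTuple, Sequence


class ConfusionMatrix(NamedTuple):
    tp: int  # predicted positive, actually positive
    fp: int  # predicted positive, actually negative
    fn: int  # predicted negative, actually positive
    tn: int  # predicted negative, actually negative


def confusion_matrix(actual: Sequence[int], probs: Sequence[float],
                     threshold: float) -> ConfusionMatrix:
    """Count outcomes when a case is predicted positive if prob >= threshold."""
    tp = fp = fn = tn = 0
    for label, prob in zip(actual, probs):
        predicted_positive = prob >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return ConfusionMatrix(tp, fp, fn, tn)


def precision(m: ConfusionMatrix) -> float:
    """Share of predicted positives that really are positive."""
    return m.tp / (m.tp + m.fp) if (m.tp + m.fp) else 0.0


def recall(m: ConfusionMatrix) -> float:
    """Share of actual positives that the model found."""
    return m.tp / (m.tp + m.fn) if (m.tp + m.fn) else 0.0


# Made-up data: true labels (1 = positive) and the model's predicted probabilities.
actual = [1, 1, 1, 0, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

for threshold in (0.3, 0.5, 0.7):
    m = confusion_matrix(actual, probs, threshold)
    print(f"threshold={threshold}: {m} "
          f"precision={precision(m):.2f} recall={recall(m):.2f}")
```

On this sample, lowering the threshold to 0.3 removes every false negative at the cost of two false positives, while raising it to 0.7 removes every false positive at the cost of two false negatives. Which trade-off is preferable depends on what each kind of error costs the business.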

Reliability testing requires additional approaches, such as partitioning (also known as k-fold cross-validation), while the usefulness of a model can only be assessed by a human domain expert. This is made easier by visualising a model, which is very easy in SSAS, somewhat easy in R, but unfortunately not so easy in Azure ML at this stage, so you may want to use all three of those tools interchangeably whilst building your experiments and models. The final stage of model validation is also a prelude to its real-life use, and involves designing a real experiment, such as an A/B test.
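The partitioning idea mentioned above can be sketched as follows. This is a minimal, library-free illustration in Python, assuming the usual scheme in which the cases are shuffled, split into k folds, and each fold takes one turn as the test set; `k_fold_indices` is a hypothetical helper, not a function from SSAS, R, or Azure ML.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_cases: int, k: int,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    if not 2 <= k <= n_cases:
        raise ValueError("k must be between 2 and the number of cases")
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    folds = [indices[i::k] for i in range(k)]  # k roughly equal partitions
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


# Example: 10 cases split into 5 folds.
for fold_number, (train, test) in enumerate(k_fold_indices(10, 5), start=1):
    # A model would be trained on `train` and scored on `test` here.
    print(f"fold {fold_number}: {len(train)} training cases, test cases {sorted(test)}")
```

If a model's accuracy is similar on every fold, that suggests it is reliable; large differences between folds suggest that its behaviour depends heavily on which cases it happened to be trained on.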

If your modelling and experimentation succeed, it is time to consider using the model to instigate the desired business change. After that, and also if you did not succeed, you would usually need to repeat the entire process on a regular basis, because the underlying data, and therefore the model, will change over time. Above all, making this iterative process part of the way your business is run is key to helping your organisation run on meaningful and reliable analytics.
