Project Botticelli

Data Mining Model Building, Testing and Predicting with Microsoft SQL Server and Excel

30 June 2012 · 17 comments · 7783 views

The Full Data Mining Lifecycle


This 1-hour-20-minute video, part of our Data Mining Series, discusses the entire lifecycle of a Data Mining Model. You will learn how to build models and mining structures, starting by creating a Data Source and a Data Source View, how to train them with your data, and how to view the results. Most importantly, you will also understand how to verify a model’s validity by applying tests of accuracy, reliability, and usefulness. You will understand, and see in use, such key verification techniques as the Lift Chart, Profit Chart, Classification Matrix, and Cross Validation. Finally, you will see how to predict unknown outcomes using your model.

We begin by introducing a well-known process methodology for data mining, known as CRISP-DM. The key to any data mining project is understanding its iterative nature: to gain a better business understanding you explore the meaning of your data using mining models, and you necessarily modify those models to get better, more accurate results. In the process of doing that, however, you change your understanding of the original issue you wanted to resolve, and the process continues. It is a loop, one that rewards you with intelligence, which helps you improve your organisation.

There are five stages in a data mining model’s lifecycle: Model Definition, Training, Testing, Use (Scoring), and Update, and we discuss each in detail, starting with Defining a Model. The main model definition concepts, such as Cases, Keys, Column Data and Content Types, Usage, and Discretization, have been introduced in another module of this course, Data Mining Concepts and Tools; please make sure to follow it if you are new to data mining. As there are several ways to build models, we briefly discuss the merits of using SQL Server Data Tools (SSDT), which is better suited to project-focused work, where you return to the same model over a period of time, perfecting it, as opposed to working entirely within Microsoft Excel, which is simpler and easier to use for a more casual data miner. In addition, you will hear, very briefly, about using PMML (Predictive Model Markup Language) and DMX (Data Mining Extensions) as alternative ways to define models.
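
For a flavour of the DMX alternative, here is a minimal sketch of a structure and model definition; the structure, model, and column names are illustrative, not taken from the demos:

    -- Define a mining structure holding the case columns
    CREATE MINING STRUCTURE [Customer Signatures] (
        [Customer Key] LONG KEY,
        [Age]          LONG CONTINUOUS,
        [Occupation]   TEXT DISCRETE,
        [Income]       DOUBLE CONTINUOUS,
        [Purchases]    LONG DISCRETIZED(AUTOMATIC, 5)
    )

    -- Add a Decision Trees model that predicts the Purchases column
    ALTER MINING STRUCTURE [Customer Signatures]
    ADD MINING MODEL [Purchases DT] (
        [Customer Key],
        [Age],
        [Occupation],
        [Income],
        [Purchases] PREDICT
    ) USING Microsoft_Decision_Trees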

Time to build a model! In the first series of demos you will see a Data Source and a Data Source View being created, so that we can use a simple, flat table of so-called customer signatures. It contains their demographic characteristics, such as age, occupation, and income, together with the known predictable outcomes, in our case the number of purchases each customer has made in our store, which we wish to study by building a mining model. Although our table comes from SQL Server, there is no restriction on where your data is stored: while the SQL Server Analysis Services (SSAS) Multidimensional and Data Mining engine performs the analysis, the data itself can come from any database, spreadsheet, or even a text file. The next step, creating a Mining Structure together with a Mining Model, is the heart of the mining process. Once defined, a mining model needs to be trained using your data. You will see how we train our model, and how we visualise the results using the Decision Tree viewer. A common question concerns the amount of data needed for training. As Rafal explains, there are no fixed rules on a specific number of cases; however, there are qualitative rules about the validity of the model that must be fulfilled, and these are often related to the amount of training data.
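
In DMX terms, training the structure, and thereby its models, amounts to an INSERT INTO statement. A minimal sketch, again with illustrative names, assuming a Data Source called [My Data Source] already exists on the SSAS server:

    -- Train the structure: SSAS reads the cases and processes its models
    INSERT INTO MINING STRUCTURE [Customer Signatures] (
        [Customer Key], [Age], [Occupation], [Income], [Purchases]
    )
    OPENQUERY([My Data Source],
        'SELECT CustomerKey, Age, Occupation, Income, Purchases
         FROM dbo.CustomerSignatures')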

Testing and validation of models is of paramount importance. After all, what good is a working model if it cannot be trusted? You will hear an extensive explanation of testing model Accuracy, Reliability, and Usefulness, just before you see it being done on our model. The demo explains, and shows at the same time, the most important validation techniques: Lift Charts, Profit Charts, Classification Matrices, and, finally, Cross Validation for reliability testing.
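
While the Lift, Profit, and Classification Matrix tests are easiest to run from the Mining Accuracy Chart tools in SSDT or Excel, Cross Validation can also be invoked as a DMX system stored procedure. A sketch, assuming the illustrative structure and model defined earlier:

    -- 10-fold cross-validation of the Purchases DT model
    CALL SystemGetCrossValidationResults(
        [Customer Signatures],
        [Purchases DT],
        10,            -- fold count
        0,             -- max cases (0 = use all cases)
        'Purchases'    -- target attribute being predicted
    )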

If your model passes those tests, you may want to use it to make predictions about the unknown by applying it to new data, which is sometimes called Model Scoring, or scoring data with a model. This is shown in two demos, both achieving the same results: first using SSDT, and later, in a simpler but equally powerful way, using Excel. You will see how we use our just-created, and validated, model of customers’ characteristics to predict the most likely numbers of purchases that future, potential customers might make. This is a very common usage scenario: you might apply a mining model to a marketing list to discover which customers are most likely to be interested in your products, and even to find out their profit potential to your company, their impact on your credit, their possibility of churn, or even their risk of fraud.
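
In DMX, scoring takes the form of a PREDICTION JOIN between the trained model and the new cases. A minimal sketch, with illustrative table and column names:

    -- Predict the likely number of purchases for each prospect
    SELECT
        t.[CustomerKey],
        Predict([Purchases])            AS [Predicted Purchases],
        PredictProbability([Purchases]) AS [Probability]
    FROM [Purchases DT]
    PREDICTION JOIN
        OPENQUERY([My Data Source],
            'SELECT CustomerKey, Age, Occupation, Income
             FROM dbo.Prospects') AS t
    ON  [Purchases DT].[Age]        = t.[Age]
    AND [Purchases DT].[Occupation] = t.[Occupation]
    AND [Purchases DT].[Income]     = t.[Income]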

If you would like to follow the demos shown in this video, you will need access to a working installation of SQL Server Analysis Services (2012+; 2017 works well) and the database engine—get a trial if you don’t have it, or the free Developer edition. You should also have access to Microsoft Office Excel 2013+ (2016 works well) and the free Data Mining Add-Ins, which come with their own sample mining data in an Excel workbook. You should also use the Adventure Works DW datasets, available from Codeplex, and you may want to use our own dataset, HappyCars, which is available for download from here.

Log in, or purchase access below, to view the premium version of this content.

Purchase This Course or Full Access Subscription

Single Course

$250/once

Access this course for its lifetime*.
Subscription

$480/year

Access all content on this site for 1 year.
Group Purchase

from $480/year

For small business & enterprise.

  • Redeem a prepaid code
  • Payment is instant and you will receive a tax invoice straight away.
  • Your satisfaction is paramount: we offer a no-quibble refund guarantee.

* We guarantee each course will be available for at least 2 years from today, unless it is marked above with a Course Retiring date.

Comments

hxy0135 · 14 August 2012

Hi Rafal,
Great lesson! I am looking forward to the next one.
I need to conduct a Customer Profitability study and a Product Profitability study. I hope I can get a quick start by focusing on the algorithms that apply to these tasks.
Could you please point me in the right direction on which algorithms I should use for each?

Thank you!

Rafal Lukawiecki · 14 August 2012

Hi Hua, thank you very much for the feedback, I am glad you enjoyed this chapter. Customer and Product Profitability analysis is one of the more frequent applications of Data Mining, so thank you for asking about the algorithms. You can use several different ones, depending on how you do the analysis, but it is often easiest to start with Decision Trees, and to progress through Clustering and into Neural Networks. Naïve Bayes can also help discover simple relationships between your inputs, if you have not already done that. You might even do something a little more exotic, like Association Rules between profit and customer characteristics, but I think a DT will give you the most at an early stage.

The main influence on the choice of the algorithm will be whether you decide to discretize profit or to stick with a continuous value. Since you are likely to be predicting profit (i.e. profit becomes the Predict column, the output), you will get a very different Decision Tree depending on that column’s Content Type. If it is continuous, you will see a regressive Decision Tree, that is, one filled with formulas that attempt to express profit as a function of your inputs. If it is discrete (or discretized), the DT will show the more traditional tree of nodes with probability-based conditions. Both approaches work, and both are very useful. In the end, it is all about the process and the discoveries you make about your data, and I am sure you will learn a lot about your profitability in the course of doing mining. BTW, I hope we will have a Decision Tree video module this autumn.
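
To make the distinction concrete, here is a hedged DMX sketch (all names invented for the example) showing the same algorithm producing the two kinds of trees, depending on the content type of the predictable column:

    -- An illustrative structure with profit both continuous and discretized
    CREATE MINING STRUCTURE [Profit Study] (
        [Sale Key]    LONG KEY,
        [Region]      TEXT DISCRETE,
        [Profit]      DOUBLE CONTINUOUS,
        [Profit Band] DOUBLE DISCRETIZED(EQUAL_AREAS, 5)
    )

    -- Continuous predict column: a regressive tree, with formulas in its nodes
    ALTER MINING STRUCTURE [Profit Study]
    ADD MINING MODEL [Profit Regression] (
        [Sale Key], [Region], [Profit] PREDICT
    ) USING Microsoft_Decision_Trees

    -- Discretized predict column: a classification tree of
    -- probability-based conditions over five profit bands
    ALTER MINING STRUCTURE [Profit Study]
    ADD MINING MODEL [Profit Bands] (
        [Sale Key], [Region], [Profit Band] PREDICT
    ) USING Microsoft_Decision_Trees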

hxy0135 · 14 August 2012

Thank you Rafal for your quick reply!

hxy0135 · 14 August 2012

Hi Rafal,
We have over 1 million products in our catalog, and more than two million transactions in a year. Among our transactions, 66% consist of one item, 16% have two items, and 8% have three items. I really do not know where I should start with that much data. Any advice, please?

Thank you,
Hua

Rafal Lukawiecki · 15 August 2012

Hi Hua, it sounds like you have a nice and sensibly large dataset to analyse, but it also seems like you may not have prepared or reformatted it for mining before. Above all, you need to start by clarifying the business problem that you are trying to answer. Decide if you could answer it by using other forms of analytics—could you calculate the aggregate profitability of a product without mining, just by building flat or per-product-category totals? If you wish to mine, express your business goal in terms of outputs whose relationship you want to model on your chosen inputs to the data mining process.

Based on the data you mentioned, it sounds to me like you are looking for a form of Market Basket Analysis with Profitability, since you did not mention any customer demographic or similar information that would be used for Customer Profitability analysis, yet you mentioned the relationship of sales to product items.

After you have clearly defined your goal, you will need to prepare your data set:

  1. Clean it, by removing or relabeling any obvious outliers and errors, as much as possible,
  2. Flatten it, by converting relational schema data to simple flat rows representing Cases, as explained in the video, or, for convenience, a simple Nested Case, especially for a Nested Decision Tree or Association Rules for Market Basket Analysis (we will explain that in other videos),
  3. Sample it, if the set is too large for your chosen algorithm, or the number of inputs/outputs, or parametrisation.

However, one would rarely do this just once and be done with it. As you mine and analyse, you will repeatedly reformat your data, for example by grouping or clustering products into categories. Perhaps you will need to partition the set into different subsets, representing different departments or data gathering processes. At every stage, you may want to derive new data from sales facts and mine the derived data as an input, but often also as a new output. Needless to say, we would love to help you with this process if you need hands-on consultancy, because it is not practical to guide you by means of web site comments: there is a lot to ask and try out, usually 2–3 days of intensive work. On the other hand, you can learn a lot of it yourself, given time and practice. Please also have a look at some of the recommended books on the subject, and come back for new videos.

Just to get you started, take a small sample of your products, perhaps 1,000, and their related sales. Try to work in Excel while you are new to this. Keep the data in PowerPivot, because it is easy to derive new data there by using Calculated Columns. Surface that data as a flat Pivot Table, and mine that. Aim to form a flat table of Line Items consisting of a SaleID (Basket ID), a Product Name (or ID), and the Sale Profit (which, I suspect, you will need to derive from the sales amount and the cost of sale). Aim for 50–100k line items for now, just to see where it goes—if it works, you can scale to millions using SSDT later. Use the Data Mining Add-In’s Table Analysis Tools and the Market Basket Analysis button, or the Data Mining tab, to create a simple Association Rules model. You can browse either of the models using the Browse button on the Data Mining tab; make sure to look at the Dependency Network diagram.
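
If you later outgrow the Excel add-ins, the same Market Basket idea can be expressed in DMX with a nested table of products per basket. A sketch, with invented names and rather arbitrary parameter values:

    -- Each case is a basket (Sale ID) with a nested table of its products
    CREATE MINING MODEL [Market Baskets] (
        [Sale ID]  LONG KEY,
        [Products] TABLE PREDICT (
            [Product Name] TEXT KEY
        )
    ) USING Microsoft_Association_Rules(
        MINIMUM_SUPPORT = 0.01,      -- itemset must appear in 1% of baskets
        MINIMUM_PROBABILITY = 0.3    -- minimum rule confidence
    )

Training such a nested model uses the DMX SHAPE … APPEND syntax to relate the line items (the nested rows) to their baskets (the cases), which we will explain in other videos.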

Above all, know what you are looking for, and formalise your goal before you set off. If, however, exploration is all you need at this stage, keep flattening the data and try different algorithms to learn how they function. Enjoy!

hxy0135 · 16 August 2012

Dear Rafal,
Very much appreciate your very detailed reply!
I know I have a long way to go. I am reading the books you recommended and hope I can quickly reach a level where I can conduct the analysis I need to do. And you are right, I should define my goal and stay focused.

I have one more question, please! Is it the case that Product Profitability analysis is better done with aggregated data, while Market Basket analysis uses raw transaction line data?

Thank you very much!
Hua

Rafal Lukawiecki · 16 August 2012

Don’t get blocked, though, on trying to define perfect analytical goals, because they will change as you get answers from the process. Data Mining is iterative in this sense: we refine the question as we get answers.

Product Profitability analysis can be done in many ways, including using aggregated and raw data, but, in my experience, some level of aggregation always helps. The difficulty is finding the right level! For example, you may prefer to do an early, wide-ranging analysis aggregating profit at category level, and go more granular as you explore, ending even at the line item level. Market Basket Analysis can also be done at line item (transactional) level, but this can be very slow with larger data sets. I suggest that you start at product model level or higher, perhaps at product category level, and see what connections you find. Interestingly, you may find quite different associations at category level than at line item level. Ideally, you would do this analysis at all levels, provided you have the resources, good enough data, and a suitable goal.

joseusul · 24 September 2012

Dear Rafal,

I have some questions regarding the Profit Chart Test.
For example, in a typical marketing campaign, besides the fixed costs:
Should we include, as the individual cost, the salary of the person who contacts the customer? Should we allocate this cost across the different prospect calls?

With regard to the Revenue per individual:
What does it mean, exactly? For example, in the case of HappyCars, is it the average revenue of the cars we have historically sold?
Shouldn’t we apply different weights according to the type of car (luxury, sport, …)?

Thanks in advance for your time.

Kind regards, José.

Rafal Lukawiecki · 24 September 2012

Dear José,

Lift and Profit Charts are theoretical testing techniques that visualise the accuracy of the model being verified, in terms of that model’s ability to correctly predict an outcome for known training cases. To help present this information in a way that is easier for a human interpreter to understand, the Profit Chart uses the concepts of:

  • a hypothetical cost of contacting a customer,
  • a profit that is generated when the right customer is contacted, because the model has made a correct prediction,
  • plus some “fixed cost,” effectively an offset, for the entire experiment.
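
To illustrate how those three numbers combine (the figures here are invented for the example): with 10,000 prospects, a $3 individual cost, a $500 fixed cost, and $15 of profit per correctly predicted customer, contacting the model’s top 30% (3,000 people), of whom 1,000 turn out to be genuine buyers, would plot as 1,000 × $15 − 3,000 × $3 − $500 = $5,500 at the 30% mark of the chart.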

As you have observed, these numbers are basic, and they do not represent the realistically more complex scenario of a typical marketing campaign.

This test was not designed for real-life use, as it represents an abstract, idealised “marketing” campaign. Indeed, if applied directly to a real-life campaign, it can be quite misleading! However, the theoretical simplicity of the easy-to-understand Lift and Profit Charts makes them very useful for studying a mining model’s predictive power. There are other popular abstract tests which are not always related to real-life situations: for example, a Receiver Operating Characteristic (ROC) curve is often used to describe a statistical experiment, including marketing campaigns, although this technique originated in radar signal processing. While it still employs some of its early nomenclature, like sensitivity, ROC is now used in medicine, machine learning, and general statistics. In the same sense, you should not think of the Profit Chart as being specifically designed for marketing campaign evaluation, but as a generalised, and somewhat oversimplified (this is a good thing), way of evaluating mining models.

If you wish to study more fully the value of a mining model for a campaign that you are working on, you should prepare a set of test cases representing different customers and known expected outcomes, use your model to calculate predictions, and evaluate the predicted results using your own plot or an Excel chart. In time, you will build several interesting sets of test data, and you will be able to study your campaigns’ expected outcomes with far more precision than a simple Profit Chart would support. However, start with a Lift and a Profit Chart at the early stages of modelling.
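
Using the same DMX PREDICTION JOIN mechanism sketched earlier in this page, one hedged way to do that (all names invented) is to select the known outcome alongside the prediction, so you can chart them against each other:

    -- Compare known outcomes with predictions for a prepared test set
    SELECT
        t.[CustomerKey],
        t.[ActualResponse],
        Predict([Response])            AS [PredictedResponse],
        PredictProbability([Response]) AS [Probability]
    FROM [Campaign Model]
    PREDICTION JOIN
        OPENQUERY([My Data Source],
            'SELECT CustomerKey, Age, Income, ActualResponse
             FROM dbo.CampaignTestCases') AS t
    ON  [Campaign Model].[Age]    = t.[Age]
    AND [Campaign Model].[Income] = t.[Income]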

joseusul · 24 September 2012

Hi Rafal,

Thank you very much for your quick and very useful response.

Regards, José.

Austine · 7 October 2012

Hi. Thanks for this lesson. I am trying to predict which bikes my company should buy in the next five years, as we are in the bike hire business.
Do I have to include a date data set, or just the hire data set? Your advice will save a soul from two days of pain.

Rafal Lukawiecki · 9 October 2012

Dear Austine, I’m afraid I couldn’t give you an easy answer without knowing more about your data, and about any reliable patterns of bike rentals that we could base predictions on. To put it simply, you could try to make a prediction using a Time Series approach, based on aggregates at regular date intervals, but this technique would, at best, only predict a very short period, maybe a month or a few months, but not a year, and most certainly not five years. Any long-term predictions based purely on a time series tend to be speculations, at best, because of unpredictable factors such as changing trends, competition, natural disasters, politics, the economy, and so on. Short-term predictions, such as 1–5 periods ahead, can be done, provided that a historical pattern can be identified in your data. You can check for that using the Historical Model parameters of the Time Series algorithm. Even in that case, you have to be very careful about your predictions. I should add that even in statistics, long-term extrapolation of data through any form of regression is difficult, and unreliable enough to be useless.
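
For reference, a short-horizon Time Series prediction of the kind described above might look like this in DMX (the model and column names are invented); the Historical Model check mentioned corresponds to the HISTORIC_MODEL_COUNT and HISTORIC_MODEL_GAP parameters of the Microsoft_Time_Series algorithm:

    -- Forecast rentals for the next 3 periods only;
    -- longer horizons become increasingly speculative
    SELECT FLATTENED
        PredictTimeSeries([Rental Count], 3)
    FROM [Bike Rental Forecast]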

On the other hand, you could base your predictions on other correlations which you may have in your data. For example, your bike demand may be a function of your rental prices, so by changing prices you could influence that demand. I would expect that there is a fair number of such interesting correlations in your organisation’s business, not to mention “gut sense,” or insight. You should make good use of those for predictions.

Above all, your request sounds like a more involved analytical quest than the quick prediction you may have been looking for. I would estimate it should take a good few days of work to find out if it is possible, but making it work well could be a matter of a few weeks of effort, spread over a few full seasonal cycles.

luchofarje · 22 April 2014

Hello Rafal, I'm from Sweden and I'm following these very educational videos. I got a warning, and then an error, when I click Get Results on the Mining Accuracy Chart. These are:
When the max cases is set to 0 (zero), cross-validation all the data cases in the data set....
and then it seems that it's being processed, but I get a Windows error with no message: "Failed to execute the query due to the following error:
The following system error occurred: "
Do I need some patch? Is it a bug? Or is there some configuration I'm missing?
I'm using the SQL Server 2012 Evaluation Edition.
Thanks for your answer.
Lucho

Rafal Lukawiecki · 22 April 2014

Hi Lucho, the first message (“When the max cases”) means that the cross-validation routine will use all of the cases from your training data set for this procedure. If you have a lot of data this can make it run for a long time—because, by default, this process will build 10 models using 10 partitioning schemes.

You would need to look into the details of the second error before I could help you more precisely (you only provided the top-level error message, not the detail). However, I can make a guess that the problem is caused by a time-out parameter, configurable in SSDT, which means that the underlying cross-validation query never completes before the time-out. This would be in line with a situation in which you have a lot of data. May I suggest that you first try reducing the number of cases used for cross-validation by changing that zero parameter to a number such as 100? If it completes, increase the number a bit, until you hit the limitation of your hardware/SQL installation, and then change the query time-out parameters. Good luck and regards to Sweden! :)

luchofarje · 24 April 2014

Hello Rafal, thanks for answering. I have changed the query timeout parameter and tested with different parameters, but I still get the error message. I don't know where to find more detail about this error: I have checked the Windows logs, SSAS logs, and SQL logs, and I can't find any strange error message.
Could you please let me know which version (build number) you are using for the demo? I have a suspicion that this is a bug.
Thanks for your answer,
Lucho

luchofarje · 24 April 2014

Hello Rafal, could you please recommend a book to learn more about each data mining algorithm and its common usage scenarios, or tell me where I may find more information about this topic?
Thanks in advance,
Lucho

Mouhannad.Oweis · 13 April 2016

Thank you Rafal. Great Video :). A book reference would be great in my opinion. Thank you.
