Project Botticelli

Decision Trees in Depth Purchase this course

29 March 2013 · 6 comments · 4817 views

Classification, tree and linear regression, and associative analysis

Microsoft Decision Trees

Decision Trees are the most useful Microsoft data mining technique: they are easy to use, simple to interpret, and they work fast, even on very large data sets. In essence, a decision tree is just a tree of nodes. Each node represents a logical decision, which you can just think of as a choice of a value of one of your inputs that would make the most profound difference to the output that you wish to study. Once you try a decision tree a few times, you will realise how easy, and useful they are to help you understand any sets of data. This almost 2-hour, in-depth video by Rafal starts with an explanation of the three key uses of decision trees, which are: data classification, regression, and associative analysis, and then takes you on a comprehensive tour of this data mining algorithm, covering it in slides and detailed, hi-def demos. As this is a large module, make sure to use the “Jump to chapter” links, in the right-hand column of the page.

You can create a decision tree in several ways. It is simplest to start in Excel, using the Classify button on the Data Mining ribbon, as shown in the first demo, in which you can see how to classify customers in terms of their long-term loyalty to a retailer, as measured by the number of life-time purchases. It is, however, more convenient to use SQL Server Data Tools (SSDT) to work with your decision trees on an ongoing basis, especially if you plan to change parameters, or you want to experiment with different content types, for example changing from discrete to continuous data, and so on. Rafal shows you the just-introduced version of this tool, now based on the shell of Visual Studio 2012.

Microsoft Decision Trees behave as three related, but significantly different techniques. The simplest, flattened-data, that is a case-level decision tree, is the one that you might use most often. A more advanced form of the trees uses nested cases to perform associative analysis, which is similar in nature to the Association Rules algorithm. It is used to find relationships between case-level attributes and the values of the nested key, as well as relationships between those keys. This technique builds a forest of decision trees, one for each value of the nested key, and then looks for relationships between the nodes of the trees in that forest. For example, you could use this technique to analyse customers and their demographical information (case-level inputs) and the purchases made by those customers (nested cases), as is shown in the extensive demo.

The third form of the trees is known as Regressive Decision Trees and it is used to model continuos data, such as income, profit, or sales, as opposed to discrete, or discretised data—if you are not sure what those terms mean, follow our Data Mining Concepts and Tools tutorial. Regressive trees are based on the well-known statistical concept of regression analysis, which creates a formula to predict an outcome by means of a mathematical function of known, continuous inputs. There is, however, an additional benefit of using a regressive decision tree to a simple regression formula. A tree is capable of including discrete data in a clever way: instead of building one formula, the tree is actually a tree of regression formulas, where each node is formed like in a traditional decision tree, by means of making the best split in the tree, based on the input that provides the most information, or, in other words, that has the largest impact on the predictable outcome. This is, conceptually, related to splines. Our demo briefly shows how to test such a model, before using it, within Excel, to perform a live prediction (scoring) of profit potential for a set of prospective customers. Incidentally, the Microsoft Linear Regression algorithm is simply a Regressive Decision Tree without any children, that is with only one, top-level, root node!

To get the most from Microsoft Decision Trees, you can parametrise them. The COMPLEXITY_PENALTY parameter helps you build a bushier, often easier to understand tree, or a slender, deeper tree, that may be more accurate, yet harder to read, in some cases. SPLIT_METHOD makes it possible to build binary trees, where each node has exactly two children, or complete trees, where each node represents all possible (and meaningful) values. SCORE_METHOD is the most interesting, but perhaps the least useful parameter, as it entirely changes the tree building process by using a different formula for deciding when to make a split, that is when to create a new node, and how to select the most meaningful attribute (input column). There are three options that you can use: Entropy, Bayesian with K2 Prior, and Bayesian Dirichlet Equivalent with Uniform Prior (BDE). The entropy technique is the simplest, and it finds attributes that have the largest chance to make a difference to the output, but it disregards prior knowledge, already encoded in the higher levels of the tree, therefore it can be somewhat blind to what a person would consider an important hindsight. The remaining two methods use that knowledge, referred to in data mining as priors, but they do it in a slightly different way. K2 uses a constant value, while BDE creates a weighted support for each predictable state based on the level in the tree and node support. Our video also explains the remaining parameters, which are more generic in nature: MAXIMUM_INPUT_ATTRIBUTES, MINIMUM_INPUT_ATTRIBUTES, and MINIMUM_SUPPORT.

Log in or purchase access below to the premium version of this content.

Purchase This Course or Full Access Subscription

Single Course


Access this course for its lifetime*.
Subscription Best Value


Access all content on this site for 1 year.
Group Purchase

from $480/year

For small business & enterprise.
Group Purchase

  • Redeem a prepaid code
  • Payment is instant and you will receive a tax invoice straight away.
  • Your satisfaction is paramount: we offer a no-quibble refund guarantee.

* We guarantee each course to be available for at least 2 years from today, unless marked above with a date as a Course Retiring.


alberto.gastaldo · 18 December 2013

really cool video. Very clear and useful for me !

However I cannot understand when- in a dependency network - a relationship between nodes is mono or bi-directional.

Say that some people who bought a Bottle also bought bottle cage and some other did not (they just bought one of those products but not both).

Given this scenario, why node A predict node B and node B does not predict node A ? It looks like these two nodes should always have a bidirectional connection but this is not the case...
I am probably missing some concepts :-(

Thanks if you can clarify this point.

Alberto - Italy

Rafal Lukawiecki · 19 December 2013

Alberto, you have asked a very good question. A simple answer is that if we did not have an additional metric to consider, the relationship would always be bi-directional, if people buy both products—precisely as you have expected.

However, there is a metric of the strength of this relationship, which is based on the conditional probability of buying one product given the other has been purchased. This value depends on how many times one product has been sold with the other product vs. how many times they sold without the other, perhaps on their own. Since that is often asymmetrical (people may buy bottles on their own more often than together with a cage), the relationship becomes stronger in one direction.

The final clue to this puzzle are the labels on the slider shown in the visualisation on the left. As you move it up towards All Links, or down to Strongest Links, you are adding or removing arrows representing this relationship. It is possible to have one product point to another, and, as you move the slider up, the arrow suddenly turns bi-directional, meaning that the oppposite holds, but is weaker than the first relationship was.

You will see a very similar situation when you use the Association Rules algorithm, which will be explaineded in my next video. 

dknipple520 · 27 May 2015

I am confused at about 27:30, where you are demonstrating how to add a model to your structure and you are selecting the variables to be predicted. You mention that you select "Input and Predict", because whatever variable is being predicted will also be used as an input to predict itself. This does not make sense to me, as it seems there would be a 1:1 correlation between the input and predicted variables, and I don't understand how that would add value to the model. Could you clarify?

Rafal Lukawiecki · 27 May 2015

Deanne, thanks for asking. It is the underlying principle of the way data mining supervised learning algorithms work that they require to access the values of the outcome (also known as output, predictable, label, response variable etc) of every case (observation) in order to build a model. There is no way supervised learning could work without that. Clustering, on the other hand, is an example of unsupervised learning, that is algorithms that do not require a predictable outcome while building a model.

There is no risk of the algorithm trying to build a correlation of the same variable with itself. It is looking for a relationship between other inputs and the output variable in question. However, if you selected several outputs, variables marked as Predict will also be used as inputs for predicting other predictable outputs, while those marked as PredictOnly will not be used as inputs to predict other predictable outputs.

goran_banjac · 18 June 2015

Hi Rafa,
Great material.
I have a question. Your first two models are "Number Of Purchases" and "Care Sales Flat". In first model you have single row per customer and you are predicting possible purchase. Second model is flat table but our are predicting model, type,...
My question is how valid and valuable would be to predict number of purchase using flat table model? I'm trying to create DT model that predict patient return to hospital, and if I don't sum number of diagnosis in order to retain single row per patient, could I have good model if I flatten out table and have as many patient rows as patient has diagnosis, and simple mark patient as 1 if he was previously returned and 0 if not?
I'd like to accomplish the model, where I can see what diagnosis had impact on patient return versus number of diagnosis. Thx!

Rafal Lukawiecki · 18 June 2015

Thanks for asking, Goran. This is a very good question! If you want to model (and predict) the number of purchases, then the first model shown in the demo (Number of Purchases) is the one to use, because the predictable outcome is that number. The “flat” model does something different: it predicts the brand of the car, not the number of purchases. The nested DT would be able to also show you the connections (associations) between the brands and other inputs.

In your case, however, for predicting the risk of readmission, you could take three approaches. First of all, you could have multiple rows per patient, with each diagnosis on a different row, plus the readmission flag as output. Secondly, you could aggregate all the diagnoses as one number, and so have all in one row, with the readmission flag as the outcome, as before. Finally, you could create a flat row with separate columns for each type of a diagnosis, which could be a boolean flag, or an aggregate: a count of how many times this diagnosis the patient had—and the readmission flag, again, as the outcome (label). All would work and would give you a different model, and without knowing your data and patterns I cannot tell which would be the most accurate and reliable, so you should try them all.

Having said that, I feel that the third approach would let you build several classifiers using different algorithms most easily, perhaps you could also try that with Azure ML and a boosted decision tree or some of the other classifiers, notably an SVM and a Bayes Points Machine. On the other hand, the first approach could be the most reliable, however, it would have the least explanatory power, since information about the diagnosis was lost, and I suspect it would not be too accurate. The second approach, which you have suggested, would be roughly equivalent to the 3rd one (with a boolean flag) because of the way SSAS DM works, but the one with the count (not flag) of the diagnosis-per-patient would be different, possibly more accurate. Contact me, please, if there is any way I could help you take this further, there are other ways to do it, too. And thanks for asking!

Online Courses