Decision Trees are the most useful Microsoft data mining technique: they are easy to use, simple to interpret, and fast, even on very large data sets. In essence, a decision tree is just a tree of nodes. Each node represents a logical decision, which you can think of as a choice of the value of one of your inputs that makes the most profound difference to the output you wish to study. Once you try a decision tree a few times, you will realise how easy and useful they are for understanding any set of data. This almost 2-hour, in-depth video by Rafal starts with an explanation of the three key uses of decision trees, which are: data classification, regression, and associative analysis, and then takes you on a comprehensive tour of this data mining algorithm, covering it in slides and detailed, hi-def demos. As this is a large module, make sure to use the “Jump to chapter” links in the right-hand column of the page.

You can create a decision tree in several ways. It is simplest to start in Excel, using the Classify button on the Data Mining ribbon, as shown in the first demo, in which you can see how to classify customers in terms of their long-term loyalty to a retailer, as measured by the number of lifetime purchases. It is, however, more convenient to use SQL Server Data Tools (SSDT) to work with your decision trees on an ongoing basis, especially if you plan to change parameters, or you want to experiment with different content types, for example changing from discrete to continuous data, and so on. Rafal shows you the just-introduced version of this tool, now based on the shell of Visual Studio 2012.

Microsoft Decision Trees behave as three related, but significantly different techniques. The simplest, flattened-data (that is, case-level) decision tree is the one that you might use most often. A more advanced form of the trees uses nested cases to perform associative analysis, which is similar in nature to the Association Rules algorithm. It is used to find relationships between case-level attributes and the values of the nested key, as well as relationships between those keys. This technique builds a forest of decision trees, one for each value of the nested key, and then looks for relationships between the nodes of the trees in that forest. For example, you could use this technique to analyse customers and their demographic information (case-level inputs) and the purchases made by those customers (nested cases), as is shown in the extensive demo.
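Such relationships are directional because they are typically scored with conditional probabilities, which are asymmetric: P(B given A) is usually not the same as P(A given B). A minimal sketch of that idea in Python, with made-up baskets purely for illustration:

```python
# Hypothetical market baskets: each set is one customer's purchases.
baskets = [
    {"bottle"}, {"bottle"}, {"bottle"},
    {"bottle", "cage"},
    {"bottle", "cage"},
    {"cage"},
]

def conditional(a, b, baskets):
    """P(b in basket | a in basket)."""
    with_a = [bk for bk in baskets if a in bk]
    return sum(b in bk for bk in with_a) / len(with_a)

# Bottles sell alone more often than cages do, so the two
# directions of the relationship have different strengths.
p_cage_given_bottle = conditional("bottle", "cage", baskets)  # 2/5 = 0.4
p_bottle_given_cage = conditional("cage", "bottle", baskets)  # 2/3 ≈ 0.67
```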

The third form of the trees is known as Regressive Decision Trees, and it is used to model continuous data, such as income, profit, or sales, as opposed to discrete, or discretised, data. If you are not sure what those terms mean, follow our Data Mining Concepts and Tools tutorial. Regressive trees are based on the well-known statistical concept of regression analysis, which creates a formula to predict an outcome by means of a mathematical function of known, continuous inputs. There is, however, an additional benefit to using a regressive decision tree over a simple regression formula. A tree is capable of including discrete data in a clever way: instead of building one formula, the tree is actually a tree of regression formulas, where each node is formed as in a traditional decision tree, by making the best split based on the input that provides the most information, or, in other words, that has the largest impact on the predictable outcome. This is, conceptually, related to splines. Our demo briefly shows how to test such a model, before using it, within Excel, to perform a live prediction (scoring) of profit potential for a set of prospective customers. Incidentally, the Microsoft Linear Regression algorithm is simply a Regressive Decision Tree without any children, that is, with only one, top-level, root node!
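The idea of a tree of regression formulas can be sketched in a few lines of Python. This is not the SSAS implementation: the single discrete split (a hypothetical Region column) and the numbers are made up for illustration only. Each leaf carries its own linear formula, fitted to just the cases that reach it:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical cases: (region, income, profit).
cases = [
    ("North", 30, 3.0), ("North", 50, 5.0), ("North", 70, 7.0),
    ("South", 30, 9.0), ("South", 50, 15.0), ("South", 70, 21.0),
]

# One split on the discrete input, then one regression formula per leaf.
leaves = {}
for region in {"North", "South"}:
    xs = [inc for r, inc, _ in cases if r == region]
    ys = [p for r, _, p in cases if r == region]
    leaves[region] = fit_line(xs, ys)

def predict(region, income):
    slope, intercept = leaves[region]
    return slope * income + intercept
```

A root-only version of this, with a single formula and no split, corresponds to the plain linear regression case mentioned above.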

To get the most from Microsoft Decision Trees, you can parametrise them. The COMPLEXITY_PENALTY parameter helps you build either a bushier tree, which is often easier to understand, or a slender, deeper tree, which may be more accurate, yet harder to read, in some cases. SPLIT_METHOD makes it possible to build binary trees, where each node has exactly two children, or complete trees, where each node represents all possible (and meaningful) values. SCORE_METHOD is the most interesting, but perhaps the least useful, parameter, as it entirely changes the tree building process by using a different formula for deciding when to make a split, that is, when to create a new node, and how to select the most meaningful attribute (input column). There are three options that you can use: Entropy, Bayesian with K2 Prior, and Bayesian Dirichlet Equivalent with Uniform Prior (BDE). The entropy technique is the simplest: it finds attributes that have the largest chance of making a difference to the output, but it disregards prior knowledge already encoded in the higher levels of the tree, and can therefore be somewhat blind to what a person would consider an important insight. The remaining two methods use that knowledge, referred to in data mining as priors, but they do it in slightly different ways. K2 uses a constant value, while BDE creates a weighted support for each predictable state based on the level in the tree and node support. Our video also explains the remaining parameters, which are more generic in nature: MAXIMUM_INPUT_ATTRIBUTES, MINIMUM_INPUT_ATTRIBUTES, and MINIMUM_SUPPORT.
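The intuition behind the Entropy option can be illustrated with a generic information-gain calculation. This sketch is not the product's internal formula, and the cases below are made up purely for illustration: the attribute whose split removes the most uncertainty about the outcome wins.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(cases, attr_index, labels):
    """Entropy reduction from splitting the cases on one attribute."""
    base = entropy(labels)
    groups = {}
    for row, lab in zip(cases, labels):
        groups.setdefault(row[attr_index], []).append(lab)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - remainder

# Made-up cases: (OwnsHome, HasChildren) -> loyalty outcome.
cases  = [("Yes", "No"), ("Yes", "Yes"), ("No", "No"), ("No", "Yes")]
labels = ["Loyal", "Loyal", "Churn", "Churn"]

# OwnsHome separates the classes perfectly here, so it scores higher
# and would be chosen for the split.
gains = [information_gain(cases, i, labels) for i in (0, 1)]
```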


- Introduction (00:08)
- Purposes of Decision Trees (02:05)
- Demo: Creating a Simple, Flat, Discrete Decision Tree in Excel (06:11)
- Demo: Building a Mining Structure and a Model in Excel (23:26)
- Building a Decision Tree using SQL Server Data Tools (29:22)
- Associative Analysis Using Nested Cases and Decision Trees (39:01)
- Demo: Associative, Nested, Decision Tree vs Flat Data Trees (45:17)
- Regressive Decision Trees for Continuous Data (67:44)
- Demo: Profitability Modelling using Regressive Decision Trees (73:38)
- Demo: Using a Regressive Decision Tree Model to Predict Profit in Excel (81:38)
- Parametrising Decision Trees COMPLEXITY_PENALTY (88:01)
- SPLIT_METHOD Parameter (90:48)
- Demo: Controlling Tree Growth with Algorithm Parametrisation (92:25)
- SCORE_METHOD Parameter (102:04)
- MAXIMUM_INPUT_ATTRIBUTES, MINIMUM_INPUT_ATTRIBUTES, and MINIMUM_SUPPORT Parameters (107:53)
- Summary (112:58)


## Comments

alberto.gastaldo · 18 December 2013

Hi,

really cool video. Very clear and useful for me!

However, I cannot understand when, in a dependency network, a relationship between nodes is mono- or bi-directional.

Say that some people who bought a Bottle also bought bottle cage and some other did not (they just bought one of those products but not both).

Given this scenario, why does node A predict node B while node B does not predict node A? It looks like these two nodes should always have a bidirectional connection, but this is not the case...

I am probably missing some concepts :-(

Thanks if you can clarify this point.

Alberto - Italy

Rafal Lukawiecki · 19 December 2013

Alberto, you have asked a very good question. A simple answer is that if we did not have an additional metric to consider, the relationship would always be bi-directional whenever people buy both products, precisely as you have expected. However, there is a metric of the strength of this relationship, which is based on the conditional probability of buying one product given that the other has been purchased. This value depends on how many times one product has been sold with the other product vs. how many times they sold without the other, perhaps on their own. Since that is often asymmetrical (people may buy bottles on their own more often than together with a cage), the relationship becomes stronger in one direction.

The final clue to this puzzle are the labels on the slider shown in the visualisation on the left. As you move it up towards All Links, or down to Strongest Links, you are adding or removing arrows representing this relationship. It is possible to have one product point to another, and, as you move the slider up, the arrow suddenly turns bi-directional, meaning that the opposite relationship also holds, but is weaker than the first one. You will see a very similar situation when you use the Association Rules algorithm, which will be explained in my next video.

dknipple520 · 27 May 2015

I am confused at about 27:30, where you are demonstrating how to add a model to your structure and you are selecting the variables to be predicted. You mention that you select "Input and Predict", because whatever variable is being predicted will also be used as an input to predict itself. This does not make sense to me, as it seems there would be a 1:1 correlation between the input and predicted variables, and I don't understand how that would add value to the model. Could you clarify?

Rafal Lukawiecki · 27 May 2015

Deanne, thanks for asking. It is the underlying principle of the way data mining supervised learning algorithms work that they require access to the values of the outcome (also known as the output, predictable, label, response variable, etc.) of every case (observation) in order to build a model. There is no way supervised learning could work without that. Clustering, on the other hand, is an example of unsupervised learning, that is, of algorithms that do not require a predictable outcome while building a model.

There is no risk of the algorithm trying to build a correlation of the same variable with itself. It is looking for a relationship between the other inputs and the output variable in question. However, if you selected several outputs, variables marked as Predict will also be used as inputs for predicting other predictable outputs, while those marked as Predict Only will not be used as inputs to predict other predictable outputs.

goran_banjac · 18 June 2015

Hi Rafa,

Great material.

I have a question. Your first two models are "Number Of Purchases" and "Car Sales Flat". In the first model you have a single row per customer and you are predicting a possible purchase. The second model is a flat table, but you are predicting model, type,...

My question is: how valid and valuable would it be to predict the number of purchases using a flat table model? I'm trying to create a DT model that predicts patient return to hospital. If I don't sum the number of diagnoses in order to retain a single row per patient, could I have a good model if I flatten out the table, have as many rows per patient as the patient has diagnoses, and simply mark the patient as 1 if he previously returned and 0 if not?

I'd like to build a model where I can see which diagnoses had an impact on patient return, versus the number of diagnoses. Thx!

Rafal Lukawiecki · 18 June 2015

Thanks for asking, Goran. This is a very good question! If you want to model (and predict) the number of purchases, then the first model shown in the demo (Number of Purchases) is the one to use, because the predictable outcome is that number. The “flat” model does something different: it predicts the brand of the car, not the number of purchases. The nested DT would be able to also show you the connections (associations) between the brands and other inputs.

In your case, however, for predicting the risk of readmission, you could take three approaches. First of all, you could have multiple rows per patient, with each diagnosis on a different row, plus the readmission flag as the output. Secondly, you could aggregate all the diagnoses as one number, and so have everything in one row, with the readmission flag as the outcome, as before. Finally, you could create a flat row with separate columns for each type of diagnosis, which could be a boolean flag, or an aggregate: a count of how many times the patient had this diagnosis, with the readmission flag, again, as the outcome (label). All would work and would give you a different model, and without knowing your data and patterns I cannot tell which would be the most accurate and reliable, so you should try them all.

Having said that, I feel that the third approach would let you build several classifiers using different algorithms most easily; perhaps you could also try that with Azure ML and a boosted decision tree or some of the other classifiers, notably an SVM and a Bayes Point Machine. On the other hand, the first approach could be the most reliable; however, it would have the least explanatory power, since information about the diagnosis was lost, and I suspect it would not be too accurate. The second approach, which you have suggested, would be roughly equivalent to the third one (with a boolean flag) because of the way SSAS DM works, but the one with the count (not flag) of diagnoses per patient would be different, possibly more accurate. Contact me, please, if there is any way I could help you take this further; there are other ways to do it, too. And thanks for asking!
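The third approach mentioned in the reply above, a flat row per patient with a count column per diagnosis plus the readmission label, can be sketched in Python. The patient rows below are made up purely for illustration; in practice this reshaping would happen in your source query or ETL:

```python
from collections import Counter

# Hypothetical multi-row data: (patient_id, diagnosis, readmitted_flag).
rows = [
    (1, "diabetes", 1), (1, "hypertension", 1), (1, "diabetes", 1),
    (2, "asthma", 0),
    (3, "hypertension", 0), (3, "asthma", 0),
]

diagnoses = sorted({d for _, d, _ in rows})  # stable column order
patients = {}
for pid, diag, readm in rows:
    rec = patients.setdefault(pid, {"counts": Counter(), "readmitted": readm})
    rec["counts"][diag] += 1

# One flat row per patient: a count per diagnosis, then the label.
flat = [
    [rec["counts"].get(d, 0) for d in diagnoses] + [rec["readmitted"]]
    for pid, rec in sorted(patients.items())
]
```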