Decision Trees in Depth

Jump to a chapter

Introduction (00:08)
Purposes of Decision Trees (02:05)
Demo: Creating a Simple, Flat, Discrete Decision Tree in Excel (06:11)
Demo: Building a Mining Structure and a Model in Excel (23:26)
Building a Decision Tree using SQL Server Data Tools (29:22)
Associative Analysis Using Nested Cases and Decision Trees (39:01)
Demo: Associative, Nested, Decision Tree vs Flat Data Trees (45:17)
Regressive Decision Trees for Continuous Data (67:44)
Demo: Profitability Modelling using Regressive Decision Trees (73:38)
Demo: Using a Regressive Decision Tree Model to Predict Profit in Excel (81:38)
Parametrising Decision Trees COMPLEXITY_PENALTY (88:01)
SPLIT_METHOD Parameter (90:48)
Demo: Controlling Tree Growth with Algorithm Parametrisation (92:25)
SCORE_METHOD Parameter (102:04)
MAXIMUM_INPUT_ATTRIBIUTES, MINIMUM_INPUT_ATTRIBUTES, and MINIMUM_SUPPORT Parameters (107:53)
Summary (112:58)

Decision Trees are the most useful Microsoft data mining technique: they are easy to use, simple to interpret, and they work fast, even on very large data sets. In essence, a decision tree is just a tree of nodes. Each node represents a logical decision, which you can just think of as a choice of a value of one of your inputs that would make the most profound diﬀerence to the output that you wish to study. Once you try a decision tree a few times, you will realise how easy, and useful they are to help you understand any sets of data. This almost 2-hour, in-depth video by Rafal starts with an explanation of the three key uses of decision trees, which are: data classiﬁcation, regression, and associative analysis, and then takes you on a comprehensive tour of this data mining algorithm, covering it in slides and detailed, hi-def demos. As this is a large module, make sure to use the “Jump to chapter” links, in the right-hand column of the page.

You can create a decision tree in several ways. It is simplest to start in Excel, using the Classify button on the Data Mining ribbon, as shown in the ﬁrst demo, in which you can see how to classify customers in terms of their long-term loyalty to a retailer, as measured by the number of life-time purchases. It is, however, more convenient to use SQL Server Data Tools (SSDT) to work with your decision trees on an ongoing basis, especially if you plan to change parameters, or you want to experiment with diﬀerent content types, for example changing from discrete to continuous data, and so on. Rafal shows you the just-introduced version of this tool, now based on the shell of Visual Studio 2012.

Microsoft Decision Trees behave as three related, but signiﬁcantly diﬀerent techniques. The simplest, flattened-data, that is a case-level decision tree, is the one that you might use most often. A more advanced form of the trees uses nested cases to perform associative analysis, which is similar in nature to the Association Rules algorithm. It is used to ﬁnd relationships between case-level attributes and the values of the nested key, as well as relationships between those keys. This technique builds a forest of decision trees, one for each value of the nested key, and then looks for relationships between the nodes of the trees in that forest. For example, you could use this technique to analyse customers and their demographical information (case-level inputs) and the purchases made by those customers (nested cases), as is shown in the extensive demo.

The third form of the trees is known as Regressive Decision Trees and it is used to model continuos data, such as income, proﬁt, or sales, as opposed to discrete, or discretised data—if you are not sure what those terms mean, follow our Data Mining Concepts and Tools tutorial. Regressive trees are based on the well-known statistical concept of regression analysis, which creates a formula to predict an outcome by means of a mathematical function of known, continuous inputs. There is, however, an additional beneﬁt of using a regressive decision tree to a simple regression formula. A tree is capable of including discrete data in a clever way: instead of building one formula, the tree is actually a tree of regression formulas, where each node is formed like in a traditional decision tree, by means of making the best split in the tree, based on the input that provides the most information, or, in other words, that has the largest impact on the predictable outcome. This is, conceptually, related to splines. Our demo briefly shows how to test such a model, before using it, within Excel, to perform a live prediction (scoring) of proﬁt potential for a set of prospective customers. Incidentally, the Microsoft Linear Regression algorithm is simply a Regressive Decision Tree without any children, that is with only one, top-level, root node!

To get the most from Microsoft Decision Trees, you can parametrise them. The COMPLEXITY_PENALTY parameter helps you build a bushier, often easier to understand tree, or a slender, deeper tree, that may be more accurate, yet harder to read, in some cases. SPLIT_METHOD makes it possible to build binary trees, where each node has exactly two children, or complete trees, where each node represents all possible (and meaningful) values. SCORE_METHOD is the most interesting, but perhaps the least useful parameter, as it entirely changes the tree building process by using a diﬀerent formula for deciding when to make a split, that is when to create a new node, and how to select the most meaningful attribute (input column). There are three options that you can use: Entropy, Bayesian with K2 Prior, and Bayesian Dirichlet Equivalent with Uniform Prior (BDE). The entropy technique is the simplest, and it ﬁnds attributes that have the largest chance to make a diﬀerence to the output, but it disregards prior knowledge, already encoded in the higher levels of the tree, therefore it can be somewhat blind to what a person would consider an important hindsight. The remaining two methods use that knowledge, referred to in data mining as priors, but they do it in a slightly diﬀerent way. K2 uses a constant value, while BDE creates a weighted support for each predictable state based on the level in the tree and node support. Our video also explains the remaining parameters, which are more generic in nature: MAXIMUM_INPUT_ATTRIBUTES, MINIMUM_INPUT_ATTRIBUTES, and MINIMUM_SUPPORT.

Log in or purchase access to play the video.

Data Mining with SQL Server SSAS

Introduction to Data Mining with Microsoft SQL Server 24-min Watch with Free Subscription
Data Mining Concepts and Tools 50-min
Data Mining Model Building, Testing and Predicting with Microsoft SQL Server and Excel 1-hour 20-min
What Are Decision Trees? 10-min Free—Watch Now
Decision Trees in Depth 1-hour 54-min
Why Cluster and Segment Data? 9-min Watch with Free Subscription
Clustering in Depth 1-hour 50-min
What is Market Basket Analysis? 10-min Watch with Free Subscription
Association Rules in Depth 1-hour 35-min
HappyCars Sample Data Set for Learning Data Mining
Additional Code and Data Samples (R, ML Services, SSAS) Get with Free Subscription

Purchase a Full Access Subscription

Individual Subscription

$480/year

Access all content on this site for 1 year.
Purchase

Group Purchase

from $480/year

For small business & enterprise.
Group Purchase

You can also redeem a prepaid code.
Payments are instant and you will receive a tax invoice straight away.
We oﬀer sales quotes/pro-forma invoices, and we accept purchase orders and bank transfers.
Your satisfaction is paramount: we oﬀer a no-quibble refund guarantee.
See pricing FAQ for more detail.

Decision Trees in Depth Purchase the entire course

Classification, tree and linear regression, and associative analysis

Jump to a chapter

Data Mining with SQL Server SSAS

Purchase a Full Access Subscription

Individual Subscription

Group Purchase

In collaboration with

Company

Courses

Resources

Help

Search form

Decision Trees in Depth Purchase the entire course

Classification, tree and linear regression, and associative analysis

Jump to a chapter

Data Mining with SQL Server SSAS

Purchase a Full Access Subscription

Individual Subscription

Group Purchase

Get the Newsletter

In collaboration with