Project Botticelli

Clustering in Depth

26 June 2013 · 4 comments · 3794 views

Segmentation, Outlier Detection, Categorisation & Exploration

Clustering: Cluster Profiles Diagram in SSDT

Microsoft Clustering is a workhorse algorithm of data mining. It is used to explore data, to segment and categorise it, and to detect outliers (also called exceptions or anomalies). Each cluster represents a naturally occurring grouping of your data points, grouped by the similarities expressed in their attributes. In this in-depth, 1-hour 50-minute video, Rafal explains clustering concepts, the entire process, and all of the algorithm parameters. The detailed 12-part demo, which forms the heart of this tutorial, shows you the iterative process of clustering, explaining how to segment your own data, such as customers or products. Before watching this video, make sure to review the preceding, short, free 10-minute video “Why Cluster and Segment Data?”

In essence, the key to understanding clustering is to imagine that all of your cases (i.e. data points, such as customers) are dots that exist in a vastly multidimensional space, where each dimension represents an attribute that you would like to study, such as age, gender, number of children, relationship to the head of household, or income. It is easy to imagine 3 dimensions: try it, or see the example in the video. However, a clustering algorithm can represent your data points in even hundreds or thousands of dimensions! If you have imagined your data points suspended somewhere in a 3D space, you should realise that they do not fill the entire space uniformly; rather, they group, or cluster, into natural constellations of points.
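To make that geometric intuition concrete, here is a minimal sketch in plain Python (not Microsoft's implementation; the customer names and attribute values are invented for illustration) showing cases as points in a multidimensional space, with Euclidean distance standing in for similarity:

```python
import math

# Each customer is a point: (age, number_of_children, income_in_thousands).
# These example values are invented for illustration.
customers = {
    "Anna":  (34, 2, 55),
    "Ben":   (36, 2, 60),
    "Carol": (67, 0, 120),
}

def distance(p, q):
    """Euclidean distance between two points in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Anna and Ben sit close together in this 3D space, while Carol is far from
# both, so a clustering algorithm would tend to group Anna with Ben.
print(distance(customers["Anna"], customers["Ben"]))    # small
print(distance(customers["Anna"], customers["Carol"]))  # much larger
```

The same distance function works unchanged in hundreds of dimensions; only our ability to visualise the space stops at three.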

When you have found your clusters, you need to name them—this process is iterative. You start by trying to understand the natural meaning of each of the clusters by using cluster visualisations: the high-level Cluster Diagram that shows similarities between clusters, the dashboard-style Cluster Profiles that shows everything-at-a-glance, the individual Cluster Characteristics, and the Cluster Discrimination diagram, also known as the Tornado Chart, which is particularly useful for comparing the meaning of one cluster against another one, or for comparing the meaning of one cluster against all the other clusters (also referred to as the cluster’s complement). The extensive demo of the clustering process begins with a simpler, but still powerful, approach to clustering using Microsoft Excel and the free Data Mining Add-ins. Then we continue on to each individual step of the process, using SSDT, SQL Server Data Tools.

To validate your clusters, or to simply use your clustering model, you can categorise a different data set. For example, you can take a set of customers that you did not use for building (training) your model, and apply your new model to predict the name of the cluster to which each such customer might belong. The demo shows how to use the Cluster() and ClusterProbability() prediction functions to categorise data in Excel using a model that has been built earlier in SSDT. The number of clusters that you wish to build, which you can specify as the value of the CLUSTER_COUNT parameter, has a great impact on the accuracy of your model—the demo shows how even a small change to that number significantly changes the model accuracy lift chart. If you are new to lift charts and mining model validation, make sure to review the relevant section of our “Data Mining Model Building, Testing and Predicting” video tutorial.
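The DMX Cluster() and ClusterProbability() functions run inside Analysis Services, but the idea behind them can be sketched in plain Python. Everything here (the centroids, cluster names, and the softmax-style probability) is an illustrative assumption, not Microsoft's actual computation:

```python
import math

# Hypothetical cluster centroids found during training: (age, income_in_thousands).
centroids = {
    "Young Savers":    (28.0, 40.0),
    "Mature Spenders": (55.0, 95.0),
}

def cluster(case):
    """Rough analogue of DMX Cluster(): the name of the closest cluster."""
    return min(centroids, key=lambda name: math.dist(case, centroids[name]))

def cluster_probability(case, name):
    """Rough analogue of ClusterProbability(): softmax over negative distances."""
    weights = {n: math.exp(-math.dist(case, c)) for n, c in centroids.items()}
    return weights[name] / sum(weights.values())

new_customer = (30.0, 45.0)   # a case that was not used for training
print(cluster(new_customer))  # the name of the closest cluster
print(cluster_probability(new_customer, cluster(new_customer)))
```

In the real model the assignment comes from the trained EM or K-Means structure rather than from raw centroid distances, but the shape of the answer is the same: a predicted cluster name plus a probability of membership.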

Microsoft SQL Server Data Mining comes with two different clustering techniques: EM (Expectation Maximization) and the widely known K-Means algorithm. You decide which one should be used by setting the value of the CLUSTERING_METHOD parameter, which takes four options (1–4), letting you also select between a scalable and a non-scalable approach. The default value, 1, stands for scalable EM, and is often the best choice, because it permits clusters to overlap each other, which is a realistic representation of many natural data sets. However, if you are dealing with cases that have very clear-cut differences, which are unlikely to overlap, you may want to use the K-Means technique (parameter values 3 and 4) instead of EM (values 1 and 2). You can use both techniques on continuous, discrete and discretized data—without even having to change your data set!—which is a convenient feature of Microsoft Clustering. You can further control the scalable approach with the SAMPLE_SIZE parameter, which you should set to as large a value as your memory and CPU permit, or, if you have a large enough computer, set to 0, which tries to load the entire data set into memory.
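The essential difference between the two techniques (EM's overlapping, probabilistic clusters versus K-Means' hard, all-or-nothing membership) can be sketched with a toy one-dimensional example. The cluster parameters below are invented, and this is a conceptual illustration only, not the SSAS internals:

```python
import math

# Two hypothetical 1-D clusters, each described by (mean, standard deviation).
clusters = {"A": (10.0, 3.0), "B": (16.0, 3.0)}

def gaussian_pdf(x, mean, sd):
    """Probability density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def em_memberships(x):
    """EM-style soft assignment: a probability for every cluster (clusters overlap)."""
    densities = {name: gaussian_pdf(x, m, s) for name, (m, s) in clusters.items()}
    total = sum(densities.values())
    return {name: d / total for name, d in densities.items()}

def kmeans_assignment(x):
    """K-Means-style hard assignment: exactly one cluster, by the nearest mean."""
    return min(clusters, key=lambda name: abs(x - clusters[name][0]))

# A point half-way between the two means: EM reports roughly 50/50 membership,
# while K-Means is forced to pick a single winner.
print(em_memberships(13.0))
print(kmeans_assignment(12.0))
```

This is why EM is the safer default for naturally blurry data such as customer behaviour, while K-Means suits cases with genuinely clear-cut boundaries.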

The remaining parameters of the clustering algorithm can be used to balance the time it takes for clustering to complete its model-training run against its accuracy. Because the algorithm is sensitive to the random starting values—seeded with the CLUSTER_SEED parameter—it normally builds not one but, by default, 10 different models, and selects the best. You can change the number of models being built by using the MODEL_CARDINALITY parameter. You decide how stable the model has to become before clustering stops working on it by changing the STOPPING_TOLERANCE parameter, whose default value of 10 cases means that up to 10 cases may remain “undecided” about their cluster membership between the internal iterations of the algorithm. If you have a small data set, you may want to reduce this number to 1. In any case, with today’s machines that come with muscular CPUs and plenty of memory, it is easy to look for more accurate models by reducing this parameter. The remaining parameters, which help you fine-tune the algorithm to your data set, include the unofficial NORMALIZATION parameter, which you may want to change for those less-typical data sets that happen to have a non-normal distribution (e.g. a uniform distribution), and the parameters, usual to SQL Server, that control how it selects significant attributes and their value states from your data. All of those are explained in detail in this tutorial: MAXIMUM_INPUT_ATTRIBUTES (a feature-selection parameter, default 255), MAXIMUM_STATES (default 100), and MINIMUM_SUPPORT (default 1), which is useful for pruning out clusters that would contain fewer cases than its value.
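The restart-and-pick-the-best idea behind CLUSTER_SEED and MODEL_CARDINALITY, and the role of a stopping tolerance, can be sketched with a toy one-dimensional K-Means. Everything here (the data, the candidate-model count, the convergence test) is an illustrative stand-in, not the SSAS implementation:

```python
import random

data = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]   # two obvious 1-D groups

def kmeans_1d(points, k, seed, tolerance=1e-6):
    """Toy 1-D K-Means: returns (total squared error, sorted centroids)."""
    rng = random.Random(seed)             # analogue of CLUSTER_SEED
    centroids = rng.sample(points, k)     # random starting values
    while True:
        # Assign every point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: (p - centroids[i]) ** 2)].append(p)
        # Move each centroid to the mean of its group.
        new = [sum(g) / len(g) if g else centroids[i] for i, g in enumerate(groups)]
        # Stop once the model is stable enough (analogue of STOPPING_TOLERANCE).
        if max(abs(a - b) for a, b in zip(new, centroids)) < tolerance:
            break
        centroids = new
    error = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return error, sorted(centroids)

# Build several candidate models from different seeds and keep the best,
# mimicking the default behaviour controlled by MODEL_CARDINALITY.
best_error, best_centroids = min(kmeans_1d(data, k=2, seed=s) for s in range(10))
print(best_error, best_centroids)
```

A tighter tolerance or more candidate models costs more training time but reduces the chance of settling on a poor local optimum, which is exactly the trade-off these parameters expose.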

Log in or purchase access to view the premium version of this content.


Comments

luchofarje · 11 June 2014

Hello Rafal, I'm trying to run the Excel sheet, but when I click on the "Cluster Data" button I get this error: an error occurred while parsing the x element at line 25 column 25 (urn:schemas-microsoft-com:xml-analysis:rowset namespace)...
I have fixed the connection to my server in the VB code, but I have no idea what this problem is.
I'm working against a 2012 multidimensional Enterprise installation.
Should I use a SQL Server 2008 installation instead, just for this demo?
Best regards,
Lucho Farje

Rafal Lukawiecki · 17 June 2014

Dear Lucho, you should not be getting that error, and there is no need to edit the VB code. If you do, it usually means you have some other issue, such as a mismatch of DLLs from a previous installation. Which version of Excel are you using? It should work well in Excel 2010 and 2013, and even in 2007, but not so well in the older ones. SQL Server 2012 Analysis Services multidimensional works well with the add-ins, but so do SQL Server 2008 and 2008 R2. Make sure the connection is selected using the "Connection" button and that it works when you press the Test button. Also, consider putting the whole set-up on one machine to avoid connectivity issues. Contact me for more help, and good luck! Rafal.

eng.ahmed.anwar · 10 August 2015

Hi Rafal,

I've created a .Net application to imitate the Exception Highlighting button functionality as in Excel. My application uses the PredictCaseLikelihood function in a singleton query. The function works well for predicting a case. However, what I really don't understand is how Excel highlights a column in yellow. How does Excel know that a particular column in a case is the reason it is an outlier? How can I implement this in my application? BTW, I ran SQL Server Profiler to find out how Excel does this, but still can't figure it out.

Rafal Lukawiecki · 8 September 2015

Dear Ahmed,

You are on the right track: using PredictCaseLikelihood will help you find out which rows are sufficiently unlikely to be considered outliers. However, to mimic what Excel does when it highlights a column in a row in a brighter yellow, you need to calculate something like “column likelihood” for every column in that row, and then highlight the least likely one. The easiest way to do it is to calculate the probability of the given value of a column (such as “Female”) and divide it by the likelihood of the most likely value of the column in that row. This gets you a ratio, for each column, and the one with the smallest value is the least “likely” and gets highlighted in bright yellow. For continuous data, use variances of the predictions. The DMX code is something like:

PredictProbability(MyModelName.Gender, 'Female') / PredictProbability(Gender)
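
Outside of DMX, the same per-column ratio idea looks something like the following plain-Python sketch, where the probability tables are invented stand-ins for what the model would actually return:

```python
# Hypothetical per-column probability tables for one row of data, as a model
# might report them; the values are invented for illustration.
column_probabilities = {
    "Gender":     {"Female": 0.48, "Male": 0.52},
    "Occupation": {"Clerical": 0.05, "Professional": 0.70, "Manual": 0.25},
}
row = {"Gender": "Female", "Occupation": "Clerical"}

def column_likelihood_ratio(column, value):
    """P(actual value) / P(most likely value) for one column of the row."""
    probs = column_probabilities[column]
    return probs[value] / max(probs.values())

# The column with the smallest ratio is the least "likely" one, and is the
# one Excel would highlight in bright yellow.
ratios = {c: column_likelihood_ratio(c, v) for c, v in row.items()}
print(min(ratios, key=ratios.get))
```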

For a working, live example, with source code, may I suggest you have a look here.
