Project Botticelli

Clustering in Depth

26 June 2013 · 4 comments · 3794 views

Segmentation, Outlier Detection, Categorisation & Exploration

Clustering: Cluster Profiles Diagram in SSDT

Microsoft Clustering is a workhorse algorithm of data mining. It is used to explore data, to segment and categorise it, and to detect outliers (also called exceptions or anomalies). Each cluster represents a naturally occurring grouping of your data points, grouped by the similarities expressed in their attributes. In this in-depth, 1-hour 50-minute video, Rafal explains clustering concepts, the entire process, and all of the algorithm parameters. The detailed 12-part demo, which forms the heart of this tutorial, shows you the iterative process of clustering, explaining how to segment your own data, such as customers or products. Before watching this video, make sure to review the preceding, short, free 10-minute video “Why Cluster and Segment Data?”

In essence, the key to understanding clustering is to imagine that all of your cases (i.e. data points, such as customers) are dots that exist in a vastly multidimensional space, where each dimension represents an attribute that you would like to study, such as age, gender, number of children, relationship to the head of household, or income. It is easy to imagine 3 dimensions: try it, or see the example in the video. However, a clustering algorithm can represent your data points in even hundreds or thousands of dimensions! If you have imagined your data points suspended somewhere in a 3D space, you should realise that they do not fill the entire space uniformly; rather, they group, or cluster, into natural constellations of points.
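To make that geometric intuition concrete, here is a minimal sketch in plain Python (not Microsoft's implementation; the customer names and attribute values are invented for illustration) showing cases as points in a multidimensional space, with Euclidean distance standing in for similarity:

```python
import math

# Each customer is a point: (age, number_of_children, income_in_thousands).
# These example values are invented for illustration.
customers = {
    "Anna":  (34, 2, 55),
    "Ben":   (36, 2, 60),
    "Carol": (67, 0, 120),
}

def distance(p, q):
    """Euclidean distance between two points in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Anna and Ben sit close together in this 3D space, while Carol is far from
# both, so a clustering algorithm would tend to group Anna with Ben.
print(distance(customers["Anna"], customers["Ben"]))    # small
print(distance(customers["Anna"], customers["Carol"]))  # much larger
```

The same distance function works unchanged in hundreds of dimensions; only our ability to visualise the space stops at three.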

When you have found your clusters, you need to name them—this process is iterative. You start by trying to understand the natural meaning of each of the clusters by using cluster visualisations: the high-level Cluster Diagram that shows similarities between clusters, the dashboard-style Cluster Profiles that shows everything-at-a-glance, the individual Cluster Characteristics, and the Cluster Discrimination diagram, also known as the Tornado Chart, which is particularly useful for comparing the meaning of one cluster against another one, or for comparing the meaning of one cluster against all the other clusters (also referred to as the cluster’s complement). The extensive demo of the clustering process begins with a simpler, but still powerful, approach to clustering using Microsoft Excel and the free Data Mining Add-ins. Then we continue on to each individual step of the process, using SSDT, SQL Server Data Tools.

To validate your clusters, or to simply use your clustering model, you can categorise a different data set. For example, you can take a set of customers that you did not use for building (training) your model, and apply your new model to predict the name of the cluster to which each such customer might belong. The demo shows how to use the Cluster() and ClusterProbability() prediction functions to categorise data in Excel using a model that has been built earlier in SSDT. The number of clusters that you wish to build, which you can specify as the value of the CLUSTER_COUNT parameter, has a great impact on the accuracy of your model—the demo shows how even a small change to that number significantly changes the model accuracy lift chart. If you are new to lift charts and mining model validation, make sure to review the relevant section of our “Data Mining Model Building, Testing and Predicting” video tutorial.
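The DMX Cluster() and ClusterProbability() functions run inside Analysis Services, but the idea behind them can be sketched in plain Python. Everything here (the centroids, cluster names, and the softmax-style probability) is an illustrative assumption, not Microsoft's actual computation:

```python
import math

# Hypothetical cluster centroids found during training: (age, income_in_thousands).
centroids = {
    "Young Savers":    (28.0, 40.0),
    "Mature Spenders": (55.0, 95.0),
}

def cluster(case):
    """Rough analogue of DMX Cluster(): the name of the closest cluster."""
    return min(centroids, key=lambda name: math.dist(case, centroids[name]))

def cluster_probability(case, name):
    """Rough analogue of ClusterProbability(): softmax over negative distances."""
    weights = {n: math.exp(-math.dist(case, c)) for n, c in centroids.items()}
    return weights[name] / sum(weights.values())

new_customer = (30.0, 45.0)   # a case that was not used for training
print(cluster(new_customer))  # the name of the closest cluster
print(cluster_probability(new_customer, cluster(new_customer)))
```

In the real model the assignment comes from the trained EM or K-Means structure rather than from raw centroid distances, but the shape of the answer is the same: a predicted cluster name plus a probability of membership.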

Microsoft SQL Server Data Mining comes with two different clustering techniques: EM (Expectation Maximization) and the widely known K-Means algorithm. You decide which one should be used by setting the value of the CLUSTERING_METHOD parameter, which takes four options (1–4), letting you also select between a scalable and a non-scalable approach. The default value, 1, stands for scalable EM, and is often the best choice, because it permits clusters to overlap each other, which is a realistic representation of many natural data sets. However, if you are dealing with cases that have very clear-cut differences, which are unlikely to overlap, you may want to use the K-Means technique (parameter values 3 and 4) instead of EM (values 1 and 2). You can use both techniques on continuous, discrete and discretized data—without even having to change your data set!—which is a convenient feature of Microsoft Clustering. You can further control the scalable approach with the SAMPLE_SIZE parameter, which you should set to as large a value as your memory and CPU permit, or, if you have a large enough computer, set to 0, which tries to load the entire data set into memory.
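The essential difference between the two techniques (EM's overlapping, probabilistic clusters versus K-Means' hard, all-or-nothing membership) can be sketched with a toy one-dimensional example. The cluster parameters below are invented, and this is a conceptual illustration only, not the SSAS internals:

```python
import math

# Two hypothetical 1-D clusters, each described by (mean, standard deviation).
clusters = {"A": (10.0, 3.0), "B": (16.0, 3.0)}

def gaussian_pdf(x, mean, sd):
    """Probability density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def em_memberships(x):
    """EM-style soft assignment: a probability for every cluster (clusters overlap)."""
    densities = {name: gaussian_pdf(x, m, s) for name, (m, s) in clusters.items()}
    total = sum(densities.values())
    return {name: d / total for name, d in densities.items()}

def kmeans_assignment(x):
    """K-Means-style hard assignment: exactly one cluster, by the nearest mean."""
    return min(clusters, key=lambda name: abs(x - clusters[name][0]))

# A point half-way between the two means: EM reports roughly 50/50 membership,
# while K-Means is forced to pick a single winner.
print(em_memberships(13.0))
print(kmeans_assignment(12.0))
```

This is why EM is the safer default for naturally blurry data such as customer behaviour, while K-Means suits cases with genuinely clear-cut boundaries.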

The remaining parameters of the clustering algorithm can be used to balance the time it takes for clustering to complete its model-training run against its accuracy. Because the algorithm is sensitive to the random starting values—seeded with the CLUSTER_SEED parameter—it normally builds not one but, by default, 10 different models, and selects the best. You can change the number of models being built by using the MODEL_CARDINALITY parameter. You decide how stable the model has to become before clustering stops working on it by changing the STOPPING_TOLERANCE parameter, whose default value of 10 cases means that up to 10 cases may remain “undecided” about their cluster membership between the internal iterations of the algorithm. If you have a small data set, you may want to reduce this number to 1. In any case, with today’s machines that come with muscular CPUs and plenty of memory, it is easy to look for more accurate models by reducing this parameter. The remaining parameters, which help you fine-tune the algorithm to your data set, include the unofficial NORMALIZATION parameter, which you may want to change for those less-typical data sets that happen to have a non-normal distribution (e.g. a uniform distribution), and the parameters, usual to SQL Server, that control how it selects significant attributes and their value states from your data. All of those are explained in detail in this tutorial: MAXIMUM_INPUT_ATTRIBUTES (a feature-selection parameter, default 255), MAXIMUM_STATES (default 100), and MINIMUM_SUPPORT (default 1), which is useful for pruning out clusters that would contain fewer cases than its value.
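The restart-and-pick-the-best idea behind CLUSTER_SEED and MODEL_CARDINALITY, and the role of a stopping tolerance, can be sketched with a toy one-dimensional K-Means. Everything here (the data, the candidate-model count, the convergence test) is an illustrative stand-in, not the SSAS implementation:

```python
import random

data = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]   # two obvious 1-D groups

def kmeans_1d(points, k, seed, tolerance=1e-6):
    """Toy 1-D K-Means: returns (total squared error, sorted centroids)."""
    rng = random.Random(seed)             # analogue of CLUSTER_SEED
    centroids = rng.sample(points, k)     # random starting values
    while True:
        # Assign every point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: (p - centroids[i]) ** 2)].append(p)
        # Move each centroid to the mean of its group.
        new = [sum(g) / len(g) if g else centroids[i] for i, g in enumerate(groups)]
        # Stop once the model is stable enough (analogue of STOPPING_TOLERANCE).
        if max(abs(a - b) for a, b in zip(new, centroids)) < tolerance:
            break
        centroids = new
    error = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return error, sorted(centroids)

# Build several candidate models from different seeds and keep the best,
# mimicking the default behaviour controlled by MODEL_CARDINALITY.
best_error, best_centroids = min(kmeans_1d(data, k=2, seed=s) for s in range(10))
print(best_error, best_centroids)
```

A tighter tolerance or more candidate models costs more training time but reduces the chance of settling on a poor local optimum, which is exactly the trade-off these parameters expose.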

Log in or purchase access to view the premium version of this content.


Comments

luchofarje · 11 June 2014

Hello Rafal, I'm trying to run the Excel sheet, but when I click on the "Cluster Data" button I get this error: an error occurred while parsing the x element at line 25 column 25 (urn:schemas-microsoft-com:xml-analysis:rowset namespace)...
I have fixed the connection to my server in the VB code, but I have no idea what this problem is.
I'm working against a 2012 multidimensional Enterprise installation.
Should I use a SQL Server 2008 installation instead, just for this demo?
Best regards,
Lucho Farje

Rafal Lukawiecki · 17 June 2014

Dear Lucho, you should not be getting that error, and there is no need to edit the VB code. If you do, it usually means you have some other issue, such as a mismatch of DLLs from a previous installation. Which version of Excel are you using? It should work well in Excel 2010 and 2013, and even in 2007, but not so well in the older ones. SQL Server 2012 Analysis Services multidimensional works well with the add-ins, but so do SQL Server 2008 and 2008 R2. Make sure the connection is selected using the "Connection" button and that it works when you press the Test button. Also, consider putting the whole set-up on one machine to avoid connectivity issues. Contact me for more help, and good luck! Rafal.

eng.ahmed.anwar · 10 August 2015

Hi Rafal,

I've created a .Net application to imitate the Exception Highlighting button functionality as in Excel. My application uses the PredictCaseLikelihood function in a singleton query. The function works well for predicting a case. However, what I really don't understand is how Excel highlights a column in yellow. How does Excel know that a particular column in a case is the reason it is an outlier? How can I implement this in my application? BTW, I ran SQL Server Profiler to find out how Excel does this, but still can't figure it out.

Rafal Lukawiecki · 8 September 2015

Dear Ahmed,

You are on the right track: using PredictCaseLikelihood will help you find out which rows are sufficiently unlikely to be considered outliers. However, to mimic what Excel does when it highlights a column in a row in a brighter yellow, you need to calculate something like “column likelihood” for every column in that row, and then highlight the least likely one. The easiest way to do it is to calculate the probability of the given value of a column (such as “Female”) and divide it by the likelihood of the most likely value of the column in that row. This gets you a ratio, for each column, and the one with the smallest value is the least “likely” and gets highlighted in bright yellow. For continuous data, use variances of the predictions. The DMX code is something like:

PredictProbability(MyModelName.Gender, 'Female') / PredictProbability(Gender)
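
Outside of DMX, the same per-column ratio idea looks something like the following plain-Python sketch, where the probability tables are invented stand-ins for what the model would actually return:

```python
# Hypothetical per-column probability tables for one row of data, as a model
# might report them; the values are invented for illustration.
column_probabilities = {
    "Gender":     {"Female": 0.48, "Male": 0.52},
    "Occupation": {"Clerical": 0.05, "Professional": 0.70, "Manual": 0.25},
}
row = {"Gender": "Female", "Occupation": "Clerical"}

def column_likelihood_ratio(column, value):
    """P(actual value) / P(most likely value) for one column of the row."""
    probs = column_probabilities[column]
    return probs[value] / max(probs.values())

# The column with the smallest ratio is the least "likely" one, and is the
# one Excel would highlight in bright yellow.
ratios = {c: column_likelihood_ratio(c, v) for c, v in row.items()}
print(min(ratios, key=ratios.get))
```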

For a working, live example, with source code, may I suggest you have a look here.
