Project Botticelli

New BI Content (June 2013)

Clustering: Segment, Categorise, Find Outliers

When you registered, you asked to be notified about new content:

Why Cluster and Segment Data?

Rafal discusses why non-traditional data segmentation can be very valuable

Clustering is a popular data mining technique, often used for segmentation. In this short, 9-minute video, I introduce these concepts, focusing on why it is interesting to use clustering to find non-traditional segments. In the demo, you will see a clustering model, and we will use it to categorise new data in Excel. If you have ever wondered what's so great about building your own segments (the clue is in the screenshot above), this video is for you.

Clustering in Depth

Cluster Profiles: Rafal shows a dashboard-like way to analyse your clusters

After watching the earlier intro to segmentation, dive into this in-depth tutorial. Microsoft Clustering is a workhorse algorithm of data mining. In addition to segmentation and categorisation, it is also used to explore data, and to detect outliers, exceptions, or anomalies. Each cluster represents a naturally occurring group of your data points, brought together by their similarities, as described by their various attributes. In this in-depth, 1-hour 50-minute video, I explain clustering concepts, the entire process, and all of the algorithm parameters, including the all-important CLUSTERING_METHOD, which lets you choose the EM (Expectation Maximization) or the K-Means underlying algorithm. Indeed, you have not just one, but two (or even four!) clustering techniques under a single, easy-to-use wrapper. Clustering is a bit different from other mining techniques, because once you find the clusters, they are of little use until you have understood them well enough to give them meaningful names. The detailed 12-part demo, which forms the heart of my tutorial, shows you the iterative process of doing just that, explaining how to segment your own data, such as customers or products. If you are into advanced analytics, you need to know clustering, and I promise I make it easy to follow.
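To make the idea concrete, here is a minimal sketch of plain K-Means in Python. It is not Microsoft's implementation and not the method shown in the video: the customer figures, the function names, and the simple "further out than any training point" outlier rule are all assumptions made up for illustration. It covers the three uses mentioned above: finding segments, categorising a new data point, and flagging an outlier.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

def k_means(points, k, iterations=20):
    """Plain K-Means: seed with the first k points, then alternate
    assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        updated = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def is_outlier(point, centroids, radius):
    """Assumed rule: a point is an outlier if it lies further from its
    nearest centroid than any training point did."""
    return distance(point, centroids[nearest(point, centroids)]) > radius

# Hypothetical customers: (age, yearly spend in thousands).
customers = [(25, 2), (60, 9), (27, 3), (62, 8), (23, 2.5), (58, 10)]

centroids = k_means(customers, k=2)
radius = max(distance(p, centroids[nearest(p, centroids)]) for p in customers)

segment = nearest((26, 2.5), centroids)          # categorise a new customer
unusual = is_outlier((45, 30), centroids, radius)  # far from both segments

print(centroids, segment, unusual)
```

The sketch assigns each point to exactly one cluster, which is the K-Means way of working; EM instead gives each point a probability of belonging to every cluster. In real data you would also rescale the attributes first, so that one with a large numeric range does not dominate the distance.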

News: SQL Server 2014 CTP1 and HDInsight for Azure

Microsoft have just released CTP1 of SQL Server 2014. The key new bits, some of which I have already described in my older article, are: xVelocity ColumnStore, which is not the "old" ColumnStore Index but an actual data storage technology, particularly useful for some data warehousing applications; Hekaton (in-memory OLTP); the ability to "extend" the server buffer pool memory onto SSDs; and a few updated high availability features. By the way, you can grab a ready-made image of CTP1 in the Windows Azure Image Gallery, so you don't even need to spend time installing anything.

In case you have not noticed, the heart of Microsoft's Big Data approach, that is the Microsoft/Hortonworks implementation of the Hadoop ecosystem, called HDInsight for Azure, has left its early proving grounds and has recently moved to the "proper" Azure, where it is available as a fairly mature preview. If you have not applied for it, go to your Azure portal, press the plus button to add a new Data Service/HDInsight (good luck getting a quick acceptance), and have a go at some of the samples (I suggest Mahout, of course). I have been lucky to have had access to previews of both the earlier and the newer incarnations, and I am glad to see that the UI has stabilised (mostly!) and that the integration with Azure is now closer, thanks to the switch to Azure Blob Storage for all of HDInsight. I am still unsure about the performance impact of not using HDFS (have a look at my earlier article about Big Data) in this version of HDInsight, but my early tests show that running it directly over Azure Storage works just fine for my data sets and my applications. I'll blog more about it.

I had a chance to show it to one of my customers recently, and we had some early, promising success applying Mahout to customer taste analysis. For this customer, it was a good example of how useful the combination of the simplicity of Excel's front-end with the power of HDInsight, working behind the scenes, can be. I am planning to record two videos on this subject during the summer, with some luck in August. I realise that there is very little formal guidance or good video content about Mahout, and almost none about Mahout on HDInsight, so please follow this newsletter, RSS, Twitter, or LinkedIn to be notified when this content is available.

By the way, I was very pleased with the low-cost nature of HDInsight. Running a reasonably large analytical job on a 32-node cluster cost only about $5! In the old days it would have cost 2-3 orders of magnitude more, or even beyond that. That "supercomputer for pennies" is finally becoming a reality, and with the good competition from Amazon Web Services Elastic MapReduce it will only get cheaper. Exciting times ahead for anyone analytically minded.

Promotion: 10% Off!

If you are not a Full Access Member, or if you would like to extend your existing membership, simply use code NEWSLETTER2013JUNE for a 10% discount, redeemable just until 15 July. It is valid for all memberships, including, this time, our Group Memberships, great for a team wanting to learn analytics. 

Thanks for reading, and thank you for being our online member.

Rafal Lukawiecki, Strategic Consultant and Director, Project Botticelli Ltd

Project Botticelli New Content Announcements