Project Botticelli

Data Mining Concepts and Tools Purchase this course

27 May 2012 · 16 comments · 10528 views

Fundamentals of SQL Server Data Mining

This 50-minute video introduces the fundamental concepts of Data Mining, a powerful analytical technology. We start by introducing to you the process of data mining and the SQL Server Analysis Services (SSAS) Data Mining architecture, using the Multidimensional and Data Mining Mode mode of SSAS. We introduce data mining tools, starting with Excel with the free Data Mining Add-Ins for Office, and focusing on SQL Server Data Tools (SSDT), which are well suited to longer-duration analytical mining projects.

The most fundamental concept in data mining is that of Cases, which represent the entities you wish to analyse, such as customers, products, or events. The simplest form of a case is just a flat, denormalized row of data. We briefly explain other formats of cases, too: Customer Signatures, which contains as-of validity dates, and Nested Cases, on which we focus towards the end of this tutorial, when you can also see a demo comparing the use of Decision Trees with, and without, nesting.

You will also learn about the concepts of Mining Structures, used to describe Cases, Mining Models, and Mining Algorithms. We briefly introduce 9 of the Microsoft data mining algorithms: Naïve Bayes, Clustering, Decision Trees, Association Rules, Sequence Clustering, Neural Networks, Logistics Regression, Linear Regression, and Time Series. They will be explained in more detail in other modules of this series. 

The remainder of this module discuses Column Data Types (especially Text, Long, and Double), and Column Content Types, focussing on the differences between Continuous, Discrete, and Discretized data. You will hear about different approaches to automatic Discretization, including Equal Areas, Clusters, and Thresholds technique, and about assisting the algorithms by hinting the statistical distribution of data in a column, such as Normal, LogNormal, or Uniform.

To help you learn, there are 5 demos in this module, which you can follow using your own datasets, Adventure Works from codeplex.com, or by downloading our educational dataset, HappyCars, available when you purchase access to this course.

Log in or purchase access below to the premium version of this content.

Purchase This Course or Full Access Subscription

Single Course

$250/once

Access this course for its lifetime*.
Purchase
Subscription Best Value

$480/year

Access all content on this site for 1 year.
Purchase
Group Purchase

from $480/year

For small business & enterprise.
Group Purchase

  • Redeem a prepaid code
  • Payment is instant and you will receive a tax invoice straight away.
  • Your satisfaction is paramount: we offer a no-quibble refund guarantee.

* We guarantee each course to be available for at least 2 years from today, unless marked above with a date as a Course Retiring.

Comments

hxy0135 · 29 May 2012

Hi Rafal,
As always, it is very good presentation.
We just started a task for product pricing analysis. The goal of the task is to maximizing our product prices as much as possible without losing a sale. How can we use data mining to facilitate the task? What algorithm would you think is good for our task?

Thank you for your time!
Hua

Rafal Lukawiecki · 30 May 2012

Dear Hua, thank you for your kind feedback, I am glad you liked this video. Your task is one of price simulation, and it can be modelled in many ways. Data Mining would be of use if you have already experimented with different prices, and you had information about sales that succeeded, or failed, in relation to those prices. You would then train a model on that experimental data, so that it would be able to predict the success/failure of a transaction, given the price, and perhaps other parameters that you wished to include (customer, product, or other characteristics).

I would think that a regression should work (Decision Tree with continuous data, or a Linear Regression), as the failure rate ought to be a smooth function of the price. If, however, you include other parameters, you would need to try the other algorithms too, notably a Neural Network. Ideally, you may want to try to develop a generalisation of sale success probability as a function of the price. Whatever you do, make sure to test the model, both numerically, and, if possible, using a control group study, if larger amounts were at risk.

hxy0135 · 30 May 2012

Rafal,
Thank you very much for your insights!

joseusul · 21 September 2012

Hi Rafal,

Congratulations for your great job!

It's incredible the quality of your presentations. You are always an inspirations for the newcomers to this world.

I would like to comment that it's funny that you quotes the great book "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management" by Gordon Linoff and Michael Berry. I say that because I remember reading it that the authors commented that the ecommerce sites should avoid include, during the buying process, a field with "coupon discount" or something similar, because can discourage the buy, so the potential customer can try to find a coupon in internet to get the discount for free. O maybe during this search the user can find another similar site and finally we would have lost the sell.

Of course, it's only an anecdote :-)

Kind Regards. José Antonio.

Rafal Lukawiecki · 21 September 2012

Hi José Antonio, thanks for your compliments, much appreciated! Indeed, you are quite correct that having a coupon code field on checkout pages can discourage potential buyers. I am well aware of this phenomenon from our business, and also from my personal experience. I will speculate, however, that what this can lead to is a search for discount codes online, delaying a purchase, rather than outright stopping it. Also, it sensitises a value-oriented buyer to their presence, often creating a better perception of value when a code is found. However, I also feel that this only works if the discounts offered are relatively small, which, by the way, should also be a consistent business practice, in order not to diminish long-term value.

I am sure different e-commerce sites experience it differently, and it is also a compromise. For example, in our case, we offer an occassional 10% discount to our newsletter subscribers and to conference attendees. Also, when an account is about to expire, our system will offer a 10% discount for renewing it, while the account is still active—you should get an email about 3–4 days before the account expiration. One has to be able to handle this need, even if there is a potential downside, I think.

ness · 4 June 2013

Hi Rafal,

This presentation is really good and I had fun time watching it! Keep it up. :)

We are planning to use datamining for our Out of Stock or Speed to Shelf report.
The factors that affect it are the following:
Seasonality, Regularity, Days of the Week, proximity to pay day, Store lifestyle, Promotion and Price.

What you think is the best datamining model to use? Do you think having 2 years data to use for the training is enough or this is too small?

Your input will highly be appreciated.

Thanks,
Agnes

Rafal Lukawiecki · 7 June 2013

Agnes, thanks for asking. I think you have enough data, two years is usually a good amount as it shows you a whole-year pattern, but it is not too old, in case your business practices have been changing.

As in all data mining, analytical projects, I do not think you can home-in on a single, best modelling technique, but you need to work on a project-basis, changing the models, and the data, as you build your understanding. Having said that, I have a feeling that after some initial data cleansing and preparation, you may want to explore any high-level patterns using a Decision Tree, and perhaps use Clustering to group your stores, if they show markedly different customer behaviours—once you have your grouping/segments you can filter your models to make them more accurate. As it seems to me that you are trying to forecast a numerical, time-keyed outcome, you may want to consider Time Series for stock level forecasting—but make sure to test it well, using the historical model technique. I would also want to consider if your business problem (out of stock) could be treated as an event, and if it could, I would consider using Sequence Clustering, but I’m not sure if it would be the most practical approach.

Overall, the key to solving your quest is to find a good way to state the predictable outcome as a value of an attribute that could be modelled using the data you have.

Good luck!

Terry Lim · 26 December 2013

hi Rafal, I just joined, and had watched some of your lectures (my first encounter is Data Mining), and It was great and fun learning the concepts. However, for my work background, I think I would need initially to apply Data Mining in Excel. Is there a particular module or probably you may refer that tackle Data Mining in Excel? Thanks in advance.

Rafal Lukawiecki · 7 January 2014

Welcome, Terry! As you work your way through the modules, you will find that I tend to switch between Excel (with the Data Mining Add-ins installed) and SSDT. No single larger module focuses just on Excel or on SSDT, and, in frankness, I think you will find out that to make good use of Microsoft Data Mining you need to be using both tools, as they have significant advantages and they complement each other, especially with regards to the mining process.

On the other hand, if you look at some of the shorter, 5–10 minute modules, those tend to use Excel only. Having said that, your idea of focusing on Data Mining in Excel, without SSDT, is now on my list of videos to record in the future. Thanks for sharing it with me.

apalomo · 16 June 2014

Dear Rafal,

Congratulations on what are very well done videos! I am a young physicist hoping to move into Data Mining and Analytics, and I have decided to take on your tutorials as a first exposition to the topic.
The one thing I would like to comment critically, however, is your use of proprietary software tools all along. As someone coming from Academia, I usually prefer to stick with FSF (free software foundation) solutions, both because of financial considerations and from a principles point of view. As I gather information, I do believe these products exist for data mining purposes, and I would like to encourage you to do some tutorials using these instead of the proprietary ones you currently employ.

Thanks again.
Regards,
Alberto.

Rafal Lukawiecki · 17 June 2014

Alberto, thank you for your kind and thoughtful comments. In my professional life, I mainly use 6 analytical technologies at the same time: Microsoft SQL Server, R, Microsoft Excel, Python, Apache Hadoop with Mahout. I’d say I spend half my work in the commercial software world and the other half is various forms of open source and free software, which I like very much and try to support. However, I have good reasons to use SQL Server:

  1. Most of my customers already have SQL Server, and they do not need to pay anything extra to do data mining in it, and as their data is already in there, it is also easier than having to wrangle it in and out.
  2. I like the dynamic, interactive visualisations of SSDM. While I can build a decision tree in R using rpart etc, I cannot visualise it as nicely…
  3. SSDM is really, really fast for even very large data sets, because, at heart, it is a good data engine.

However, I like using R for all the statistics and for model validation and assesment, not to mention additional code for solving unique issues, like automating point look-ups on curves (root solving) etc. I like Python for programmatic data wrangling. And I like Excel, especially with Power Query and Power Pivot, for much data preprocessing while iterating my models rapidly.

What all of this means is that I am 100% committed to recording additional modules which will show those tools too. I can’t wait to have a nice intro to R for data miners, but it may have to wait a little, because my next focus, after I publish the Association Rules modules, will be on the upcoming machine learning in the cloud, Azure ML—have a look at my short blog about it. Best of all, it allows you to use open source algorithms extensively, if you prefer those, so maybe something of interest to you, too. Thanks for becoming a member, very much. RL.




chukliz · 24 June 2015

This is a great course! You have simplified the concept of data mining to some of us who are new to the world of data mining.

Thank you for this wonderful piece of work.
Well done!!
EO

Rafal Lukawiecki · 24 June 2015

Thanks, Elizabeth! I am glad you have enjoyed. Much more on data science on its way later this year.

cneal4 · 8 September 2016

Great explanation of everything!

Samibd2017 · 2 January 2017

Hi Rafal,
I am fan of your tutorials and i have been following you for quite some times, i did not get a chance to become full member until today.

I am currently working for a client is a Stock Market Prediction Project. I am still playing with the data, was able to built a time series model and also was able to generate a forecast, well the problem is, i am not able to export the data from SQL Data Mining in tabular Format to Excel so that i can further do some analysis. Is there any way to do so?

Regards,
Sami

My data Format is:

Date Product Rate
01-Jan-16 A 1
02-Jan-16 A 2
03-Jan-16 A 3
04-Jan-16 A 4
05-Jan-16 A 5
06-Jan-16 A 6
07-Jan-16 A 7
08-Jan-16 A 8
09-Jan-16 A 9
[...]
20-Jan-16 C 60
21-Jan-16 C 63

Rafal Lukawiecki · 28 February 2017

Hi Sami, thanks for joining. I would suggest using the Data Mining Add-ins for Office, as they give you everything that SQL Server Data Mining can do while being in Excel, so there is no need to export or import. However, you could also write the results of your prediction to a SQL table and then import that using Excel's "Get Data". That second approach would also work if you wanted to use Power BI.

Online Courses