Data Science Concepts: Cases and Statistics

10 December 2014

Fundamentals, Part 1

Rafal shows a density plot using ggplot2 in R

Let Rafal, an expert on predictive analytics, data mining, and machine learning, introduce the most fundamental data science concepts in this 1-hour video. The key entity that we analyse in every data science project is known as a case, an observation, or a signature, the last term being used especially in customer relationship data mining. In its most common form, a case is just a flat row of data columns, or attributes. This module introduces cases in detail, explaining their components: predictable outputs, sometimes referred to as dependent variables, and the inputs to analysis, also known as independent variables.

It is possible to start a data science project with quite a few inputs, even thousands, before you settle on the most relevant ones. Another important consideration in terms of the size of the data being analysed is the number of cases. Analysing too many of them, big-data-style, is not always a good idea, as it may lead only to finding bland and not useful patterns, especially if the set of cases spans a long period of time during which the underlying business practices changed significantly. On the other hand, not having enough cases is also problematic, especially if not all realistic combinations of inputs and outputs are well represented.
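To make the terminology concrete, here is a minimal Python sketch of a case as a flat row of attributes, split into inputs and an output. The customer attributes, the `churned` output, and the `split_case` helper are all hypothetical, chosen only for illustration; they are not taken from the video.

```python
# Hypothetical customer cases: each case is one flat row of attributes.
cases = [
    {"age": 34, "region": "North", "visits": 12, "churned": False},
    {"age": 51, "region": "South", "visits": 3, "churned": True},
    {"age": 27, "region": "North", "visits": 8, "churned": False},
]

# Assumed name of the predictable output (dependent variable).
OUTPUT = "churned"


def split_case(case, output=OUTPUT):
    """Separate one case into its inputs (independent variables) and its output."""
    inputs = {name: value for name, value in case.items() if name != output}
    return inputs, case[output]


inputs, outcome = split_case(cases[1])
print(inputs)   # the attributes used as inputs to the analysis
print(outcome)  # the value a model would try to predict
```

Everything except the chosen output column is treated as an input here; in a real project you would typically start with many more inputs and narrow them down.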

This can also be an issue when one outcome is significantly rarer than the rest, as may be the case when building fraud analytics models. This is known as the class-imbalance problem, and it will be discussed in more detail in later modules. However, a high-level approach to analysing outliers or exceptions is introduced early in this video to demonstrate the key Microsoft platform data science tools: Excel, SQL Server Analysis Services Clustering, and Azure Machine Learning, also known as Azure ML.
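A quick way to spot class imbalance is to look at the share of each outcome before modelling. The sketch below uses made-up labels (5 fraudulent cases out of 1,000) and a hypothetical helper, `class_shares`; it only illustrates the idea and is not the approach shown in the video.

```python
from collections import Counter

# Hypothetical outcome labels for 1,000 transactions, 5 of them fraudulent.
labels = ["legitimate"] * 995 + ["fraud"] * 5


def class_shares(labels):
    """Return the fraction of cases that fall into each outcome class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


shares = class_shares(labels)
rarest = min(shares, key=shares.get)  # the class with the smallest share
print(rarest, shares[rarest])
```

With shares this skewed, a model that always predicts the common class would be right for 995 of the 1,000 cases while detecting no fraud at all, which is why the rare outcome needs special attention.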

Data wrangling (or munging) is a very important yet unglorified aspect of data science. Generally, it is a good idea to prepare your data as close to its source as possible, using tools such as SQL if you can, and falling back on more specialised tools, like Excel Power Query, Python, or even R, only when necessary.
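As a rough illustration of preparing data close to its source, the sketch below lets SQL do the aggregation so that the analysis tool receives one ready-made row per customer. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for a real source system; the `orders` table and its columns are assumptions made for this example.

```python
import sqlite3

# In-memory SQLite database standing in for the real source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (1, 35.5), (2, 12.0), (3, 8.0), (3, 2.0), (3, 40.0)],
)

# Aggregate at the source: one row (one case) per customer.
rows = conn.execute(
    """
    SELECT customer_id,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()
conn.close()

for customer_id, order_count, total_amount in rows:
    print(customer_id, order_count, total_amount)
```

The same query could run directly against the production database, leaving only the later, more specialised steps to tools such as Power Query, Python, or R.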

R plays a very important role in today’s data science. It is a free, open-source package of statistical analysis software, which you can run on Windows, Linux, and Mac OS X; get it from http://cran.r-project.org. It is easier to use in conjunction with a more modern development environment and R debugger known as RStudio, which you can get free of charge from http://www.rstudio.com/products/rstudio/download/. Completing the R toolkit is an R package called Rattle, which is most easily installed from within R or RStudio once those are running on your desktop. Rattle enables GUI-style descriptive statistics and data mining, and it is shown in the demo.