Azure Machine Learning is in public preview, available to everyone who has an Azure account (even trial ones), as of last night. Well done, Microsoft! If you would like to see why I have been enthusiastic about this technology, have a look at my high-level why does it matter short news piece, written a month ago, or stay here to find out what it is all about. You should also have a look at my short 10-minute demo video, What is Azure ML? and please also consider following our full online course on data science.
[Please also have a look at the newer, 2015 article What is Advanced Analytics, Data Science, Machine Learning—and What is their Value?]
In short: in order to do predictive analytics with Azure Machine Learning, you just have to:
Azure ML (aka Project Passau) has two major conceptual components: Experiments and Web Services, and one development tool, called ML Studio. You can invite other people, who have a Microsoft Account/Live ID, to collaborate on your workspaces using ML Studio, and they do not even need to be paying for an Azure subscription to work with you.
Experiments look like data-flow configurations of what you would like to do with your information and with your models. You, as an Azure ML data scientist, focus on the experiments, and you may spend all of your time in ML Studio, doing nothing else but reworking them, changing parameters, algorithms, validation criteria, repeatedly amending data, and so on. ML Studio is a web application and it looks a bit like the Azure portal (ca mid-2014). It feels clean, nice, and it seems to work relatively well not only in IE but also in Firefox and Chrome—ok, there are still some UI niggles, but hey, this is v1 preview.
ML Studio is the place where you begin your work by deciding what data sources you wish to use: sets uploaded by you, or accessed live, using a Reader, from: a web page, OData, SQL Azure, Windows Azure, Hive, or Azure BLOB. Then you may want to apply some Data Transformations like groupings, renaming columns, joins, removals of duplicates or the very useful binning/discretisation. There are even some fancy transformations, like the Finite and Infinite Impulse Response (FIR and IIR) filters, which are used in signal processing but may be of wider use if you consider that some economic data, especially time series, can be thought of as a complex waveform: part of the job of seasonality detection is usually concerned with finding the music-like frequencies of those seasonalities. Also, if you are at the beginning of a project, and you are not sure which columns to include, then the automatic Feature Selection filters may help, by presenting you with a good choice of correlation metrics. In real use, however, you will want to prune the selection of columns by hand, at a later stage, for maximum accuracy.
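To make the signal-processing idea a little more concrete, here is a minimal sketch of a FIR filter in plain Python. It is not how Azure ML implements its filter tasks; the function name `fir_filter` and the sales figures are made up purely for illustration.

```python
def fir_filter(series, coefficients):
    """Apply a finite impulse response filter: each output value is a
    weighted sum of the current input and a fixed number of earlier inputs."""
    taps = len(coefficients)
    output = []
    for i in range(taps - 1, len(series)):
        window = series[i - taps + 1 : i + 1]
        # coefficients[0] weights the newest sample, coefficients[-1] the oldest
        output.append(sum(c * x for c, x in zip(coefficients, reversed(window))))
    return output


# Hypothetical quarterly figures: a seasonal pattern of period 4 on a slow trend.
sales = [10, 20, 30, 20, 12, 22, 32, 22, 14, 24, 34, 24]

# A 4-tap moving average is the simplest FIR low-pass filter: averaging over
# one full season cancels the seasonal swing and leaves the underlying trend.
smoothed = fir_filter(sales, [0.25, 0.25, 0.25, 0.25])
print(smoothed)
```

With these numbers the smoothed series climbs steadily by 0.5 per quarter, which is the trend that the seasonal swing was hiding.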
Now you reach the bit we have been waiting for: doing the actual Machine Learning, that is: Initialize (define) a model, Train it with some data, Evaluate its performance and validity, and, if all is good, Score it (make predictions with it). Azure ML introduces a lot of algorithms for Classification tasks, including Multiclass and Two-Class Decision Forests, Decision Jungles (developed by Microsoft Research), Logistic Regression, Neural Networks as well as Two-Class Averaged Perceptron, Bayes Point Machine, Boosted Decision Trees and Support Vector Machines (SVM). Clustering uses a variant of the standard K-Means approach. Regressions include Bayesian Linear, Boosted Decision Trees, Decision Forests, Linear Regression (of course), Neural Network Regression, Ordinal and Poisson Regression. And this is just version 1.
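The Initialize, Train, Evaluate, Score sequence can be sketched in a few lines of plain Python. This is only an illustration of the workflow, using a textbook averaged perceptron on toy data; the class name, its methods and the data are my own assumptions, and this is not the code behind the Azure ML task of a similar name.

```python
class AveragedPerceptron:
    """Textbook two-class averaged perceptron; labels are +1 or -1."""

    def __init__(self, n_features):
        # Initialize: define an untrained model
        self.w = [0.0] * n_features
        self.b = 0.0
        self.w_sum = [0.0] * n_features  # running sums used for averaging
        self.b_sum = 0.0
        self.steps = 0

    @staticmethod
    def _raw(w, b, x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    def train(self, rows, labels, epochs=10):
        # Train: nudge the weights whenever a row is misclassified
        for _ in range(epochs):
            for x, y in zip(rows, labels):
                if y * self._raw(self.w, self.b, x) <= 0:
                    self.w = [wi + y * xi for wi, xi in zip(self.w, x)]
                    self.b += y
                self.w_sum = [s + wi for s, wi in zip(self.w_sum, self.w)]
                self.b_sum += self.b
                self.steps += 1

    def score(self, x):
        # Score: predict using the weights averaged over all training steps
        if self.steps == 0:
            raise ValueError("model has not been trained")
        w_avg = [s / self.steps for s in self.w_sum]
        b_avg = self.b_sum / self.steps
        return 1 if self._raw(w_avg, b_avg, x) > 0 else -1


# Toy, linearly separable data (made up for the example)
train_rows = [[2, 1], [3, 2], [-2, -1], [-3, -2]]
train_labels = [1, 1, -1, -1]
test_rows = [[4, 3], [-4, -1]]
test_labels = [1, -1]

model = AveragedPerceptron(n_features=2)
model.train(train_rows, train_labels)

# Evaluate: accuracy on rows the model has not seen during training
hits = sum(model.score(x) == y for x, y in zip(test_rows, test_labels))
accuracy = hits / len(test_rows)
print("accuracy:", accuracy)
```

In ML Studio each of these stages is a separate task that you wire together, so swapping the algorithm means replacing one box rather than rewriting the loop.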
You can also apply useful Statistical functions in your experiments, including common Elementary Statistics, like calculating variances. Indeed, start by simply pointing your dataset to the Descriptive Statistics task and Visualise the results while you are beginning (get used to right-clicking the bullet-like output connection points on the tasks). Enjoy the boxplots in those visualisations—something long-missing from the rest of Microsoft BI, even in Excel…
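If you have not met a boxplot before, it is little more than a drawing of five numbers. The sketch below computes them, plus the variance, with Python's standard `statistics` module; the sample values are invented, and quartile conventions differ between tools, so do not expect the figures to match the Azure ML visualisation to the last decimal.

```python
import statistics

# Invented sample values, e.g. one numeric column of a data set
values = [2, 4, 4, 4, 5, 5, 7, 9]

# statistics.quantiles with n=4 returns the three quartile cut points
q1, q2, q3 = statistics.quantiles(values, n=4)

five_number_summary = {
    "min": min(values),
    "q1": q1,
    "median": statistics.median(values),
    "q3": q3,
    "max": max(values),
}

print(five_number_summary)
print("sample variance:", statistics.variance(values))
print("population variance:", statistics.pvariance(values))
```

The box of a boxplot spans q1 to q3, the line inside it is the median, and the whiskers reach out towards the minimum and maximum.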
One cool example of Azure ML bringing some external intelligence into your experiments shows up in the Text Analytics task section. The Named Entity Recognition task will parse your input texts (called stories, eg. emails, typed case descriptions, or tweets) and it will extract named terms from them, classifying them, automatically, as People, Locations, or Organisations. There is also support for Vowpal Wabbit, which came from Yahoo and Microsoft Research. You can use it for feature hashing at the moment. I expect much more to appear in this area, as, after all, it is pretty obvious that Microsoft is sitting on a mine of knowledge inside Bing.
And to top it all off, you can also use R inside Azure ML. By my count, as of today, Azure ML comes with about 410 pre-installed packages on top of R 3.1.0 (surprisingly current). It has ggplot2 (yes!), plyr and dplyr, car, datasets, Hmisc, MASS, and all the more commonly used data mining packages like rpart, nnet, survival, boot, and so on. If you want to find out which packages have been included in Azure ML, simply create a little experiment like the one I did, shown here, execute this tiny bit of R code, and save the resulting CSV on your machine. Column 1 will show all the included packages.
What if your favourite R package (eg. ROCR, or nleqslv) is not in there yet? Well, the documentation is a bit confusing. It says that it “currently” is not possible to install your own packages, but then it proceeds to show you how to do precisely that, by using a workaround with a package source zip file—see the code at the bottom of this, which demonstrates how to use install.packages() while referring to a file passed into the Execute R Script task.
The key to understanding the value of having R as part of Azure ML is, in my opinion, not just in having access to the lingua-franca of statistics and analytics, but also in how fast and painless it is while processing your data, especially as R is not that great at data wrangling anyway. So, instead of using the venerable RODBC (included) inside your R script, you might consider using Azure ML to do all the heavy-duty data handling (sorry plyr fans) and pass it into your R script as an Azure ML Dataset Data Table, which becomes available to it as a native R data frame: it will magically appear to your R script as an object named dataset. You can add several of those. I have not done my benchmarking yet, but anything that improves the performance of R on larger data sets is always great to have. This also seems like an obvious advantage to a cloud service provider, as opposed to a shrink-wrapped software vendor, and I can imagine Microsoft have a good few tricks up their sleeves when it comes to making Azure ML R sing with Azure-based data sets, even if, at the moment, those are limited to 10 GB in size.
With or without R, you may now have an experiment that works, and which you may want to use as a building block inside your web-savvy application. Perhaps you have just built a recommender system. In Azure ML terms, you have an experiment that uses the Scoring (predicting) task. You identify which of its inputs (ports) should be matched as a Publish Input to your web service, and similarly, what should become the Publish Output. They will appear as little green and blue bullets on the outlines of a task. You re-run your experiment once more, and you use ML Studio to publish it as an Azure ML Web Service. Now you can consume it via the Azure ML REST API, as a simple web service, or as an OData endpoint. This API provides a Request Response Service (RRS) for low-latency synchronous uses, for making predictions, and an asynchronous Batch Execution Service (BES) for the retraining of a model, perhaps with your future, fresher data. The API self-documents sample code that you can copy-and-paste to use in a Python, R, or a C# application, or pretty much anything else, as after all it is just REST and JSON. There is a neat little test page that lets you enter the values required by your newly minted service and which performs a test prediction.
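Since it is just REST and JSON, calling a published service from Python needs nothing beyond the standard library. The sketch below only builds the request, it does not send it. Everything specific in it is a placeholder I made up: the URL, the key, the `inputs` payload shape and the bearer-token header. Copy the real values and the exact request format from the sample code that your own service's API page generates.

```python
import json
import urllib.request


def build_scoring_request(endpoint_url, api_key, feature_values):
    """Build (but do not send) a JSON POST request for a scoring endpoint.

    The payload layout and the Authorization scheme are assumptions for
    illustration; a real service documents its own.
    """
    body = json.dumps({"inputs": feature_values}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


request = build_scoring_request(
    "https://example.invalid/score",   # placeholder URL
    "YOUR-API-KEY",                    # placeholder key
    {"age": 42, "country": "UK"},      # hypothetical input columns
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To call a real service, pass the request to urllib.request.urlopen()
# and parse the response body with json.loads().
```

Keeping the request construction separate from the network call makes it easy to check what you are about to send before you spend any API calls on it.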
There are additional features designed for real-world production purposes, like preventing any component of your experiment (tasks etc.) from being automatically upgraded should Microsoft decide to change (and maybe break!) them in the future. Well done Microsoft, this is something of a bugbear for any web-based system maintainer. You can stage your service updates, while they are in production, and you can also configure security, through an API access key.
How much does it all cost? Bearing in mind this is the preview pricing model, it seems relatively attractive. There are two fees, a per-hour active compute fee and a per-web-service API call fee, both prorated. The per-hour fee is lower whilst you are using ML Studio ($0.38/hour) and a little higher when in production via ML API Service ($0.75/hour). API calls are free while in ML Studio and cost $0.18/1000 predictions while in production. If anything, this is an interesting and blissfully simple model, not something Microsoft have been known for. I am keen to find out what my developer customers think, as I feel there is great value in effectively reselling Azure ML as part of your own web application, with little to maintain, other than the goody bits themselves.
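A quick back-of-the-envelope calculation shows how the two fees add up. The rates below are the preview prices quoted above; the usage figures are invented, and I am assuming that both fees scale linearly with usage, which is how I read "prorated".

```python
# Preview rates quoted in the text (US dollars)
STUDIO_HOURLY = 0.38          # active compute per hour in ML Studio
PRODUCTION_HOURLY = 0.75      # active compute per hour via the ML API Service
PER_1000_PREDICTIONS = 0.18   # production API calls; free inside ML Studio


def monthly_cost(studio_hours, production_hours, predictions):
    """Estimate a monthly bill, assuming simple linear (prorated) charging."""
    compute = studio_hours * STUDIO_HOURLY + production_hours * PRODUCTION_HOURLY
    calls = predictions / 1000 * PER_1000_PREDICTIONS
    return round(compute + calls, 2)


# Hypothetical month: 20 hours of experimenting, 100 compute hours in
# production, and 50,000 predictions served.
estimate = monthly_cost(studio_hours=20, production_hours=100, predictions=50_000)
print(estimate)
```

For that made-up month the compute comes to $7.60 plus $75.00, the predictions to $9.00, so $91.60 in total.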
How do you get started? Visit azure.microsoft.com, sign in, and create a workspace in New/Data Services/Machine Learning. Then go to its Dashboard and click the Sign-in to ML Studio link. After reviewing the tasks that make up an Experiment, I would suggest you just select one of the many Samples, create a copy of it, and start running it. If it works, follow the steps above to publish it as your first predictive web service.
Of course, please make sure you don’t miss our upcoming videos and articles on this subject: make sure you are a member, so you get our information-packed newsletter. Best of all, follow our in-depth data science course, which not only covers Azure ML, but also a good dose of R and SSAS. SSAS is also further discussed in our data mining training, and the modules focused on data preparation and validity fully apply to Azure ML, too.
Enjoy machine learning and data science!
Rafal