Project Botticelli

Next Year in Machine Learning, Data Science, AI and BI

24 December 2019

The Future Series (2019)

ML, BI, DS, and AI Trends that Shaped 2019

This free video, just published, at the very end of 2019, takes a look at the key trends that have shaped the year at the intersection of data science, business intelligence, machine learning, and advanced analytics. As is the current fashion, I—Rafal Lukawiecki—would also like to share a few predictions about the future of our industry, looking towards 2020 and slightly beyond. If you are interested in specifics of what is new and changing in the world of Microsoft machine learning technologies, notably the very new Azure ML, make sure to also watch the next video in this series, Microsoft Machine Learning Technologies: View Towards 2020, which is available to subscribers.

The first trend I discuss that has characterised 2019 is the pervasive nature of the SQL language, which is still at the very top of all rankings of the tools used in our industry, even more popular than Python or R. This is no surprise, as it fulfils many roles, notably in data preparation, which accounts for the majority of the cost and effort of every data science or AI project. I believe this trend will continue for at least another decade.
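As a small illustration of the data preparation role described above, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and the sample rows are hypothetical and exist only for this example; a real project would point the same kind of query at its own database.

```python
import sqlite3


def prepare_customer_features(rows):
    """Aggregate raw order rows into one feature row per customer using SQL."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    # Hypothetical schema, assumed for illustration only.
    conn.execute(
        "CREATE TABLE orders (customer_id INTEGER, amount REAL, status TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    # Typical preparation steps: filter out unusable rows, then aggregate.
    query = """
        SELECT customer_id,
               COUNT(*)              AS order_count,
               ROUND(AVG(amount), 2) AS avg_amount
        FROM orders
        WHERE status = 'completed' AND amount IS NOT NULL
        GROUP BY customer_id
        ORDER BY customer_id
    """
    features = conn.execute(query).fetchall()
    conn.close()
    return features


raw_orders = [
    (1, 20.0, "completed"),
    (1, 30.0, "completed"),
    (1, 99.0, "cancelled"),   # excluded: not completed
    (2, None, "completed"),   # excluded: missing amount
    (2, 10.0, "completed"),
]
features = prepare_customer_features(raw_orders)
print(features)
```

The filtering and aggregation live in one declarative statement, which is a large part of why SQL remains so convenient for this stage of a project.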

It is interesting, but not surprising, to see that the hype about big data has passed: it is nothing special anymore. On the positive side, some of the tools created for that bygone era have kept getting easier to use by climbing up the abstraction ladder. While Hadoop is on the decline (and you definitely should not be considering it any more except in rare situations), it did give rise to Spark, which, in turn, became more usable and closer to our needs as it evolved from v1 to v2 (currently 2.4.4, with v3 in preview). Current Spark builds upon its older, low-level APIs and offers simpler, closer-to-the-user concepts, like Tables that any SQL user will recognise, or Data Frames, analogous to those natively found in R or in Python’s Pandas.

This natural trend of technology making its abstract constructs more usable and relevant to the majority of us has been matched by the beginning of the coalescence of the wide variety of machine learning frameworks. Far fewer have been appearing recently than in the previous few years, and there has even been a slowdown in the creation of new language-specific packages, like those found on R’s CRAN. Perhaps we have reached an inflection point, when quality will take over from quantity. In any case, this is a good trend, as it means that there is less unnecessary repetition, effort is more focused on stability, and the longevity of those frameworks is more assured. Does anyone even remember Mahout? I did some commercial work in it a few years ago…

Data science has just passed its peak of hype. Unfortunately, there will be both the unavoidable disillusionment that follows any hype and another, more justified one, caused by several negative trends that have marked the last two years, above all the proliferation of unreliable machine learning models. It is easy to build something that appears to work and is even characterised by a high level of accuracy whilst in the confines of our development environment, using known (sometimes too-well known) datasets. At a recent NeurIPS academic AI conference, a top Google researcher, Blaise Aguera y Arcas, and others heavily involved in deep learning development said that we have reached the limits of deep learning; read this Wired article for a neat summary. Besides, we all read constant reports of racial and other biases in facial recognition, or of how adversarial machine learning easily defeats even complex models.
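To make the point about models that look accurate only inside the development environment concrete, here is a deliberately naive sketch. The "model" is a plain lookup table that memorises its training examples; the features, labels, and default prediction are all made up for illustration and do not come from any real dataset.

```python
def train_memoriser(examples):
    """'Train' by memorising every (features -> label) pair seen."""
    return dict(examples)


def predict(model, features, default="no"):
    """Recall the memorised label, or fall back to a fixed default."""
    return model.get(features, default)


def accuracy(model, examples):
    """Share of examples for which the prediction matches the label."""
    correct = sum(
        1 for features, label in examples if predict(model, features) == label
    )
    return correct / len(examples)


# Hypothetical (age, city) -> "bought?" records, invented for this sketch.
train = [
    ((25, "dublin"), "yes"),
    ((40, "cork"), "no"),
    ((31, "galway"), "yes"),
    ((52, "dublin"), "no"),
]
# Slightly different people that the model has never seen.
unseen = [
    ((26, "dublin"), "yes"),
    ((41, "cork"), "no"),
    ((30, "galway"), "yes"),
    ((33, "cork"), "yes"),
]

model = train_memoriser(train)
train_accuracy = accuracy(model, train)
unseen_accuracy = accuracy(model, unseen)
print(train_accuracy, unseen_accuracy)
```

The memoriser scores perfectly on the data it was built from and poorly on new cases, which is the same failure, in miniature, as an overtrained model evaluated only on familiar datasets.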

I have never been a fan of deep learning for purposes other than recognition (please watch my video on Artificial Stupidity), but I am glad, if surprised, that the biggest industry players acknowledge what we have known for a long time: it is our data (or the lack of it) that ultimately limits what an algorithm can discover. As I have been banging on about for some time, your data and a wise choice of a clear business goal are far more important than the algorithm or technology.

Recognition, especially when based on deep learning, still smacks of model overtraining to me. Indeed, that may even be necessary: those ethically questionable facial matching applications that proliferate in repressive countries, like China, need to recall a specific case, a person, not generalise about it. There is little intelligence or generalisation here; it is all more of a novel way to do fuzzy search on a vast image data set. Memory and recall are not intelligence, but they certainly help a lot.

However, even when the models are supposedly expanded to recognise traits, things do not work well. They are naturally limited by the highly biased and incomplete data they have been fed. Claiming that an algorithm trained this way can be used to detect criminal intent from a photo of a person is shockingly inhuman and unethical, and I wish it were illegal, except that it furthers the goals of the repressive regimes and their paymasters, feeding a lie explained with garbage science. Sadly, that is also what I fear from the current quest to explain the meaning of black-box ML models by using so-called machine learning explainers or interpreters, like LIME. We should be relying on models that we can explain directly, like simpler decision trees or regressions, rather than on ones that require a ton of questionable technology to give us a still unsatisfying and highly uncertain explanation.
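As an example of what "explain directly" can mean in practice, here is a minimal sketch of a one-variable least-squares regression written in plain Python. The visit and spend figures are invented for illustration; the point is only that the fitted model is two numbers that a business user can read without any separate explainer.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: monthly store visits and monthly spend per customer.
visits = [1, 2, 3, 4, 5]
spend = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_simple_regression(visits, spend)
print(f"Each extra visit is associated with {slope:.2f} more spend, "
      f"on top of a baseline of {intercept:.2f}.")
```

The whole model can be stated in one sentence, and anyone can check it against the data by hand. That transparency is what a post-hoc explainer for a black-box model can only approximate.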

One way or another, a common-sense business user simply observes that “all that fancy ML stuff my guys are building never really works when we try using it in the real world”, to quote one of my customers. However, it is not hard to make it all work—

Small-data, statistics-first, simpler ML, aka statistical learning, has been quietly achieving business goals, and that trend is set to continue. I believe the integration of ML and data science into popular business intelligence tools, as seen recently in Microsoft’s Power BI, is a good direction. This trend matches the needs of business users who look for further insight in their already-understood data, perhaps looking to check their confidence in a hypothesis, or simply trying to picture a pattern in their mind. I think that this is a good time to reap the benefit of combining data science, especially on classical, smaller data sets, with modern BI.

From the perspective of a data scientist, a data or AI engineer, or just a data miner, to use my former job title, I see another positive trend for 2020 and beyond: the growth of all-in-one ML environments where I can do everything I need, using the many tools and technologies I like, being able to provision cheap-and-tiny or massive compute as needed, and all of it for a reasonable price. After several false starts (three that I am aware of), Microsoft Azure Machine Learning is finally on a good path. I like how it integrates all the needed tech under one roof, while appealing both to the code-first and the GUI-first development approach. It is still a little buggy, and its responsiveness and speed, especially on small data sets, are poor, but I am sure this will improve soon. Please watch the next video for an in-depth look at what is new in Azure ML, ML Server and other Microsoft machine learning technologies.

My final predictions for 2020 and beyond focus on the changing affiliation of data science, moving away from IT and, at long last, into the business domain. We are already seeing that more business-minded analysts use R than Python, with Python becoming the stronghold of the more formally trained software developer in the roles of data and AI engineering. I think this is good because being closer to business means that data science projects are both more likely to be properly funded and managed, and, above all, they will have clearer, focused business goals, which is the critical success factor in our industry.

If you have been wondering where to put your time and learning effort to be better aligned with the likely future, may I respectfully suggest that you de-prioritise low-level, overly complex platforms, like Hadoop, and focus on both the older and the newer ones that speak a language closer to the needs of the business, like R, Spark, or DAX and Power BI. Make sure to learn about reproducible research, brush up on statistics, focus on clear business goals, spend less time obsessing about technology, trust human common sense rather than the machine, and above all, please stay ethical and legal.

Happy New Year everyone!

Rafal
