Project Botticelli

How to Succeed with Your First Data Science Projects

25 October 2016

Dos and Don'ts

I have had my share of successful and failed projects since I embarked on data science ten years ago, back in the days when it was called data mining, before it became predictive and then advanced analytics. While I am happy to say that I now succeed on customer projects far more often than in the past, that is not just because I know my field better. It is because I am better at setting my own and my customers’ expectations, and more careful in choosing the projects that I want to dive into. I would like to share some of my observations with those of you who are newer to this field. I would like to save you some frustration and to help you succeed as often as possible. If you have a chance to join me on one of my week-long hands-on data science courses, you will hear this subject discussed in more detail, but if not, read on, and please also have a look at my other article, which explains the key reasons why my customers need data science.

The most important success factor is having a useful, reasonably well-defined business goal, one that is desired and supported by the decision-makers. In my experience, two-thirds of the projects that I have never taken past the initial stage (that is, not past a 2–3 day workshop-based consultancy) are the ones that, in my opinion, would have failed because they were more of a technology or conceptual exercise than work geared at the delivery of tangible business benefits. Indeed, my very first commercial data mining project, some nine years ago, which was a failure, involved trying to find a use for all these cool machine learning algorithms rather than trying to address a genuine business need. We were looking for a problem to solve using the tools we liked. As it happened, the customer was a self-acknowledged (and fun to work with) risk-taker who wanted to just “try it out”, and I eagerly went ahead. Today, I do not undertake such work; instead, I might first suggest to such a customer some tried-and-tested ideas to ignite their creativity. I very much like helping them brainstorm, but I would not promise anything unless I had a clear business need in front of me.

This, of course, is a major concern to me when it comes to that horribly misnamed technology of big data. What matters is never size, least of all the size of the data, but the complexity of the question and of the data, and your abilities; but let’s not digress from the theme of this essay. You see, big data is so much a technology in search of an application that people are prepared to manipulate their data just to get anything out of it!

“If you torture data long enough, it will tell you anything,” said one of the greatest statisticians, John W Tukey. In the high-dimensional space of a feature-engineered data set, especially one that has been significantly oversampled, as when building a fraud detection system, we are always on the lookout for early signs of overfitting, failed cross-validations, or spurious, hard-to-interpret patterns. Add to that even more data, apply a non-transparent neural network whose output is impossible to interpret, and voilà: you have just “succeeded” with big data. Unfortunately, even if it seems to work, such models are usually short-lived and require such constant updating that I would question their value beyond being very temporary summaries of the most recent events, unable to generalise to something more important and underlying.
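
To see how easily tortured data “confesses”, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a recipe: the features and labels are pure random noise, and the sizes (40 training rows, 200 features, 1,000 fresh rows) and all function names are invented for this example. It searches the training rows for the single feature that best “predicts” the label, then checks that same feature on fresh data.

```python
import random

def make_noise_rows(n_rows, n_features, rng):
    """Rows of purely random binary features with purely random binary labels."""
    features = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_rows)]
    labels = [rng.randint(0, 1) for _ in range(n_rows)]
    return features, labels

def accuracy(features, labels, column, flip):
    """Share of rows where the (optionally inverted) feature equals the label."""
    hits = sum((row[column] ^ flip) == label for row, label in zip(features, labels))
    return hits / len(labels)

def best_single_feature(features, labels):
    """'Torture' the training data: keep whichever feature happens to fit best."""
    candidates = [
        (accuracy(features, labels, column, flip), column, flip)
        for column in range(len(features[0]))
        for flip in (0, 1)
    ]
    return max(candidates)

rng = random.Random(2016)  # fixed seed so the run is repeatable
train_x, train_y = make_noise_rows(40, 200, rng)
fresh_x, fresh_y = make_noise_rows(1000, 200, rng)

train_acc, column, flip = best_single_feature(train_x, train_y)
holdout_acc = accuracy(fresh_x, fresh_y, column, flip)

print(f"training accuracy of the 'discovered' feature: {train_acc:.2f}")
print(f"accuracy of the same feature on fresh data:    {holdout_acc:.2f}")
```

With so many candidate features and so few rows, some feature will look predictive on the training rows by chance alone, while on fresh data it should hover around a coin toss. That gap is the overfit to watch for before claiming success.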

I am not suggesting that it is impossible to build something complex or that big data is useless. Just look at the phenomenal success of recurrent neural networks in machine translation or convolutional nets in image recognition. But I would definitely not suggest either for your first projects. The temptation to claim success when it is a fleeting overfit that has just passed your tests is something I have seen, and experienced first hand… one learns from mistakes more than from a success.

Later on, when you have developed a stronger feeling for why simplicity trumps complexity, why transparency beats a black box, and why we need to be wary and critical of big data, that would be just the right time to apply these cool tricks to your work. By then even the technology might have moved on from its now poorly-documented, buggy beta, let’s hope. So no, do not start with a collaborative filtering recommendation engine before trying association rules for market basket analysis, and do not build rare event analysis systems with minute percentages of the things you want to find in a sea of noise… but do go ahead and work on that a little later, when you have succeeded at something more important, and easier, first.
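
To show how little machinery association rules need, here is a minimal sketch in plain Python of the three usual rule measures: support, confidence and lift. The baskets are made-up toy data and `rule_metrics` is a name invented for this illustration; a real project would use a proper association rules implementation over real transactions.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.

    Assumes at least one basket contains the antecedent and at least one
    contains the consequent (otherwise the divisions below would fail).
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_antecedent = sum(antecedent <= basket for basket in transactions)
    n_consequent = sum(consequent <= basket for basket in transactions)
    n_both = sum((antecedent | consequent) <= basket for basket in transactions)

    support = n_both / n                    # how often the whole rule occurs
    confidence = n_both / n_antecedent      # P(consequent | antecedent)
    lift = confidence / (n_consequent / n)  # confidence relative to the base rate
    return support, confidence, lift

# Toy baskets, invented for illustration only.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

support, confidence, lift = rule_metrics(baskets, {"bread"}, {"butter"})
print(f"bread -> butter: support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

In this toy data the rule has a confidence of 0.75 but a lift below 1, because butter is in most baskets anyway. A confident-looking rule is not necessarily an interesting one, and that is easy to explain to a business user.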

Before I suggest to you what would make a good first project, let me mention another bad first one: revenue or sales forecasting. Everyone wants us to build a system that can predict how much money the business will earn, or how many employees it needs to hire to staff its cash desks next quarter. Of course we can do it. There are popular packages, like forecast in R (see Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos), and plenty of well-understood theory on time series data. We have had ARIMA/ARTXP implemented in SQL Server Analysis Services Data Mining since 2008! So why not?

Because your real-world goal is never to just build a forecasting model. Your real, practical goal is to build something that is better than what your business already has. You will not impress your boss by building a forecast that is 80% as good as what they can already do. And good they are: businesses have honed their forecasting over decades, combining numbers, historical observations, intuition and useful human checks-and-balances simulation exercises better known (and hated) under the name of iterative bottom-up budgeting. Yes, that mind-numbing thing the whole corporation engages in on a yearly basis is, actually, pretty good at generating not only a reasonable forecast, but also a corporate behavioural mechanism that ensures disciplined compliance with it. How can one possibly succeed with a naive “I can predict your sales with an R² of 80%”? You cannot. You would either need to be better than what the business already has, beating it, not nearing it, or you would have to become a player in the process, supplying useful, believable facts and assertions that would enhance the existing forecasting processes. Do it, by all means, but do it when the business already trusts you, not when you are still trying to prove yourself.
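
The “beat it, do not near it” test can be made concrete by scoring your model against the forecast the business already has, on the same actuals. The sketch below is only an illustration: the numbers are invented, mean absolute error is just one reasonable choice of yardstick, and the function names are hypothetical.

```python
def mean_absolute_error(actual, forecast):
    """Average absolute gap between what happened and what was forecast."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def beats_baseline(actual, model_forecast, baseline_forecast):
    """True only if the model is strictly more accurate than the existing forecast."""
    model_error = mean_absolute_error(actual, model_forecast)
    baseline_error = mean_absolute_error(actual, baseline_forecast)
    return model_error < baseline_error

# Invented quarterly figures, for illustration only.
actual_sales = [100, 120, 130, 110]
budget_forecast = [105, 115, 128, 112]  # what the budgeting process produced
model_forecast = [90, 130, 140, 100]    # what a first-attempt model produced

print("budget MAE:", mean_absolute_error(actual_sales, budget_forecast))
print("model MAE: ", mean_absolute_error(actual_sales, model_forecast))
print("model beats the budget:", beats_baseline(actual_sales, model_forecast, budget_forecast))
```

In this made-up case the model tracks the shape of sales reasonably well, yet it is still worse than the budget everyone already trusts, so it would not impress anyone.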

Indeed, if you want to succeed with your early data science projects, focus on building the trust that the business yearns for. Deliver a few easy, low-hanging-fruit projects that are useful to, and needed by, your colleagues, and do not aim too high. Get those into production, or at least into the perception of the decision-makers, and keep building on that foundation until the time has come to try something riskier, something that is more likely to fail, but that could also deliver more value if it succeeded.

A few ideas for you to consider. Start with clustering for segmentation, maybe EM rather than k-means, and just get to see your data in some fuzzy, overlapping real-world clusters. In retail or anything consumer-focused, try customer lifetime-value modelling using a decision tree, or model customers’ propensity to buy product X. Do sequence clustering of key customer events, like purchases, returns or service calls, or the lack thereof, to see how customers transition through your organisation and why they eventually churn. Do a good market basket analysis: good, that is, at an appropriate level of a product hierarchy to be meaningful to the business. In finance, model the risk of credit default, the likelihood of contract failure or, perhaps again, lifetime or whole-period earnings. Look for outliers and small clusters of repeatable patterns that the business is not aware of. Try visualising, with a simple decision tree, the profitability of your business-to-business relationships, or their cost. In healthcare, consider the popular risk of hospital readmission analysis, not because the risk is such a useful metric, but because you will end up visualising the reasons for some clusters of readmissions that the hospital was not aware of, itself a superb outcome to hope for. In transportation, understand the clusters of traveller types, their needs and their typical journeys, but also look at their complaints and try to demonstrate to the business why increasing the satisfaction of suffering commuters is good for all involved. In logistics, look for the key reasons for bottlenecks. In entertainment, see which promotions which customers really like, and why… The list goes on and on, but have you noticed that none of these projects are moon-shots of any kind? If they seem a bit too easy, perhaps trivial: great. As long as your business needs them, this is the right place to start.
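
To illustrate why EM gives “fuzzy, overlapping” clusters where k-means gives hard ones, here is a minimal sketch of EM for a two-component Gaussian mixture on a single variable, in plain Python. Everything in it is an assumption made for illustration: the toy “spend” values, the starting values, the fixed number of iterations and the variance floor. A real project would use an established implementation on many variables.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def em_two_gaussians(xs, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM.

    Returns the component means and, for every point, its soft membership
    (a probability for each of the two clusters).
    """
    means = [min(xs), max(xs)]  # crude but deterministic starting points
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    memberships = []
    for _ in range(iterations):
        # E-step: how strongly does each point belong to each cluster?
        memberships = []
        for x in xs:
            densities = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(densities)
            memberships.append([d / total for d in densities])
        # M-step: re-estimate each cluster from its softly weighted points.
        for k in range(2):
            mass = sum(m[k] for m in memberships)
            means[k] = sum(m[k] * x for m, x in zip(memberships, xs)) / mass
            spread = sum(m[k] * (x - means[k]) ** 2 for m, x in zip(memberships, xs)) / mass
            variances[k] = max(spread, 1e-3)  # floor keeps a cluster from collapsing
            weights[k] = mass / len(xs)
    return means, memberships

# Toy customer spend in arbitrary units, invented for illustration.
spend = [0.8, 1.0, 1.1, 1.2, 3.0, 4.8, 5.0, 5.1, 5.2]
means, memberships = em_two_gaussians(spend)

for x, (low, high) in zip(spend, memberships):
    print(f"spend {x:>4}: low-spender {low:.2f}, high-spender {high:.2f}")
```

Each customer gets a probability of belonging to each segment rather than a single label, which is what lets an in-between customer sit visibly between two segments instead of being forced into one.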

Let me leave you with a couple more thoughts. Anything you can understand and explain in human terms is better than a black box, even a more performant one. Yes: I prefer a simple, old-fashioned decision tree that I can beautifully explain to a business user over a more accurate but inexplicable neural network, which I can always use later. The time for more accuracy will come in the future, but for now focus on trust, and transparency offers it. The more you make data science seem like a simple concept, the less it looks like magic, the quicker you will be understood and believed, and the success will come. Many of my customers tell me how they like what I invoice them for because of the simplicity and straightforward nature of the results. No one has ever complained that my model did not seem complex enough.
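
As a small illustration of what “explainable in human terms” can look like, the sketch below grows the simplest possible decision tree, a single split chosen by Gini impurity, and prints it as a sentence. The tenure and churn figures are invented, and `best_split` and `describe` are hypothetical helper names; a real tree would consider many variables and several levels.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Find the one threshold on a numeric variable that best separates the labels."""
    distinct = sorted(set(values))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or impurity < best[0]:
            best = (impurity, threshold, sum(left) / len(left), sum(right) / len(right))
    return best

def describe(variable, threshold, left_rate, right_rate):
    """Turn the split into a sentence a business user can read."""
    return (f"Customers with {variable} <= {threshold} churn {left_rate:.0%} of the time; "
            f"the rest churn {right_rate:.0%} of the time.")

# Invented data: months as a customer, and whether the customer churned (1) or stayed (0).
tenure_months = [1, 2, 3, 4, 5, 8, 10, 12]
churned = [1, 1, 1, 1, 0, 0, 0, 0]

impurity, threshold, left_rate, right_rate = best_split(tenure_months, churned)
print(describe("tenure in months", threshold, left_rate, right_rate))
```

The whole “model” is one sentence that a decision-maker can challenge, check against their own experience, and act on.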

Finally, do not feel that you have to deploy your model to production. Do not feel pressured into building a web service, a Shiny app or anything at all. Six out of ten of my projects are valued by my customers for the very knowledge they bring. Knowing why some almost-loss-making pizza sauce is an important product, just before they were going to demote it from the shelves, saved money. Knowing that a hi-fi speaker would sell better if marketing focused on an overlooked product was worth a lot of money and found new customers. Not closing a bank branch because it increased the lifetime wealth of a few important customers was a surprise that made perfect business sense, and in none of those cases did I need to hurt a single web service. Of course, four out of ten of my customers do deploy to production (nightly churn predictions, VIP customer promotions, etc.), but many are just delighted with the insights data science gave them into understanding themselves, something that would be easy and not too risky to deliver as one of your early projects.

Good luck with data science, and do consider joining me in the classroom one day: I have already trained some 200 new data scientists in just the last 12 months, many of whom delivered excellent first projects, so good that they even became poster children for Microsoft and their own organisations.
