Starting Data Science projects with the right foot
Roberto Anzaldua Gil
4 Aug 2021
6 min read
Are you about to start a new Data Science project? Here’s a guide to use before you start looking into data, machine learning approaches and software engineering.
Before you continue, think for 10 seconds: who is running the project?
Are you confident in your answer? If your first thought was that an algorithms expert should be running the project, you got this one wrong. The individual running the project needs to be someone who really understands the business, someone who can articulate what brings value and make the hard decisions about what should be developed.
Wait, so what’s the point of algorithm experts then?
There needs to be a synergy between the individual running the project and the algorithm experts. The synergy works as follows:
The individual running the project is responsible for determining what is valuable, what questions they would like to answer, and how the outputs are going to be used. The algorithms experts are firstly in charge of determining whether this is a project that requires machine learning. Sometimes an analytics dashboard is all that is needed, or the outputs can be calculated perfectly well without machine learning. If it does turn out to be a good project for machine learning, the next step is explaining which use cases are possible, how long they take, and a very high-level view of what performance can be expected.
Discussions between the business team and the algorithms team are important. The more they understand each other, the easier it is going to be to make the right decisions about the budget, the data available, the performance required, how ethical and fair the project is, and all the other considerations needed to make it a successful project.
OK, so that was the summary. Getting deeper into certain aspects, let me start with an obvious but quite often forgotten activity: defining the use cases. You should be able to explain exactly how they bring value and why they should be pursued, that is, the end goal, not the means. If you have a clear understanding of this, and you make sure that both the business team and the algorithms team understand it as well as you do, then the project is starting on the right foot.
Is this an ML task?
The next point that I would like to emphasize is making sure that the task is appropriate for a machine learning algorithm. It can be extremely tempting to go head-on and use machine learning for all sorts of tasks, but if you can determine your outputs with consistent accuracy without a machine learning algorithm, then the project might not require ML. I'll illustrate with an example: if you are planning to build a prediction model for sales but you can already predict accurately by looking at historical data, then a prediction algorithm might increase the accuracy a little, which may bring almost no benefit to you. In this scenario, the issue is that implementing the algorithm, framing the problem, bringing it into your systems, and creating the different interactions might be a high cost to pay for a very small gain.
Dashboards and data visualization are not machine learning, nor is it recommended to have a data scientist build this type of tool. This is a software engineering task. Data scientists will be great at suggesting which visualizations might be more suitable when they are paired with the business professionals, but ultimately it is ideal if a software engineer builds the tool.
Who are the users, where is it going to live, how is it going to be presented?
Knowing the outputs is a good start, but you also need to make sure that you understand who your users are and where and how the outputs are going to be presented. For example, is the system substituting something you already have? Is it easy to "just replace the previous output"? If it is a new output, how is it going to be used?
Can your current infrastructure support the new system? Do you know what integrations are needed for it to live in your ecosystem?
You also have to make sure you pick the right visualization for what you are presenting so that users can really understand what is going on.
Models will never be perfect. What are you going to do in the cases where the model's performance is bad? Is it something you can live with?
Remember that the model is likely to work better in some scenarios and worse in others. If you do a deep analysis, you can even decide whether the model should be used in some cases and excluded in others.
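To make this concrete, here is a minimal sketch of that decision: keep the model only in the segments where its average error beats the baseline. The per-segment error figures here are invented for illustration.

```python
from statistics import mean

# Hypothetical per-segment errors: (segment, model_error, baseline_error).
records = [
    ("weekday", 8.0, 12.0), ("weekday", 6.0, 11.0),
    ("weekend", 15.0, 9.0), ("weekend", 18.0, 10.0),
]

def segments_to_keep(records):
    """Use the model only in segments where its mean error beats the baseline."""
    segments = {}
    for seg, model_err, base_err in records:
        segments.setdefault(seg, []).append((model_err, base_err))
    keep = []
    for seg, pairs in segments.items():
        if mean(m for m, _ in pairs) < mean(b for _, b in pairs):
            keep.append(seg)
    return keep

print(segments_to_keep(records))  # here the model only wins on weekdays
```

In production this gating logic would sit in front of the model, falling back to the baseline for the excluded segments.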
Even if you are not replacing anything, you have to think about a reasonable baseline. Ground truth can be your baseline if you can calculate it, but sometimes it’s not available. The point is to create a reasonable baseline. For example, if you are predicting sales, your baseline is the formula you had before you wanted to introduce a model to make these predictions. If you don't have one, try to make a reasonable one that does not require a model, for example, using the average of the last two years.
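The two-year-average baseline described above takes only a few lines; the sales figures below are invented for illustration.

```python
from statistics import mean

# Hypothetical monthly sales figures, keyed by (year, month).
sales = {
    (2019, 7): 120.0, (2020, 7): 150.0,
    (2019, 8): 110.0, (2020, 8): 140.0,
}

def baseline_forecast(year, month):
    """Model-free baseline: average the same month over the last two years."""
    history = [sales[(year - k, month)] for k in (1, 2) if (year - k, month) in sales]
    return mean(history)

print(baseline_forecast(2021, 7))  # average of July 2019 and July 2020
```

However simple this looks, it is exactly the kind of yardstick a model has to beat to justify its cost.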
Baselines are important because they give you the tools to answer:
What would the performance be if I weren't using machine learning?
This is key to understanding the value of introducing machine learning models. Note that in some cases it is difficult to have a baseline. With classification models, for instance, you can only estimate the benefit of the model if the items were being classified manually before its introduction.
Algorithms optimise for a target: for example, mean absolute error, accuracy, or recall.
When selecting algorithm targets, the key factor is what will help advance the business. Metrics like mean absolute error are great for algorithms, but you should find a way to estimate value in terms of the metrics that you care about and that are easier for stakeholders to understand. Continuing with the sales prediction scenario:
What do I get as a business if the model I am using is 20% better than my formula? How much do I save?
Making sure that you understand this is crucial to knowing whether it makes sense to build the model in the first place. Secondly, if you are considering improving an existing machine learning model, it helps you determine whether you should spend more time finding better features.
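As a hedged illustration of turning a metric improvement into business terms, the sketch below assumes a made-up fixed cost per unit of forecast error; your own cost model will certainly differ.

```python
# Assumption: each unit of forecast error costs a fixed amount
# (e.g. overstock storage or missed sales). All numbers are invented.
COST_PER_UNIT_ERROR = 3.0   # currency per unit mis-forecast
FORECASTS_PER_YEAR = 12     # one forecast per month

baseline_mae = 50.0         # mean absolute error of the old formula
model_mae = 40.0            # a model that is 20% better than the baseline

improvement = baseline_mae - model_mae
annual_saving = improvement * COST_PER_UNIT_ERROR * FORECASTS_PER_YEAR
print(f"Estimated annual saving: {annual_saving:.2f}")
```

Comparing that saving against the cost of building and maintaining the model answers the question above.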
What is your testing plan?
Testing is specific to the problem and the models you are using, and it is essential for making sure that your model is robust enough to minimise the risk of undesirable results once it moves into production. The following list contains useful strategies, but it is not complete by any means:
Test historical scenarios. This means using historical data, taking the outputs of the machine learning model, and evaluating how much it would have helped under hypothetical scenarios.

Performance across time. Models often need to be tested across time to understand how they perform in different situations. Take various periods of time and, for each period, do your training and evaluate the performance. To illustrate with the sales model again: is your model performing well during high seasons? Do you notice seasons that are easier to predict?

Analyse the weaknesses of the model. Never focus on a single global metric; check metrics in different situations. For example, does your model perform far better on weekdays than on weekends? If you are predicting categories, is there a particular category where your model just doesn't work? Or a category where it works amazingly well?

Plan A/B testing. To make sure that your model is delivering better results than what you had before, it is important to introduce A/B testing. This means testing your model and your baseline under the same conditions and evaluation metrics. You can also use it to test different models and see which one produces better outcomes.
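The "performance across time" idea can be sketched as a rolling backtest: train on a window, evaluate on the next period, then slide forward. The window-average "model" below is a toy stand-in; in practice you would call your real training and prediction code at that step.

```python
def rolling_backtest(series, train_size, horizon=1):
    """Return the absolute error for each evaluation period of a sliding window."""
    errors = []
    for start in range(0, len(series) - train_size - horizon + 1):
        train = series[start:start + train_size]
        actual = series[start + train_size]
        forecast = sum(train) / len(train)  # toy "model": the window average
        errors.append(abs(forecast - actual))
    return errors

# Invented monthly sales; each entry beyond the first window gets evaluated once.
monthly_sales = [100, 110, 105, 130, 125, 140, 135]
print(rolling_backtest(monthly_sales, train_size=3))
```

Inspecting the per-period errors, rather than a single average, is what reveals the seasonal strengths and weaknesses mentioned above.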
A final note on testing: prioritize tests that are relevant to what you are trying to achieve. If you are making a sales prediction model, make sure that you test that the performance is consistent across time and that it is what you expect it to be. Don't over-focus on scenarios that are very unlikely to happen. Let's elaborate: imagine that this sales prediction model is shown in a web application that is going to be viewed by a maximum of 10 people simultaneously. Testing that your environment supports 200 people at the same time is not useful, nor should it be a current concern. Testing that your environment supports at least 20 people, in case it is needed in the future, is sufficient and reasonable.
Why is monitoring important?
In a nutshell, because we live in an ever-changing world that can drastically change the performance of models when unprecedented situations happen. Think about how the current pandemic changed shopping and investment habits. As mentioned in this article, we can see how various models saw their performance drop drastically, becoming chaotic.
How do you monitor your model? Create minimum performance metrics, and whenever they are not met, the system needs to notify you. Prepare a backup plan that decides when you should shut down the model completely, and ideally have another way of replacing the model temporarily.
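A minimal sketch of such a monitor, assuming a hypothetical `alert` hook (a pager, email, or logging call in a real system) and a made-up error budget:

```python
def check_model_health(recent_mae, max_allowed_mae, alert):
    """Call `alert` when the error budget is exceeded; return whether the model is healthy."""
    if recent_mae > max_allowed_mae:
        alert(f"Model MAE {recent_mae:.1f} exceeds limit {max_allowed_mae:.1f}")
        return False  # the caller can switch to the baseline formula here
    return True

alerts = []
check_model_health(recent_mae=62.0, max_allowed_mae=50.0, alert=alerts.append)
print(alerts)
```

The boolean return value is the hook for the backup plan: when it is false, the serving layer can temporarily fall back to the baseline while the model is investigated.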
Is it ethical? Is it fair?
Think about the individuals affected by your model and whether the model is fair for them. Evaluate different groups and check whether your model is favoring certain groups, and also, as much as possible, make sure that the data points you use are ethical. For example, if you are using location as a data point, it can be a proxy for income and race. Go to this article for an extended analysis of ethical and fair algorithms.
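One simple check along these lines, sketched with invented predictions and group labels, is to compare the model's positive-prediction rate across groups (a demographic-parity style check):

```python
def positive_rate_by_group(predictions, groups):
    """Return {group: fraction of positive predictions} for each group."""
    totals, positives = {}, {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if pred else 0)
    return {g: positives[g] / totals[g] for g in totals}

# Invented binary predictions and the group each individual belongs to.
preds  = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
rates = positive_rate_by_group(preds, groups)
print(rates)  # a large gap between groups deserves scrutiny
```

This is only one of several fairness notions; equalised odds or calibration across groups may be more appropriate depending on the use case.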
In this article we reviewed some considerations to take into account before even starting a data science project. Remember, it is not an exhaustive or complete guide. It is meant to help you start your projects on the right foot!
I hope you enjoyed this article. I'm very passionate about data science, so if you are interested in any specific topic, feel free to message me.