Machine Learning Project Tips

There is a ton of good literature and other resources about Machine Learning these days. What I feel is usually missing are real-world guidelines or tips on how to get from a bunch of data and a fuzzy assignment to a feature that works for the business.

I will try to share some things I learned while working on such projects. The tips mentioned in this post should be especially helpful if your project matches the following criteria.

The project goals are not clearly set and the project falls into the category "everyone is doing ML, we should too...". This kind of project tends to be given an R&D label.
Your team does not have much experience in the ML domain and has no senior ML engineer who would be able to guide you.

Understanding data

The crucial point of doing any Machine Learning project is understanding the data you have.

If you are not an expert in the domain, get in contact with a domain expert inside or outside your organization and use their domain knowledge to get a better understanding of the data.

Look at different statistical parameters of the data. Draw histograms; PCA can be helpful too. I found doing lots of visualizations an absolute must.
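As a rough illustration of that first look, here is a minimal sketch using pandas and NumPy. The data frame, its column names and the random values are all made up for the example, and the PCA step is done with a plain SVD on standardized columns rather than with any particular library helper.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 rows, three numeric columns; "b" is built to track "a".
rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=2.0, size=200)
df = pd.DataFrame({
    "a": a,
    "b": 3.0 * a + rng.normal(scale=0.5, size=200),
    "c": rng.uniform(0.0, 1.0, size=200),
})

# Basic statistical parameters per column (count, mean, std, min, quartiles, max).
summary = df.describe()

# Histogram of one column as raw bin counts and bin edges.
counts, bin_edges = np.histogram(df["a"], bins=10)

# PCA via SVD on standardized columns: share of variance per principal component.
standardized = (df - df.mean()) / df.std()
_, singular_values, _ = np.linalg.svd(standardized.to_numpy(), full_matrices=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

print(summary)
print(counts)
print(explained_variance_ratio)
```

In a notebook you would normally plot the histogram instead of printing the counts. A first component that explains most of the variance hints that some columns carry nearly the same information, which is the case here by construction.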

Ensure the things you are looking for are actually in the data. It is frequently the case that you are searching for a pattern that is not actually there, and you end up spending a lot of time chasing a ghost.

Confirm or disprove your original assumptions about the data based on visualizations and statistics.
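A small sketch of what checking an assumption can look like. The assumption ("response time grows with payload size"), the column names and the synthetic data are all hypothetical, and the data is deliberately generated so that the assumption is false.

```python
import numpy as np
import pandas as pd

# Hypothetical assumption to check: "response time grows with payload size".
rng = np.random.default_rng(2)
payload_kb = rng.uniform(1.0, 100.0, size=500)
# Synthetic data in which the assumption is deliberately false: pure noise.
response_ms = rng.normal(loc=200.0, scale=20.0, size=500)
df = pd.DataFrame({"payload_kb": payload_kb, "response_ms": response_ms})

# Statistic: Pearson correlation between the two columns.
correlation = df["payload_kb"].corr(df["response_ms"])

# Same question in table form: mean response time per payload-size quartile.
df["payload_bucket"] = pd.qcut(df["payload_kb"], q=4, labels=["q1", "q2", "q3", "q4"])
by_bucket = df.groupby("payload_bucket", observed=True)["response_ms"].mean()

print(correlation)
print(by_bucket)
```

A correlation near zero and flat bucket means are a cheap signal that the pattern you hoped for may be a ghost, at least in this linear form, before you spend time on modelling it.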

Find a real issue or pattern and set your anchor there

A typical mistake is to take your dataset and more or less blindly apply stock ML algorithms that look like they might fit the data you have. What you should do first is understand the data available to you.

Use the neural network embedded in your head to find samples or subsets of your data where you can see, for example, an instance of the anomaly you are trying to detect, or the pattern or correlation you are looking for.

If you were able to find such interesting samples, set your anchor there and explore around. This is the point where you should start thinking about machine learning algorithms.
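One possible way to surface candidate samples for your eyes to inspect is sketched below. The readings, the injected spikes, the median-based score and the threshold of 8 are all assumptions made for the example. This is a crude filter for finding something to look at, not a detection algorithm.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with two spikes injected at known positions.
rng = np.random.default_rng(1)
values = rng.normal(loc=50.0, scale=1.0, size=300)
values[120] += 15.0
values[121] += 12.0
readings = pd.DataFrame({"value": values})

# Crude rule, only for surfacing candidates: distance from the median
# in units of the median absolute deviation (MAD).
median = readings["value"].median()
mad = (readings["value"] - median).abs().median()
readings["score"] = (readings["value"] - median).abs() / mad

# Rows that stand out; the threshold 8 is an arbitrary starting point.
candidates = readings[readings["score"] > 8]

# "Explore around": pull the neighbourhood of the first candidate for a closer look.
anchor = candidates.index[0]
window = readings.loc[anchor - 5 : anchor + 5]

print(candidates)
print(window)
```

Once you have a handful of such windows and understand why they look the way they do, you have an anchor to test any later algorithm against.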

The anchor is set

If you were able to get there and wrap your mind around interesting samples in the data, then you are well on your way to success.

Now is the time to really put some machine learning sauce in. You know what the thing you are searching for looks like, so you now have a much narrower field of ML options to pick from.

Toolset

We are engineers and we love our tools. Compared to the standard software development process, machine learning requires much more going back and forth to complete even the first successful iteration. Therefore you need a lightweight toolset that will not stand in your way.

Personally, I found Python, pandas, NumPy and Jupyter to be a great toolbox to experiment, visualize and calculate. I assume the same applies to R and its friends. You will have to do a lot of data slicing, dicing and cleanup. Python and pandas are great tools for that. Jupyter notebooks will help you quickly prototype and visualize the results.
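To give a feel for the kind of slicing, dicing and cleanup meant here, a minimal pandas sketch follows. The table, its column names and its defects (a duplicate row, a missing value, a number stored as garbage text) are invented for the example.

```python
import pandas as pd

# Hypothetical raw export with typical problems: a duplicate row,
# a missing value, and readings stored as text.
raw = pd.DataFrame({
    "device": ["a", "a", "b", "b", "b", "c"],
    "day": ["2020-01-01", "2020-01-01", "2020-01-01",
            "2020-01-02", "2020-01-02", "2020-01-01"],
    "reading": ["1.5", "1.5", "2.0", None, "2.4", "oops"],
})

# Cleanup: drop exact duplicates, parse dates, turn unparseable readings into NaN,
# then drop the rows that have no usable reading.
clean = (
    raw.drop_duplicates()
    .assign(
        day=lambda d: pd.to_datetime(d["day"]),
        reading=lambda d: pd.to_numeric(d["reading"], errors="coerce"),
    )
    .dropna(subset=["reading"])
)

# Slice: one device only.
device_b = clean[clean["device"] == "b"]

# Dice: mean reading per device.
per_device = clean.groupby("device")["reading"].mean()

print(clean)
print(per_device)
```

Each step is one short line you can rerun in a notebook cell, which is what makes the going back and forth cheap.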

I do not recommend jumping directly into more complex toolchains like Apache Spark, TensorFlow or other frameworks and libraries. You might use, for example, Apache Zeppelin as a Jupyter alternative, but I did not find it as flexible, and it brings additional complexity into the picture. Simplicity in your toolchain will help you focus on your main task.

The time to write your own algorithm implementation or apply some high-performance library or toolchain will come. But it should come only once you have a working prototype built with basic tools.

What next?

To keep it short, I will stop here for today. In a coming post I will try to share more tips I found practical for working with unlabeled data, and some of my insights into broadly used algorithms for clustering, sequence mining and prediction.

#lkolisko /dev/engineering