How to Build a Machine Learning Project - A Step by Step Guide


There are a lot of steps here, but don't treat this as an exhaustive list. We've only included the most common steps you'll come across while building a machine learning project. The number one thing to remember is: machine learning is experimental. Try something, see if it works, and if not, try something else. Your primary goal for machine learning problems should be reducing the time between your experiments. If something is missing, please let us know.

The core steps required for building a machine learning project are:


1. Problem Definition and Data Collection
2. Data Preparation
3. Model Training 
4. Analysis (Accuracy, Precision and other measures)
5. Model Deployment
6. Retrain the Model

Let’s dive a little deeper into each step.

1. Problem Definition and Data collection

To define a machine learning problem, you need to know these three types of machine learning and the kinds of problems that fall under each type.

1st Type : Supervised learning

Supervised learning is called supervised because you have data and labels. A machine learning algorithm tries to learn what patterns in the data lead to the labels. The supervised part happens during training. If the algorithm guesses a wrong label, it tries to correct itself.

For example, suppose you were trying to predict heart disease in a new patient. You may have the anonymized medical records of 100 patients as the data and whether or not they had heart disease as the label.

Problems with labels come under supervised learning

2nd Type : Unsupervised learning

Unsupervised learning is when you have data but no labels. The data could be the purchase history of your online video game store customers. Using this data, you may want to group similar customers together so you can offer them specialized deals. You could use a machine learning algorithm to group your customers by purchase history.

3rd Type : Transfer learning

Transfer learning is when you take the information an existing machine learning model has learned and adjust it to your own problem. 

For example, after training a classifier to predict whether an image contains food, you could reuse the knowledge it gained during training to recognize drinks.

Now you can identify the problem, so let's move on to data collection.

- Data Collection

Find the data sources you will use in your ML project. Once you have your data sources, it's time for data collection.

Ask yourself: What type of data do you have? How does it match the problem definition? Is your data structured or unstructured? Static or streaming?

Different types of data you should know

There are two main types of data: i) structured data and ii) unstructured data.

1. Structured Data: Data which appears in tabulated format (rows and columns style, like what you'd find in an Excel spreadsheet). Can contain different types of data, for example, numerical, categorical, time series.

Different formats of structured data are

a. Nominal/categorical: One thing or another (mutually exclusive). For example, for car sales, color is a category. A car may be blue but not white. Order does not matter.

b. Numerical: Any continuous value where the difference between values matters. For example, when selling houses, $103,000 is more than $50,000.

c. Ordinal: Data which has order but the distance between values is unknown. For example, questions such as, how would you rate your health from 1-10? 1 being poor, 10 being healthy. You can answer 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10, but the distance between values isn't defined: an answer of 10 doesn't necessarily mean you are 10 times as healthy as an answer of 1.

d. Time series: Data across time. For example, the historical sale values of Bulldozers from 2014-2020.

2. Unstructured Data: Data with no rigid structure (images, video, natural language text, speech).


2. Steps for Data preparation

There are three steps for data preparation:

i. Exploratory data analysis (EDA) - Learning about the data you're working with
ii. Data Preprocessing - Preparing data for model training
iii. Data Splitting - Splitting data into sets

i. Exploratory data analysis (EDA) - Learning about the data you're working with.

Steps for EDA

1st Step - What are the feature variables (input) and the target variables (output)? : For example, for predicting heart disease, the feature variables may be a person's age, weight, average heart rate and their level of physical activity. And the target variable will be whether or not they have heart disease (note: this example has been very simplified). 

2nd Step - What kind of data do you have? Structured, unstructured, categorical, numerical : Create a data dictionary for what each feature is. For example if you've got a column of numbers called "hr", how would someone else know that actually means Heart Rate?

3rd Step - Are there missing values? Should you remove them or fill them with feature imputation (see below)?

4th Step - Where are the outliers? How many of them are there? Are they out by much (3+ standard deviations)? Why are they there?

5th Step - Are there questions you could ask a domain expert about the data? : For example, would a heart disease physician be able to shed some light on your heart disease dataset?

Find the answers to the questions above and act on them step by step, as your project requires.
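To make the 3rd and 4th steps concrete, here is a minimal sketch that counts missing values and flags outliers in one column. The `heart_rates` list and the `summarize_column` helper are made up for illustration, and it assumes missing entries are stored as `None`. In practice you would likely do this with a dataframe library rather than by hand.

```python
from statistics import mean, stdev

def summarize_column(values, z_threshold=3.0):
    """Count missing entries and flag values far from the mean."""
    present = [v for v in values if v is not None]
    mu, sigma = mean(present), stdev(present)
    # An outlier here is anything more than z_threshold standard deviations out.
    outliers = [v for v in present if abs(v - mu) > z_threshold * sigma]
    return {
        "n_missing": len(values) - len(present),
        "mean": mu,
        "std": sigma,
        "outliers": outliers,
    }

# Hypothetical "hr" (heart rate) column; None marks a missing value.
heart_rates = [72, 68, None, 75, 70, 69, 74, 71, 73, 67, 76, 70, 72, 190, None, 71]
summary = summarize_column(heart_rates)
print(summary["n_missing"], summary["outliers"])
```

Whether a flagged value is a data entry error or a real (and important) measurement is exactly the kind of question to take to a domain expert.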

ii. Data Preprocessing

After learning about the data you're working with, it's time to move on to data preprocessing: preparing your data to be modelled.

👉 1st Step of data preprocessing is Feature imputation: To fill the missing values (a machine learning model can't learn on data that isn't there).

Types of Imputation you should know

a. Single imputation: Fill with mean, median of column.

b. Multiple imputation: Model the missing values using the other features and fill with what your model finds.

c. KNN (k-nearest neighbors): Fill data with a value from another example which is similar.

d. Many more, such as, random imputation, last observation carried forward (for time series), moving window, most frequent.
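As a rough illustration of single imputation and last observation carried forward, here is a small sketch using only the standard library. The function names and the `ages` data are hypothetical, and missing values are assumed to be `None`. Libraries such as scikit-learn ship ready-made imputers (for example `SimpleImputer` and `KNNImputer`) that you would normally reach for first.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Single imputation: fill every missing value with one summary statistic."""
    present = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "most_frequent": mode}[strategy](present)
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Last observation carried forward (useful for time series)."""
    filled, last = [], None
    for v in values:
        last = v if v is not None else last
        filled.append(last)
    return filled

ages = [34, None, 51, 46, None, 25]  # hypothetical column with two gaps
print(impute(ages, "mean"))
print(impute(ages, "median"))
print(forward_fill(ages))
```

Note that `forward_fill` leaves a leading gap as `None`, since there is no earlier observation to carry forward.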

👉 2nd Step of Data Preprocessing is Feature encoding : Turning values into numbers (A machine learning model requires all values to be numerical)

Different types of Feature Encoding you should know

a. One Hot Encoding: Turn all unique values into lists of 0's and 1's where the target value is 1 and the rest are 0's. For example, with car colors green, red, blue, a green car's color feature would be represented as [1, 0, 0] and a red one would be [0, 1, 0].

b.  Label Encoder: Turn labels into distinct numerical values. For example, if your target variables are different animals, such as dog, cat, bird, these could become 0, 1, 2 respectively.

c. Embedding encoding: Learn a representation amongst all the different data points. For example, a language model is a representation of how different words relate to each other. Embeddings are also becoming more widely available for structured (tabular) data.
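The first two encodings are simple enough to sketch by hand. The helpers below are illustrative only (their names are made up). They reproduce the car color and animal examples from the list above. One assumption to be aware of: this label encoder numbers labels in the order they first appear, whereas scikit-learn's `LabelEncoder` sorts the classes first.

```python
def one_hot_encode(values, categories):
    """Turn each value into a list with a 1 at its category's position."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def label_encode(values):
    """Map each distinct label to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping

colors = ["green", "red", "blue", "red"]
print(one_hot_encode(colors, ["green", "red", "blue"]))

codes, mapping = label_encode(["dog", "cat", "bird", "cat"])
print(codes, mapping)
```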

👉 Third Step of Data Preprocessing is Feature normalization (scaling) or standardization :  When your numerical variables are on different scales (e.g. number_of_bathrooms is between 1 and 6 and size_of_land between 7,000 and 19,000 sq. feet), some machine learning algorithms don't perform very well. Scaling and standardization help to fix this.

- Feature scaling (also called normalization) shifts your values so they always appear between 0 and 1. This is done by subtracting the min value and dividing by the max minus the min. For example, with number_of_bathrooms between 1 and 6, 1 and 4 bathrooms would become 0.0 and 0.6 respectively (between 0 and 1 but the numerical relationship is still there).

- Feature standardization standardizes all values so they have a mean of 0 and unit variance. It happens by subtracting the mean and dividing by the standard deviation of that particular feature. Doing this means values do not end up between 0 and 1 (their range is uncapped). Standardization is generally more robust to outliers than feature scaling.
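Both formulas fit in a few lines. This is a sketch with made-up helper names and a hypothetical `bathrooms` column. It assumes the column has at least two distinct values, otherwise the divisions fail. scikit-learn's `MinMaxScaler` and `StandardScaler` do the same job on whole datasets.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """(x - min) / (max - min): squeezes values into the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """(x - mean) / std: result has mean 0 and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

bathrooms = [1, 2, 4, 6]  # hypothetical number_of_bathrooms column
scaled = min_max_scale(bathrooms)
standardized = standardize(bathrooms)
print(scaled)
print(standardized)
```

Whichever you pick, compute the min/max or mean/standard deviation on the training data only and reuse those numbers for the validation and test data.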

👉 Fourth Step of Data Preprocessing is Feature engineering : Transform data into (potentially) more meaningful representations by adding in domain knowledge.

Different ways to represent the data are

a. Decompose - Turn a date (such as 2021-03-31 18:00:36) into hour_of_day, day_of_week, day_of_month, is_holiday, etc. 

b. Discretization - Turning larger groups into smaller groups

- For numerical variables, let's say age, you may want to move it into buckets, such as, over_50 or under_50. Or, 21-30, 31-40, etc. This process is also known as binning (putting data into different 'bins').

- For categorical variables such as car color, this may mean combining colors such as 'light green', 'dark green' and 'lime green' into a single 'green' color

c. Combining two or more features - Difference between two features, such as using house_last_sold and current_date to get time_on_market.

d. Indicator features: Using other parts of the data to indicate something potentially significant

- Create an X_is_missing feature for wherever a column X contains a missing value.

- If you're analyzing traffic sources from the web, you could add is_paid_traffic as feature for advertisements which have been paid for.

- Does a particular sample satisfy more than one criterion? For example, if you were analyzing car sales data, does the car have less than 100,000 km on it, is it automatic and is it under 10 years old? Perhaps you know from experience these cars are generally worth more. You could make a special feature called under_100km_auto_under_10yo.

👉 Fifth Step of Data Preprocessing is Feature selection: Selecting the most valuable features of your dataset to model. This can potentially reduce overfitting and training time (less overall data and less redundant data to train on) and improve accuracy.

- Dimensionality reduction: A common dimensionality reduction method, PCA or Principal Component Analysis, takes a larger number of dimensions (features) and uses linear algebra to reduce them to fewer dimensions. For example, say you have 10 numerical features, you could run PCA to reduce them down to 3.
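A few of the transformations above can be sketched as tiny helper functions. All names here are hypothetical, the age buckets assume decade-wide bins starting at 21, and `day_of_week` follows Python's convention of Monday being 0.

```python
from datetime import datetime

def decompose_timestamp(text):
    """Decompose: split one timestamp string into several simpler features."""
    ts = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return {"hour_of_day": ts.hour, "day_of_week": ts.weekday(), "day_of_month": ts.day}

def age_bucket(age, start=21, width=10):
    """Discretization (binning): map an age to a bucket such as '21-30'."""
    lower = start + ((age - start) // width) * width
    return f"{lower}-{lower + width - 1}"

def add_missing_indicator(rows, column):
    """Indicator feature: add <column>_is_missing to every row."""
    return [{**row, f"{column}_is_missing": row.get(column) is None} for row in rows]

print(decompose_timestamp("2021-03-31 18:00:36"))
print(age_bucket(25), age_bucket(40))
print(add_missing_indicator([{"hr": 72}, {"hr": None}], "hr"))
```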

- Feature importance (post modelling). Fit a model to a set of data, then inspect which features were most important to the results, remove the least important ones.

- Wrapper methods such as genetic algorithms and recursive feature elimination involve creating large subsets of feature options and then removing the ones which don't matter. These have the potential to find a great set of features but can also require a large amount of computation time. TPOT does this.

👉 Sixth Step of Data Preprocessing is Dealing with imbalances: Does your data have 1000 examples of one class but only 100 examples of another? What should you do in such a situation?

- Collect more data (if you can)

- Use the scikit-learn-contrib imbalanced-learn package (a scikit-learn compatible Python library for dealing with imbalanced datasets)

- Use SMOTE (synthetic minority over-sampling technique) which creates synthetic samples of your minor class to try and level the playing field. You can access an implementation of SMOTE contained within the scikit-learn-contrib imbalanced-learn package

- A helpful (and technical) paper to look at is "Learning from Imbalanced Data". You could use this as a starting point to go onto the next thing.
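To show the basic idea of levelling the playing field, here is a sketch of plain random oversampling. To be clear, this is not SMOTE: it only duplicates existing minority samples at random instead of synthesizing new ones. The function name and toy data are made up. For real projects, the imbalanced-learn package mentioned above provides tested implementations, including `SMOTE`.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

rows = list(range(13))                     # stand-ins for real samples
labels = ["major"] * 10 + ["minor"] * 3    # imbalanced: 10 vs 3
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Only oversample the training set. Duplicating samples before splitting would leak copies of the same example into the validation or test set.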

iii. Data Splitting

Your data is now prepared for modelling. But before training the model, you have to split the data into sets.

Different types of sets you should know

a. Training set (usually 70-80% of data): Model learns on this.

b. Validation set (typically 10-15% of data): Model hyperparameters are tuned on this

c. Test set (usually 10-15%): The model's final performance is evaluated on this. If you've done it right, hopefully the results on the test set give a good indication of how the model should perform in the real world. Do not use this dataset to tune the model.
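A three-way split can be sketched like this. The function name, the 70/15/15 fractions and the fixed seed are assumptions for illustration. It shuffles first, which is fine for independent samples but not for time series, where you would split by time instead. scikit-learn's `train_test_split` is the usual tool for this (call it twice to get three sets).

```python
import random

def train_val_test_split(samples, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then carve off the test and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```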


3. Train model on the data you've prepared 

Core Steps of training a model:
 
i. Choose an algorithm
ii. Overfit the model
iii. Reduce overfitting with regularization

i. Choosing an algorithm

Many of the algorithms you see here are already implemented for you in libraries such as Scikit-Learn. Always look to use a pre-implemented version of a machine learning model before building your own. If in doubt, search something like "[ALGORITHM_NAME] scikit learn". Every learning algorithm has:

- a loss function (how wrong the model is)

- an optimization criterion (a criterion for improving the loss, also called an optimizer, like stochastic gradient descent)

- an optimization routine which uses training data to find a solution for the optimization criterion (use a plethora of examples of what a model 'should' learn to improve what it 'does' learn)
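To see those three pieces working together, here is a minimal sketch that fits a straight line. The loss function is mean squared error, and the optimizer is plain full-batch gradient descent (a simpler cousin of stochastic gradient descent, chosen here to keep the example deterministic). The data, learning rate and step count are made up for illustration.

```python
def mse_loss(w, b, xs, ys):
    """Loss function: how wrong is the line y = w*x + b on this data?"""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Optimization routine: repeatedly nudge w and b downhill on the loss."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

loss_before = mse_loss(0.0, 0.0, xs, ys)
w, b = gradient_descent(xs, ys)
loss_after = mse_loss(w, b, xs, ys)
print(round(w, 3), round(b, 3), loss_before, loss_after)
```

This is only to build intuition. For real work, use a library implementation as recommended above.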

Different types of algorithms 

Machine Learning algorithms are mainly classified into two types

i. Supervised Algorithms
ii. Unsupervised Algorithms

Supervised Algorithms are further classified as

a. Linear Regression: Draw a line that best fits data scattered on a graph. Produces continuous variables (e.g. height in inches). 

b. Logistic Regression: Predicts a binary outcome based on a series of independent variables. For example, if you're trying to predict whether someone has heart disease or not based on their health parameters.

c. k-Nearest Neighbours: Given a new sample, find the 'k' training examples which are most similar to it. Which label do most of those neighbours have? That becomes the prediction for the new sample.

d. Support Vector Machines (SVMs): Can be used for classification or regression. Try to find the best way to separate data points using one or more planes (referred to as hyperplanes). Imagine a road running through a city with red cars on the left and green cars on the right, the SVM is the road. With more data, you might need more dimensions (more roads on different angles and cars on different heights).

e. Decision Trees and Random Forests: Can be used for classification and regression (very valuable algorithm for structured data). Decision trees split data based on criteria such as, "is over 50" and "has average heart rate lower than 65" eventually getting to a point where the data can't be split anymore (you can define this). Random Forests are a combination of many decision trees, effectively leveraging and combining the choices of many models (this technique of using a combination of models is known as ensembling)

f. AdaBoost/Gradient Boosting Machines (also just known as boosting): Can be used for classification or regression. Asks the question, can a series of weak learners be turned into a strong learner? For example, train a series of weak learners whose job it is to improve each other (another example of an ensemble method, combining multiple weaker models to create a better one).

- XGBoost algorithm

- CatBoost algorithm

- LightGBM algorithm

g. Neural networks (also called deep learning): Can be used for classification or regression. Takes a series of inputs, manipulates the inputs with linear (dot product between weights and inputs) and nonlinear functions (activation function). Do this multiple times (at least once for each neuron in a model). The combination of linear and non-linear functions (straight and non-straight lines) means neural networks can estimate almost anything.

- Convolutional neural networks (typically used for computer vision).

- Recurrent neural networks (typically used for sequence modelling).

- Transformer networks (can be used for vision and text, starting to replace RNNs).
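Of the supervised algorithms above, k-nearest neighbours is the easiest to write out in full, so here is a small sketch of it for classification. The function name and the toy points are hypothetical. It assumes numeric features compared with straight-line (Euclidean) distance, and a majority vote among the k closest training points. scikit-learn's `KNeighborsClassifier` is the pre-implemented version.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Two hypothetical, well-separated groups of 2-feature samples.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (2.0, 2.0)))
print(knn_predict(train_points, train_labels, (8.0, 9.0)))
```

Because it relies on distances, this is one of the algorithms where feature scaling matters a lot.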

Unsupervised Algorithms are further classified as

a. Clustering: K-Means clustering: choose 'k' number of clusters, each cluster receives a centre node (called a centroid) at random. With each iteration, every sample is assigned to its closest centroid and each centroid moves to the mean of the samples assigned to it. Once the centroids stop moving, each sample is assigned a value equivalent to the closest centroid.

b. Visualization and dimensionality reduction

- Principal Component Analysis (PCA): reduce data from more dimensions to lower dimensions whilst attempting to preserve the variance.

- Autoencoders: Learn a lower dimensional encoding of data. For example, compress an image of 100 pixels into 50 pixels representing (roughly) the same information as the 100 pixels.

- T-Distributed Stochastic Neighbor Embedding (t-SNE): good for visualizing high-dimensionality data in a 2D or 3D space

c. Anomaly detection

- Autoencoder: Use an autoencoder to reduce the dimensionality of the inputs of a system and then try to recreate those inputs within some threshold. If the recreations aren't able to match the threshold, there could be some sort of outlier

- One-class classification: train a model on only one class (e.g. normal events of computer network traffic, which are usually in abundance). If anything lies outside of this class, it may be an anomaly. Algorithms for doing so include one-class K-Means, one-class SVM (support vector machine), isolation forest and local outlier factor.
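Going back to K-Means from the clustering item, the standard algorithm alternates between two steps: assign every sample to its nearest centroid, then move each centroid to the mean of its samples. Here is a bare-bones sketch of that loop. The function name, the fixed iteration count and the toy points are assumptions for illustration. scikit-learn's `KMeans` adds smarter initialization and convergence checks.

```python
import random
from math import dist

def k_means(points, k, iterations=20, seed=0):
    """Alternate between assigning points to centroids and re-centering them."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    assignments = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
    return centroids, assignments

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
print(centroids)
print(assignments)
```

The cluster numbers themselves (0 or 1) are arbitrary. What matters is which samples end up grouped together.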

Types of learning you should know

a. Batch learning: All of your data exists in a big static warehouse and you train a model on it. You may train a new model once per month once you get new data. Learning may take a while and isn't done often ("don't stuff this one up"). Runs in production without learning (though can be retrained later).

b. Online learning: Your data is constantly being updated and you constantly train new models on it. Each learning step is usually fast and cheap (as opposed to batch learning). Runs in production and learns continuously.

c. Transfer Learning: Take the knowledge one model has learned and use it with your own. Gives you the ability to leverage SOTA (state of the art) models for your own problems.

- Helpful if you don't have much data or vast compute resources. Use resources such as TensorFlow Hub or PyTorch Hub for broad model options and HuggingFace transformers (NLP models) and Detectron2 (computer vision models) for specific models.

d. Active Learning: Also referred to as "human in the loop" learning. A human expert interacts with a model and provides updates to labels for samples which the model is most uncertain about. 

e. Ensembling: Not really a form of learning, more combining algorithms which have already learned in some way to get better results. For example, leveraging the "wisdom of the crowd".

Now, you have prior knowledge of using different algorithms for various problems. Choose an algorithm as per your requirement and move towards the next phase i.e. fitting the model

ii. Overfitting and Underfitting the Model

There are two types of fitting in machine learning

a. Underfitting: Happens when your model doesn't perform as well as you'd like on your data. Try training for longer or using a more advanced model.

b. Overfitting: Happens when your validation loss (how your model is performing on the validation dataset, lower is better) starts to increase while the training loss keeps decreasing. Or if you don't have a validation set, happens when the model performs far better on the training set than on the test set (e.g. 99% accuracy on training set, 67% accuracy on the test set). Fix through various regularization techniques.

Regularization: a collection of techniques to prevent/reduce overfitting.

Steps in Regularization

1st Step - L1 (lasso) and L2 (ridge) regularization: L1 regularization sets unneeded feature coefficients to 0 (performs feature selection on which features are most essential and which aren't, useful for model explainability). L2 constrains a model's feature coefficients (won't set them to 0).

2nd Step - Dropout: Randomly remove parts of your model so the rest of it has to become better.

3rd Step - Early stopping: Stop your model from training before the validation loss starts to increase too much or, more generally, when any other metric has stopped improving. Early stopping is usually implemented in the form of a model callback.

4th Step - Data augmentation: Manipulate your dataset in artificial ways to make it 'harder to learn'. For example, if you're dealing with images, randomly rotate, skew, flip and adjust the height of your images. This makes your model have to learn similar patterns across different styles of the same image (harder). Note: since this can be compute intensive, it's a good idea to do this in memory. See functions like ImageDataGenerator in Keras or transforms in torchvision.

- Image transformation in PyTorch

- Image augmentation in Keras/TensorFlow

- EDA (easy data augmentation): text augmentation with Python (boost performance on text classification tasks)

- TextAttack: framework for adversarial attacks, data augmentation and modeling training in NLP.

5th Step - Batch normalization: standardize the inputs to a layer (zero mean and unit variance) and then apply two learned parameters (gamma, how much to scale the values, and beta, how much to offset them), with a small constant epsilon added to avoid division by zero, before they go into the next layer. This often results in faster training. May be a replacement for dropout in some networks.

Once you've dealt with overfitting and underfitting, the next step is hyperparameter tuning.
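Of the techniques above, early stopping is the easiest to show in isolation. The sketch below is framework-free and works on a plain list of validation losses. The function name, the `patience` parameter (how many epochs without improvement to tolerate) and the loss values are all made up for illustration. Deep learning frameworks offer the same logic as a ready-made callback.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (epoch where training stops, epoch with the best validation loss)."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch   # new best: remember it
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch              # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch        # never triggered: ran to the end

# Hypothetical validation loss per epoch: improves, then starts to climb.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60]
stopped_at, best_epoch = early_stopping_epoch(val_losses)
print(stopped_at, best_epoch)
```

In a real training loop you would also save the model weights at the best epoch and restore them after stopping.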

iii. Hyperparameter Tuning - Run a bunch of experiments with different model settings and see which works best.

Setting a learning rate (often the most important hyperparameter), generally, high learning rate = algorithm rapidly adapts to new data, low learning rate = algorithm adapts slower to new data (e.g. for transfer learning). 

- Finding the optimal learning rate: train the model for a few hundred iterations starting with a very low learning rate (e.g. 10e-6) and slowly increase it to a very large value.

- Learning rate scheduling (use Adam optimizer) involves decreasing the learning rate slowly as model learns more (gets closer to convergence).

- Cyclic learning rate: dynamically change the learning rate up and down between some threshold and potentially speed up training
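Two of these ideas are just small formulas, sketched below. The function names and all the numbers are assumptions for illustration: `exponential_decay` is one common way to schedule a shrinking learning rate, and `triangular_cyclic_lr` is the simplest cyclic shape (a straight ramp up from `base_lr` to `max_lr` and back down). Frameworks ship their own schedulers, so treat this as a picture of the idea rather than a drop-in replacement.

```python
def exponential_decay(initial_lr, decay_rate, step, decay_steps):
    """Learning rate scheduling: multiply the rate by decay_rate every decay_steps steps."""
    return initial_lr * decay_rate ** (step / decay_steps)

def triangular_cyclic_lr(step, base_lr, max_lr, step_size):
    """Cyclic learning rate: ramp linearly up for step_size steps, then back down."""
    position = step % (2 * step_size)
    fraction = position / step_size if position <= step_size else 2 - position / step_size
    return base_lr + (max_lr - base_lr) * fraction

for step in (0, 500, 1000, 2000):
    print(step, exponential_decay(0.1, 0.5, step, decay_steps=1000))

for step in (0, 50, 100, 150, 200):
    print(step, triangular_cyclic_lr(step, base_lr=0.001, max_lr=0.01, step_size=100))
```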

Other Hyperparameters you can tune

- Number of layers (deep learning networks)

- Batch size (how many examples of data your model sees at once). Use the largest batch size you can fit in your GPU memory. If in doubt, use batch size 32 (see Yann LeCun's tweet, and the paper attached of course).

- Number of trees (decision tree algorithms)

- Number of iterations (how many times the model goes through the data). Note: Instead of tuning iterations, use early-stopping instead.

- Many more... depends on the algorithm you're using. Search "[algorithm_name] hyperparameter tuning"


4. Analysis/Evaluation

Evaluation metrics for Classification and Regression type of Problems

Classification Type Problems : Do you want to predict whether something is one thing or another? Such as whether a customer will churn or not churn? Or whether a patient has heart disease or not? Note, there can be more than two things. Two classes is called binary classification, more than two classes is called multi-class classification. Multi-label is when an item can belong to more than one class.

For Classification type problems, Things you should pay attention to

a. False negatives : Model predicts negative, actually positive. In some cases, like email spam prediction, false negatives aren’t too much to worry about. But if a self-driving car's computer vision system predicts no pedestrian when there was one, this is not good.

b. False positives : Model predicts positive, actually negative. Predicting someone has heart disease when they don’t might seem okay. Better to be safe, right? Not if it negatively affects the person’s lifestyle or sets them on a treatment plan they don’t need.

c. True negatives : Model predicts negative, actually negative. This is good.

d. True positives : Model predicts positive, actually positive. This is good.

e. Precision : What proportion of positive predictions were actually correct? A model that produces no false positives has a precision of 1.0.

f. Recall : What proportion of actual positives were predicted correctly? A model that produces no false negatives has a recall of 1.0.

g. F1 Score : A combination of precision and recall. The closer to 1.0, the better.

h. Receiver operating characteristic (ROC) curve & Area under the curve (AUC) : The ROC curve is a plot comparing true positive and false positive rate. The AUC metric is the area under the ROC curve. A model whose predictions are 100% wrong has an AUC of 0.0, one whose predictions are 100% right has an AUC of 1.0.
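The first seven items above all come from four counts, so they are easy to compute by hand. This sketch uses made-up labels and helper names and assumes a binary problem where 1 is the positive class. F1 is computed as the harmonic mean of precision and recall, which is its usual definition. scikit-learn's `precision_score`, `recall_score` and `f1_score` do the same on real data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives and true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1); 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # hypothetical actual labels
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # hypothetical model predictions

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)
print(precision, recall, f1)
```

Notice that accuracy here would be 7 out of 10, even though the model missed half of the actual positives. That is why precision and recall are worth checking, especially on imbalanced data.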

Regression Type Problems : Do you want to predict a specific number of something? Such as how much a house will sell for? Or how many customers will visit your site next month?

For Regression, Things you should pay attention to

a. MSE (mean squared error) : The average squared difference between the estimated values and the actual value

b. MAE (mean absolute error) : The average absolute difference between your model's predictions and the actual numbers.

c. RMSE (root mean square error) : The square root of the average of squared differences between your model's predictions and the actual numbers.
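These three metrics are each one line of arithmetic. The sketch below uses made-up numbers and helper names. scikit-learn provides `mean_squared_error` and `mean_absolute_error` for real use.

```python
from math import sqrt

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of MSE, in the same units as the target."""
    return sqrt(mse(y_true, y_pred))

y_true = [3.0, 5.0, 2.0, 7.0]  # hypothetical actual values
y_pred = [2.0, 5.0, 4.0, 8.0]  # hypothetical predictions

print(mse(y_true, y_pred), mae(y_true, y_pred), rmse(y_true, y_pred))
```

Because MSE and RMSE square the errors, one large miss hurts them more than several small ones, while MAE treats every unit of error the same.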

Task-based metric (create one based on your specific problem) : For example, for self-driving cars, you might want to know number of disengagements.

After evaluation, look at which features of your data mattered most, as this often plays an important role too.

Feature importance : Useful for model explainability, for example, telling someone, the number of bedrooms is most important when predicting a house price.

Types of features you should know

a. Categorical features : One or the other(s). For example, in our heart disease problem, the sex of the patient. Or for an online store, whether or not someone has made a purchase.

b. Continuous (or numerical) features : A numerical value such as average heart rate or the number of times logged in.

c. Derived features : Features you create from the data. Often referred to as feature engineering. Feature engineering is how a subject matter expert takes their knowledge and encodes it into the data. You might combine the number of times logged in with timestamps to make a feature called time since last login. Or turn dates from numbers into “is a weekday (yes)” and “is a weekday (no)”.

Bias/variance trade-off : high bias results in underfitting and a lack of generalization to new samples, high variance results in overfitting due to the model finding patterns in the data which are actually random noise.

Now, only two steps left...

5. Serve model (Deploying a model)

- Put the model into production and see how it goes. Evaluation metrics in vitro (in a notebook) are great but until it's in production, you won't know how it performs for real. Tools you can use to deploy your machine learning model:

a. TensorFlow Serving (part of TFX, TensorFlow Extended)

b. PyTorch Serving (TorchServe)

c. Google AI Platform: make your model available as a REST API

d. Sagemaker

6. Retrain model

- See how the model performs after serving (or prior to serving) based on various evaluation metrics and revisit the above steps as required (remember, machine learning is experimental, so this is where you'll want to track your data and experiments).

We hope this post is helpful to you. If you think this article will be helpful for others, please take 30 seconds to share it with them. And if you think anything is still missing, please tell us on any of our social media channels. That's all for now. This post will be updated as we come across new things. Stay tuned, because the upcoming article will be either "A curated list of best machine learning tools you should use as a machine learning enthusiast" or "Programming Meme Factory - A curated list of some of the best coding and programming memes".
AI/ML
April 01, 2021