Decision trees in machine learning display the stepwise process that the model uses to break the dataset into smaller and smaller subsets, eventually arriving at a prediction. Decision trees fall under supervised learning and can be used for both classification and regression problems. As with other supervised learning models, predictions are made from a set of feature variables and predetermined thresholds; decision trees can rely on several splitting criteria, but in this article we will focus on entropy and the Gini index.
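To make the two criteria concrete, here is a minimal pure-Python sketch of entropy and Gini impurity computed from a list of class labels (the label values are made up for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

pure = ["yes"] * 4                   # a perfectly pure node
mixed = ["yes", "yes", "no", "no"]   # a maximally impure 50/50 split

print(entropy(mixed), gini(mixed))   # 1.0 0.5
```

A pure node scores zero under both measures; a tree-building algorithm picks the split that reduces the chosen impurity the most.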


As mentioned above, entropy is one…

The end goal of machine learning is to create a model that can solve a business problem or adequately address a particular question. The model in this case is a general function that takes in defined parameters to make predictions.

When tackling a machine learning problem it is advised to include a validation step. This step uses validation data that has been set aside to assess how the model behaves on unseen data. The model's performance on the validation data allows us to fine-tune the model's parameters. …
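As a sketch of setting data aside, here is a minimal pure-Python train/validation split; the 20% validation fraction and fixed seed are illustrative defaults, not prescribed values:

```python
import random

def train_validation_split(rows, val_fraction=0.2, seed=42):
    """Shuffle rows and hold out a fraction as a validation set."""
    rng = random.Random(seed)       # fixed seed keeps the split reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

data = list(range(100))
train, val = train_validation_split(data)
print(len(train), len(val))  # 80 20
```

The model is then fitted on the training portion only, and the held-out portion stands in for unseen data when tuning parameters.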

One of the aims of database design is to eliminate redundancies in a relational model. These redundancies may not be fully eliminated, but a proper design can reduce instances of redundancy.

Anomalies caused by redundancy tend to surface during common operations such as updates, insertions and deletions; note that these operations introduce new data into the database or remove data that was previously there. Data normalization methods include: First Normal Form, Second Normal Form, Third Normal Form, Boyce-Codd Normal Form, Fourth Normal Form and Fifth Normal Form.
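To illustrate why redundancy causes update anomalies, here is a small pure-Python sketch (the customer and order data are entirely made up) that decomposes a flat table so each fact is stored once, in the spirit of the normal forms above:

```python
# A denormalized orders table repeats each customer's city on every row,
# so changing one customer's city means updating many rows (an update anomaly).
orders_flat = [
    {"order_id": 1, "customer": "Asha", "city": "Nairobi", "item": "pen"},
    {"order_id": 2, "customer": "Asha", "city": "Nairobi", "item": "book"},
    {"order_id": 3, "customer": "Brian", "city": "Kisumu", "item": "pen"},
]

# Decompose: customer attributes move into their own table, keyed by customer,
# and the orders table keeps only the foreign key reference.
customers = {row["customer"]: {"city": row["city"]} for row in orders_flat}
orders = [{"order_id": r["order_id"], "customer": r["customer"], "item": r["item"]}
          for r in orders_flat]

# Now an address change is a single update instead of one per order row.
customers["Asha"]["city"] = "Mombasa"
```

In a real relational database the same decomposition would be expressed as two tables linked by a foreign key.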

As we know, part of the feature engineering process is dealing with missing values in our features; these missing values affect the predictive ability of our model. In some cases the missing data may itself be informative.

Depending on the severity of the missing data, we have various avenues to address the problem. These solutions largely depend on the type of variables in our data and the quantity of missing values. We will address these two simultaneously.

Imputation is one of the methods for dealing with missing values; imputation simply means replacing the…
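As a sketch of the idea, here is a minimal pure-Python mean/mode imputation; the age and colour values are made up for illustration, and `None` stands in for a missing entry:

```python
from statistics import mean, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean (numeric data) or the
    mode (categorical data) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 40, None]
colours = ["red", "blue", None, "red"]

print(impute(ages))                      # missing ages filled with the mean, 32
print(impute(colours, strategy="mode"))  # missing colour filled with 'red'
```

Libraries such as scikit-learn offer the same idea in a fitted-transformer form, along with more sophisticated strategies.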

To understand the bias-variance trade-off we must first have a concrete idea of its two elements; this understanding creates a platform for addressing the issues caused by bias and variance.


When creating models, the training data we use has a big impact on the predicted results. To make those predictions tractable, the model makes generalizations and simplifying assumptions.

This simplification may result in large or minor differences between our actual and predicted values; this effect is what we call…

Given the abundance of different classification models, we need methods to assess the validity of these models in their various use cases. As a reminder, some of the more popular classification algorithms include: Support Vector Machines (SVM), Logistic Regression, Naïve Bayes, Decision Trees and Random Forest.

Depending on the aim of a classification problem and its parameters, for example binary vs. multiclass classification, one model may perform better than another. If that is the case, how do we conclude that one model outperforms another? One solution is a confusion matrix.
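A confusion matrix simply counts how often each actual class was predicted as each possible class. Here is a minimal pure-Python sketch on a made-up binary spam/ham example, with precision and recall derived from the counts:

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Nested dict of counts: matrix[actual_class][predicted_class]."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}

actual    = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham",  "ham", "spam", "spam", "ham"]

cm = confusion_matrix(actual, predicted, ["spam", "ham"])

# Treating "spam" as the positive class:
tp = cm["spam"]["spam"]   # true positives
fp = cm["ham"]["spam"]    # false positives
fn = cm["spam"]["ham"]    # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)
```

Running two candidate models through the same matrix makes their error patterns directly comparable, which is exactly the comparison the paragraph above calls for.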


Naftal Teddy Kerecha

I have a strong interest in the field of data science; currently I am skilling up on Data Engineering and Cloud-related concepts.
