Build a classifier to predict label of unknown cases based on known data

A machine learning approach using Python (sklearn's KNeighborsClassifier)

Classification is a supervised learning technique. Using classification, we can categorize unknown items into a discrete set of categories or "classes".

The most common use cases of classification algorithms are:

  1. Loan default prediction (which customers will have problems repaying loans)
  2. Predicting the category a customer belongs to
  3. Churn detection: predicting whether a customer will switch to another product/brand
  4. Email / document classification

In this blog, I will limit the discussion to the 'k-Nearest Neighbors' algorithm.

k-Nearest Neighbors (k-NN) Algorithm:

The k-NN method classifies cases based on their similarity to other cases; the logic of the algorithm is to identify the cases nearest to each other (neighbors).

The distance between two cases is a measure of their dissimilarity/similarity. There are many options for calculating the distance between two cases; the most popular is the Euclidean distance.
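As a quick illustration, the Euclidean distance between two cases (feature vectors) can be computed directly with NumPy; the two vectors below are made-up examples:

```python
import numpy as np

# two hypothetical cases, each described by three features
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean distance: square root of the sum of squared differences
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # 5.0
```

scikit-learn's KNeighborsClassifier uses this metric by default (`metric='minkowski'` with `p=2`).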

The algorithm's accuracy depends on the k-value chosen.

  1. A very low k-value makes the model sensitive to noise in the training data (causes over-fitting)
  2. A very high k-value makes the model overly general (causes under-fitting)

Solution: Use a train/test split approach to reserve part of the data for testing model accuracy, and pick the k-value with the highest out-of-sample accuracy.

Data Exploration and Normalization:

In this blog, I will be using customer segmentation data set by a telecommunications provider. The goal is to predict customer category groups based on usage patterns.
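A first look at the data can be done with pandas; the tiny DataFrame below is a stand-in for the telecom data set, and the column names (`income`, `tenure`, `custcat`) are assumptions for illustration:

```python
import pandas as pd

# a small stand-in for the customer segmentation data set;
# column names are hypothetical
df = pd.DataFrame({
    "income":  [64, 136, 33, 76],
    "tenure":  [13, 11, 68, 33],
    "custcat": [1, 4, 3, 1],   # customer category (the label to predict)
})

# distribution of customer categories
print(df["custcat"].value_counts())

# summary statistics for income
print(df["income"].describe())
```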

[Figure: Income distribution among customers]

Data Normalization:

Data standardization gives the data zero mean and unit variance. We must normalize the data since the k-NN algorithm is based on the distance between cases.
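A minimal sketch of standardization with scikit-learn's StandardScaler, using a toy feature matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy feature matrix: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# each column now has zero mean and unit variance
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

Without this step, the feature with the larger raw scale (here, the second column) would dominate the distance computation.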

Split data into train and test sets (Mutually exclusive):

I would recommend using mutually exclusive train and test sets, since this approach gives a more realistic estimate of "out-of-sample accuracy".
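The split itself is one call to scikit-learn's train_test_split; the toy arrays below just illustrate the shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 20 cases, 2 features, 4 classes
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 4

# hold out 20% of the data as a mutually exclusive test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

print(X_train.shape, X_test.shape)  # (16, 2) (4, 2)
```

Fixing `random_state` makes the split reproducible across runs.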

How to choose a good k-value:

Finding a good k-value is an iterative process. To pick a good k-value, we measure the mean accuracy, its standard deviation, and the F1 score (derived from the confusion matrix) for each candidate k.
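The iteration can be sketched as a loop over candidate k-values, keeping the test-set accuracy of each model. The Iris data set stands in for the customer data here:

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris stands in for the customer segmentation data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

k_max = 10
mean_acc = np.zeros(k_max)
for k in range(1, k_max + 1):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    y_hat = model.predict(X_test)
    mean_acc[k - 1] = metrics.accuracy_score(y_test, y_hat)

best_k = int(mean_acc.argmax()) + 1
print("best k =", best_k, "with accuracy", mean_acc.max())
```

Plotting `mean_acc` against k (and adding `metrics.f1_score` with `average='weighted'` for multi-class labels) makes the choice easier to justify.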

Build k-Nearest Neighbors (k-NN) Classifier:

Since we estimated that k = 9 gives the best accuracy on the training and test data, we will use this value to build our customer segmentation classifier.
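The final classifier is a single KNeighborsClassifier fit with the chosen k; again, Iris stands in for the customer data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris stands in for the customer segmentation data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

# build the classifier with the k-value estimated above
knn = KNeighborsClassifier(n_neighbors=9).fit(X_train, y_train)

# classify "unknown" cases and check out-of-sample accuracy
predictions = knn.predict(X_test)
print("test accuracy:", knn.score(X_test, y_test))
```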

Now we have a classifier ready to be applied to unknown data …

Happy Coding!!

You may find more code repositories on Machine Learning and AI algorithms on my GitHub.




Data Scientist, The World Bank. Originally I was a Molecular Biologist, then I became a Bureaucrat and a Data Scientist
