# Build a classifier to predict label of unknown cases based on known data

A Machine Learning approach using Python (sklearn-KNeighborsClassifier)

Classification is a supervised learning model. Using classification, we can categorize unknown items into a discrete set of categories and “classes”.

**Most common use cases of classification algorithms:**

- loan default classification
*(*Which customers will have problem repaying loans*)* - To predict a category where a customer belong
- Churn detection: To predict if a customer switches to another product/brand
- Email / document classification

In this blog, I will limit to ‘k-Nearest Neighbors’ Algorithm

**k-Nearest Neighbors (k-NN) Algorithm:**

k-NN method classifies cases based on their similarity to other cases, the logic of the algorithm is to identify cases nearer to each other (neighbors)

The distance between two cases is the measure of “dissimilarity / similarity”. There are many options to calculate distance between two cases, most popular among all is “Euclid distance”

Algorithm accuracy is depended on the ** k-value** assigned.

- A low k-value would result in a less accurate model (causes over fitting)
- A high k-value would result in a high accurate model (causes under fitting)

** Solution: **Use

*train/test split*approach to reserve part of data to test model accuracy and pick a k value with highest accuracy

**Data Exploration and Normalization:**

In this blog, I will be using customer segmentation data set by a telecommunications provider. The goal is to predict customer category groups based on usage patterns.

**Data Normalization:**

Data standardization give data zero mean and unit variance. We must perform normalization since k-NN algorithm is based on distance of cases.

**Split data into train and test sets (Mutually exclusive):**

I would recommend using mutually exclusive, since this approach would increase “Out-of-sample accuracy”

**How to choose a good k-value:**

Finding a good k-value is an iterative process. To pick a good k-value, we need to measure mean accuracy, standard deviation accuracy and F-1 Score (Confusion Matrix)

**Build k-Nearest Neighbors (k-NN) Classifier:**

Since we estimated k = 9 gives best accuracy for the training and test data, we would use k value to build our case segmentation classifier.

Now we have a classifier ready to be applied to unknown data …

Happy Coding!!

You may find more code repositories on Machine Learning and AI algorithms at my Git Hub https://github.com/krishnakesari