2024-01-20

Categorizing Telecustomers with KNN

Hands-on Machine Learning

25 min read

INTRODUCTION

In this blog, we will analyze the 'Telecustomers' dataset and use the k-nearest neighbors (KNN) algorithm to categorize customers into four classes, labeled 1, 2, 3, and 4.

Importing Necessary Packages:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

Reading the data:

df = pd.read_csv('Telecustomers.csv')
df.head()

Data Inspection:

df.columns

df.shape
(1000, 12)
df['custcat'].value_counts()

The categories of the customers seem to be quite balanced.
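
To see this balance visually, a quick count plot helps; a minimal sketch using the seaborn and matplotlib imports from above (assuming df is loaded as shown):

# One bar per category; four bars of similar height confirm the balance.
sns.countplot(x='custcat', data=df)
plt.title('Customer Category Counts')
plt.show()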

Data Preparation:

X = df.drop(columns=['custcat'])
X

y = df['custcat']
y

Data Preprocessing:

from sklearn import preprocessing
X = preprocessing.StandardScaler().fit_transform(X.astype(float))
X

Data Splitting:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
X_train.shape
(800, 11)
X_test.shape
(200, 11)
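
One caveat: above, the scaler was fit on all of X before splitting, which leaks test-set statistics into the scaling. A stricter variant splits the raw features first and fits the scaler on the training split only; a minimal sketch reusing the names defined above (the difference is usually small, but it is the safer habit):

from sklearn.preprocessing import StandardScaler

# Split the raw (unscaled) features first...
X_raw = df.drop(columns=['custcat']).astype(float)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=4)

# ...then fit the scaler on the training rows only and apply it to both halves.
scaler = StandardScaler().fit(X_train_raw)
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)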

KNN Model Building & Training:

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
k = 4
neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)

Model's Prediction:

y_pred = neigh.predict(X_test)
print('Accuracy is:', metrics.accuracy_score(y_test, y_pred))
Accuracy is: 0.32
print(y_pred)

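Accuracy alone hides which categories the model confuses with each other. A quick way to dig deeper is sklearn's confusion matrix and classification report; a minimal sketch using the predictions above:

from sklearn.metrics import confusion_matrix, classification_report

# Rows are the true categories 1-4, columns the predicted ones;
# off-diagonal counts show which pairs of classes get mixed up.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))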

Error Rate for Different K's:

error_rate = []
for i in range(1, 10):
    knn = KNeighborsClassifier(n_neighbors=i).fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    error_rate.append(np.mean(y_test != y_pred))
print(error_rate)
[0.7, 0.71, 0.685, 0.68, 0.685, 0.69, 0.665, 0.675, 0.66]
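
A single 80/20 split can make these estimates noisy. An optional, more stable alternative is k-fold cross-validation; a sketch using sklearn's cross_val_score on the scaled X and y (note it may favour a slightly different K than the single split does):

from sklearn.model_selection import cross_val_score

cv_error = []
for i in range(1, 10):
    # Average the error rate over 5 folds instead of one fixed test set.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=i), X, y, cv=5)
    cv_error.append(1 - scores.mean())
print(cv_error)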

Visualising Error Rate vs. K:

plt.figure(figsize=(10,6))
plt.plot(range(1,10), error_rate, color='blue', linestyle='dashed',
         marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

print("Minimum error:-",min(error_rate),"at K =",(error_rate.index(min(error_rate))+1))
Minimum error:- 0.66 at K = 9

Since error rate and accuracy are complements (accuracy = 1 − error rate), the error being minimal at K=9 means the accuracy will be maximal at K=9; a quick numeric check of this appears after the accuracy results below.

Accuracy for Different K's:

acc = []
for i in range(1, 10):
    neigh = KNeighborsClassifier(n_neighbors=i).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    acc.append(metrics.accuracy_score(y_test, yhat))

Visualising Accuracy vs. K:

plt.figure(figsize=(10,6))
plt.plot(range(1,10), acc, color='blue', linestyle='dashed',
         marker='o', markerfacecolor='red', markersize=10)
plt.title('Accuracy vs. K Value')
plt.xlabel('K')
plt.ylabel('Accuracy')

print("Maximum accuracy:-",max(acc),"at K =",acc.index(max(acc))+1)
Maximum accuracy:- 0.34 at K = 9
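
As a quick sanity check of the complement relationship noted earlier, the two curves mirror each other exactly on this test split:

# Each accuracy value should equal 1 minus the matching error rate.
print(np.allclose(np.array(acc), 1 - np.array(error_rate)))   # expect True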

Conclusion:

We therefore set K to 9 for categorizing new telecustomers. For each new customer, the KNN algorithm finds the 9 nearest neighbours in the scaled feature space and assigns the category held by the majority of them. With four roughly balanced classes, the resulting 0.34 accuracy beats the ~25% random-guess baseline, though not by a wide margin.
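
As a closing sketch, this is how the chosen model would be applied to a fresh customer. The row below is a stand-in: a real new customer's raw features would first need to be scaled with the same scaler as the training data.

# Refit the final model with the chosen K.
final_model = KNeighborsClassifier(n_neighbors=9).fit(X_train, y_train)

# Stand-in for one freshly scaled customer row (11 features).
new_customer = X_test[:1]
print(final_model.predict(new_customer))   # the predicted category, one of 1-4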