2024-01-02

Housing Price Prediction using Linear Regression

hands on machine learning

25 min read


INTRODUCTION

In this blog, we will look at the housing dataset and use linear regression to predict the price of a house based on different factors.

Importing Essential Libraries:

# numpy for numerical computation, pandas for data handling,
# matplotlib and seaborn for visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Ignoring Warnings:

import warnings
warnings.filterwarnings('ignore')

Reading the housing data:

housing = pd.read_csv('Housing.csv')
housing.head()


DATA INSPECTION:

# Number of rows and columns in the dataset
housing.shape
(545, 13)


# Columns, non-null counts and their data types
housing.info()


# Descriptive statistics of the data

housing.describe().T


DATA CLEANING:

# Checking the total number of null values in each column of the dataset
housing.isnull().sum()


Outlier Analysis:


# Box plots of the numerical columns to spot outliers
fig, axs = plt.subplots(2, 3, figsize=(10, 5))
sns.boxplot(housing['price'], ax=axs[0, 0])
sns.boxplot(housing['area'], ax=axs[0, 1])
sns.boxplot(housing['bedrooms'], ax=axs[0, 2])
sns.boxplot(housing['bathrooms'], ax=axs[1, 0])
sns.boxplot(housing['stories'], ax=axs[1, 1])
sns.boxplot(housing['parking'], ax=axs[1, 2])
plt.tight_layout()
plt.show()


We can clearly see from the box plots above (which show the distribution of each column) that price and area have a significant number of outliers.

# Outlier treatment for price using the 1.5*IQR rule:
Q1 = housing['price'].quantile(0.25)
Q3 = housing['price'].quantile(0.75)
IQR = Q3 - Q1
housing = housing[(housing['price'] >= Q1 - 1.5*IQR) & (housing['price'] <= Q3 + 1.5*IQR)]


# Outlier treatment for area:

Q1 = housing['area'].quantile(0.25)
Q3 = housing['area'].quantile(0.75)
IQR = Q3 - Q1
housing = housing[(housing['area'] >= Q1 - 1.5*IQR) & (housing['area'] <= Q3 + 1.5*IQR)]


# The rows containing price and area outliers have been removed, reducing the dataset from 545 to 517 rows.
housing.shape
(517, 13)
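The same IQR filter is applied twice above. As a small sketch (a hypothetical refactor added here, not code from the original post), it could be wrapped in a helper:

# Hypothetical helper that applies the 1.5*IQR rule to any column of a DataFrame.
def remove_iqr_outliers(df, column, k=1.5):
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[(df[column] >= q1 - k*iqr) & (df[column] <= q3 + k*iqr)]

# Equivalent to the two filters above:
# housing = remove_iqr_outliers(housing, 'price')
# housing = remove_iqr_outliers(housing, 'area')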

housing.info()


Analysis after treatment of outliers:

plt.boxplot(housing.price)
plt.show()
plt.boxplot(housing.area)
plt.show()


EXPLORATORY DATA ANALYSIS:

Let us now understand the data:

1. Checking for multicollinearity among the independent variables (a numeric VIF check is sketched just after this list).

2. How each independent variable is related to the target variable.
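
As a sketch of the multicollinearity check in point 1 (an addition to the original post, assuming only the numeric columns available at this stage), the variance inflation factor of each numeric feature can be computed by regressing it on the others:

# VIF_i = 1 / (1 - R_i^2), where R_i^2 is the fit of feature i regressed on the rest.
# Values well above 5-10 usually indicate problematic multicollinearity.
from sklearn.linear_model import LinearRegression

num_cols = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking']
for col in num_cols:
    others = [c for c in num_cols if c != col]
    r2 = LinearRegression().fit(housing[others], housing[col]).score(housing[others], housing[col])
    print(f'{col}: VIF = {1 / (1 - r2):.2f}')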

# Visualizing the numerical variables with a pair plot
sns.pairplot(housing)
plt.show()


# Visualizing the categorical variables: box plots of price against each categorical column
for col in housing.columns:
    if housing[col].dtype == object:
        sns.boxplot(x=col, y='price', data=housing)
        plt.show()


DATA PREPARATION & PREPROCESSING:

Many of the independent variables are categorical. Linear regression finds the best-fit line through numerical computation, so these categorical variables must be converted into numerical values before the regression analysis:

# From the box plots above we can see that, except for furnishingstatus, every categorical
# variable takes only yes/no values, so we map yes to 1 and no to 0.

for col in housing.columns:
    if housing[col].dtype == object and col != 'furnishingstatus':
        housing[col] = housing[col].replace({'yes': 1, 'no': 0})

housing.head()


# All the categorical variables are now converted to 0/1, except furnishingstatus.
# Let us look at its unique values and their counts.

housing.furnishingstatus.value_counts()
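
One alternative worth noting (a sketch added here, not what the rest of this post does) is to one-hot encode the three levels with pd.get_dummies instead of imposing an integer order on them:

# Hypothetical one-hot encoding of furnishingstatus; drop_first=True avoids the dummy-variable trap.
dummies = pd.get_dummies(housing['furnishingstatus'], prefix='furnishing', drop_first=True)
housing_onehot = pd.concat([housing.drop(columns='furnishingstatus'), dummies], axis=1)
housing_onehot.head()

The post itself sticks with the simple integer mapping below.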


# Mapping the three levels of furnishingstatus to integers: semi-furnished -> 0, unfurnished -> 1, furnished -> 2.
housing['furnishingstatus'] = housing['furnishingstatus'].replace({'semi-furnished': 0, 'unfurnished': 1, 'furnished': 2})

housing.head()


# Correlation matrix as a heatmap to see how the variables relate to each other
sns.heatmap(housing.corr(), cmap='Greens')
plt.show()
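
To read the correlations with price as numbers rather than from the colours alone, a quick sketch (an addition to the original post):

# Correlation of every variable with price, strongest first
housing.corr()['price'].sort_values(ascending=False)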


Rescaling the numerical variables:

from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler rescales each column to the [0, 1] range: x' = (x - min) / (max - min),
# so that features on very different scales (area, bedrooms, price, ...) become comparable.
sc = MinMaxScaler()
num_vars = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking', 'price']
housing[num_vars] = sc.fit_transform(housing[num_vars])

housing.head()


housing.describe()


PREPARING THE INPUT X AND OUTPUT Y VARIABLES AND SPLITTING THE DATA INTO TRAIN AND TEST SETS:

# Independent variables: every column except price
x = housing.iloc[:, 1:]
x


# Target variable: price
y = housing.iloc[:, [0]]
y


from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=180)

x_train.shape
(361, 12)
x_test.shape
(156, 12)
y_train.shape
(361, 1)
y_test.shape
(156, 1)

Importing linear regression and training (fitting) the model:

from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()
linear_model.fit(x_train, y_train)

Predicting the output variable y with the model:

y_pred = linear_model.predict(x_test)


# The fitted model has the form Y = m1*x1 + m2*x2 + ... + m12*x12 + c.
# The slopes m1..m12 come from linear_model.coef_ and the single intercept c from linear_model.intercept_.
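
As a quick sanity check (a sketch added here, not part of the original post), the predictions can be rebuilt by hand from coef_ and intercept_:

# X @ m + c should reproduce model.predict(X) exactly.
manual_pred = x_test.values @ linear_model.coef_.T + linear_model.intercept_
print(np.allclose(manual_pred, y_pred))  # expected: True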

Finding the linear model coefficients:

coef = linear_model.coef_            # the slopes m1..m12
cols = housing.columns[1:]           # feature names (everything except price)
coefficients = pd.DataFrame(coef, columns=cols)
coefficients.T


# Area clearly has the highest impact on the price of the house. We neglect mainroad, guestroom,
# basement, hotwaterheating and furnishingstatus, since their effects on price are comparatively insignificant.

So, approximately: price = 0.31*area + 0.20*bathrooms + 0.16*stories + 0.12*airconditioning (on the 0-1 scaled variables).
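
As a follow-up sketch (an addition, not part of the original analysis), one could refit the model on just these four features and check how much of the test performance they retain:

# Hypothetical reduced model using only the four strongest features.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

top_features = ['area', 'bathrooms', 'stories', 'airconditioning']
reduced_model = LinearRegression().fit(x_train[top_features], y_train)
print('Reduced-model R^2:', r2_score(y_test, reduced_model.predict(x_test[top_features])))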

EVALUATION METRICS FOR MODEL PERFORMANCE:

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R^2 Score:', metrics.r2_score(y_test, y_pred))
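
As a sketch of what these numbers mean (an addition to the original post), the same metrics can be computed directly with numpy:

# MAE, MSE, RMSE and R^2 written out explicitly; these should match the sklearn values above.
errors = y_test.values - y_pred
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_test.values - y_test.values.mean()) ** 2)
print(mae, mse, rmse, r2)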
