[KIT] How to Build a simple prediction model in machine learning

uncategorized

2021.02.13 / Last modified at 2021.03.22

I am an absolute beginner to machine learning and i think this article will be very simple that people who just start in machine learning can understand what i am doing

i just start machine learning so i will apply all the concept that i have learn that relate to regression model,data cleaning,feature engineering in this project

+Importance library that will be use

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

+Load Data from csv file to dataFrame

df=pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")

I have import pandas as pd so pd.read_csv(filepath) that mean i want to convert from a csv file to dataframe

+Exploring data

What i mean by exploring the data is i want to analyze all factor that affect the result also I want to know which column is more importance and which column i should remove from dataframe. Data cleaning and outlier removing also include in this process.

Now let list all the column of a dataframe

k=sorted(list(df.columns))
k

as we see there are 80 columns are indepandance variable and 1 depandance variable(SalePrice) in our DataFrame if we put all this columns into our model our model will not optimize because some of the columns may not determine the price so we need to remove it that will increase our model’s performance

To identified we which are the columns that we should use to train our model i recommend you to look this Notebook: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

after doing this i’ve seen some solumns that interesting

1.BsmtQual:it is the quality of the basement

2.TotalBsmt: Total square feet of basement area

3.1stFlrSF: First Floor square feet

4.2ndFlrSF: Second floor square feet

5.GrLivArea: Above grade (ground) living area square feet

6.GarageCars:Size of garage in car capacity

7 .KitchenAbvGr: Kitchens above grade

8.TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

9.MiscFeature : Miscellaneous feature not covered in other categories

df1=df[["BsmtQual","TotalBsmtSF","1stFlrSF","2ndFlrSF","GrLivArea","BedroomAbvGr","KitchenAbvGr","YearBuilt","TotRmsAbvGrd","GarageType","GarageCars","MiscFeature","MiscVal","SalePrice"]]
df1

Now I removed all the column that i think it is not importance and only keep the column that i think they are impact the price. This our DataFrame

Data Cleaning and Feature Engineering

Data Cleaning

Data cleaning refers to preparing the data for our model by removing or modify the data that not complete or irrelevant to our model
It help in increasing the accuracy of our model
Most of the time, almost 80% in modeling process is dedicate to data cleaning

Feature Engineering

it is the way we apply our domain knowledge or business knowledge about the data to remove outlier or erorr data. ex: if we know that a room in a house is maximum 300 square feet but in the data is more than that so we will remove from the dataframe
it is also include in data cleaning

Now let see How much data we have:

df1.shape[0]

Now let see How many row that has null value

df1[df1["MiscFeature"].isna()].shape[0

As we see there are a small amount(1406) of data that has Miscellaneous feature so we can consider it as outlier then remove them from our dataframe the increase our model performance

df3=df2.drop(["MiscFeature","MiscVal"],axis="columns"

+Remove outlier

df3.SalePrice.describe()

Output:

count      1406.000000
mean     182046.410384
std       80084.136570
min       34900.000000
25%      130000.000000
50%      164250.000000
75%      215000.000000
max      755000.000000
Name: SalePrice, dtype: float64

df3["price_per_sqt"]=df3.SalePrice/df3.GrLivArea

df3.price_per_sqt.describe()

Output

count    1406.000000
mean      120.947425
std        31.538984
min        28.358738
25%       100.332272
50%       120.344258
75%       139.045487
max       276.250881
Name: price_per_sqt, dtype: float64

+Visualization

it is very importance because as we are human picture get us to understand thing very qucikly
For me i think data visualization is like memes
“if we need to explain our visualization that mean our visualization is not good enough”

To make clear that those columns have a strong affect on the saleprice i am going to visualize those columns with sale price

In [15]:

def year_price(df):
    plt.scatter(df.YearBuilt,df.price_per_sqt)
    plt.xlabel("YearBuilt")
    plt.ylabel("Price_per_sqft")
    plt.legend()

it seem that year built and price has a positive relationship

now let see the graph

    plt.scatter(df3.TotalBsmtSF,df3.price_per_sqt)
    plt.xlabel("Total basement area")
    plt.ylabel("Price_per_sqft")
    plt.rcParams["figure.figsize"] = (15, 10)
    plt.legend()

As we see in the graph there are some outlier so we need to remove it to increase our model performance

import numpy as np
def remove_out(df):
    remove_data=pd.DataFrame()
    for year,year_df in df.groupby("YearBuilt"):
        m=np.mean(year_df.price_per_sqt)
        s=np.std(year_df.price_per_sqt)
        out=year_df[(year_df.price_per_sqt>(m-s)) & (year_df.price_per_sqt<(m+s))]
        remove_data=pd.concat([remove_data,out],ignore_index=True)
    return remove_data
df4=remove_out(df3)
df4