Introduction
A home is often the most expensive purchase a person makes, and there are many features to consider when buying one. Washington, D.C. is the capital of the United States of America. If you are thinking about moving to the nation’s capital, you are probably wondering whether you can afford to settle down there, because home prices in the District are notoriously high. The population is around 702,455 as of today and has been increasing steadily since 2002. According to the National Association of Realtors, the median sales price of a single-family home in the Washington, D.C. metro area is $417,400. That is more expensive than both New York City ($403,900) and Philadelphia ($224,600).
Figure 1: Properties in DC
Problem Statement
With this data in hand, I am trying to accomplish the following tasks –
- Perform descriptive statistics of the data
- Perform exploratory data analysis to understand the behavior of various features (X variables) and how they have an impact on Price (Y variable).
- Predict the price of residential properties using regression models and a neural network.
Data
The dataset has 158957 rows and 48 columns.
I am predicting the Price of the house.
Therefore, Price is the target variable (y) and all other variables are features (X variables).
Figure 2: The structure of the dataset
Overview of the Process Flow
Figure 3: Overview of the process flow
Descriptive Statistics
Distribution of Price
Figure 4: Histogram of target variable “Price”
Descriptive statistics
To better explore the mean, minimum and maximum prices, I have performed descriptive statistics on the price column and applied a lambda function to show the statistics in simple numbers as opposed to scientific notation.
Figure 5: Descriptive statistics of target variable “Price”
Inference: Both figures above show that property prices are right-skewed, with a mean of $654,932. Most houses are in the range of $100k to $250k; the high end is around $250k to $700k, with a sparse distribution. There are also some outliers, which I will address during the pre-processing stage.
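The descriptive statistics step described above can be sketched as follows. This is a minimal illustration, not the original notebook code: the DataFrame `df` and its `PRICE` values are made-up stand-ins for the real dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; these PRICE values are made up.
df = pd.DataFrame({"PRICE": [120000.0, 250000.0, 410000.0, 675000.0, 2500000.0]})

# describe() returns count, mean, std, min, quartiles and max.
# The lambda formats each statistic as a plain number with thousands
# separators instead of scientific notation.
price_stats = df["PRICE"].describe().apply(lambda x: f"{x:,.2f}")
print(price_stats)
```

Formatting through a lambda turns the statistics into strings, so it is best applied only for display and not before further calculations.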
Exploratory Data Analysis
Before diving into building models, I have tried to explore the variables through some plots and tried to infer a few insights on the relationship between price and other independent variables.
Plotting the location using Folium Map
Figure 6: Location of the properties in DC
Sales over Time
To begin with, I wanted to check how sales have changed over the years. So, I plotted a line graph of the number of sales over time (1986-2018), and it can be observed that sales have been decreasing in recent years.
Figure 7: Line graph of years vs the number of residentials
Price over Time
After plotting the sales, I observed that sales in 2018 were low and had been decreasing. To check whether price has anything to do with the decrease in sales, I plotted a line graph of Price vs Years (1986-2018). The graph below shows that prices have been increasing over the years. The last value noted was for 2018, with an average price of $860.6289k. This may explain why sales have been decreasing in the last couple of years.
Figure 8: insight on price change over the years (1986-2018)
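The yearly sales counts and average prices behind these two line graphs could be computed along the following lines. This is a sketch under assumptions: the column names `SALEDATE` and `PRICE` and the toy rows are illustrative and may not match the real dataset.

```python
import pandas as pd

# Toy data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "SALEDATE": ["2016-03-01", "2016-07-15", "2017-01-20", "2018-05-05"],
    "PRICE": [500000.0, 700000.0, 800000.0, 900000.0],
})

# Extract the sale year from the date column.
df["SALE_YEAR"] = pd.to_datetime(df["SALEDATE"]).dt.year

# One row per year: number of sales and average price.
yearly = df.groupby("SALE_YEAR")["PRICE"].agg(num_sales="count", avg_price="mean")
print(yearly)
```

Each column of `yearly` can then be drawn as a line graph with the plotting library of your choice.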
Price vs Quadrants
Washington DC is divided into four quadrants. Each quadrant is further divided into wards, and each ward is divided into assessment neighborhoods. Of the four quadrants, the highest number of sales comes from the North-Western quadrant, with 13832 sales from 1986-2018 and an average price of $694,502. It can be observed that the average price varies with the quadrant.
Figure 9: Distribution of Price (left-image) and number of Residentials (right-image) based on Quadrants.
Price vs Wards
Each quadrant is further divided into wards.
Figure 10: Number of Residentials (left-image) and Distribution of Price (right-image) based on Wards.
Inference: Ward 6 has the highest number of sales. As the right-hand image of Figure 10 shows, Ward 6 has an average price of $616,146, which is lower than the Ward 2 average of $906,298, and this lower price may explain the higher sales volume. I have also plotted a boxplot to understand the distribution of prices across the wards, and it can clearly be seen that Ward 2 is the costliest of all. Refer to the figure below.
Figure 11: Distribution of Price based on Wards.
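A per-ward summary like the one discussed above can be sketched with a grouped aggregation. The column names `WARD` and `PRICE` and the toy rows below are assumptions for illustration; the same pattern would apply to quadrants.

```python
import pandas as pd

# Toy data; ward labels and prices are made up.
df = pd.DataFrame({
    "WARD": ["Ward 6", "Ward 6", "Ward 2", "Ward 6", "Ward 2"],
    "PRICE": [600000.0, 650000.0, 900000.0, 610000.0, 950000.0],
})

# Count, mean and median price per ward, costliest ward first.
ward_summary = (
    df.groupby("WARD")["PRICE"]
    .agg(num_sales="count", avg_price="mean", median_price="median")
    .sort_values("avg_price", ascending=False)
)
print(ward_summary)
```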
Price vs Style
Next, I wanted to check if the variable “Style” has any impact on my dependent variable, i.e. Price. So, I plotted a bar graph of Price vs Style and it can be observed that except for a few types there is not much difference in prices for different styles.
Figure 12: Number of Residentials (left-image) and Distribution of Price (right-image) based on Style.
Top 10 Residentials ordered by neighborhood
Based on the number of sales from 1992-2018, I have plotted a graph of the most preferred neighborhoods in the city. It can be inferred from the figure below that people prefer to live in Old City 1 compared to other neighborhoods.
Figure 13: Top 10 Residentials based on Neighborhood
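Ranking neighborhoods by number of sales can be sketched as below. The column name `ASSESSMENT_NBHD` and the toy rows are assumptions, not values taken from the real data.

```python
import pandas as pd

# Toy data; the column name and neighborhood labels are illustrative.
df = pd.DataFrame({
    "ASSESSMENT_NBHD": [
        "Old City 1", "Old City 1", "Petworth",
        "Old City 1", "Petworth", "Brookland",
    ]
})

# value_counts() sorts by frequency, so head(10) gives the top 10.
top_neighborhoods = df["ASSESSMENT_NBHD"].value_counts().head(10)
print(top_neighborhoods)
```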
Data Preprocessing
Dealing with many dirty features is always a challenge. This section focuses on data preprocessing, which is divided into six parts.
- Checking for Null Values
- Checking for outliers in PRICE column which is my target variable
- Normalize numerical values
- Handling Categorical values
- Feature Selection (finding optimal columns to build the model)
- Splitting data into train set and test set
Checking for Null Values
The NULL is the term used to represent a missing value. A NULL value in a table is a value in a field that appears to be blank. Most data science algorithms do not tolerate nulls (missing values). So, one must do something to eliminate them, before or while analyzing a data set.
Using the isnull() function, I checked for null values in the dataset and found none.
Figure 15: Checking for missing values
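A null check of this kind can be sketched as follows. The toy frame deliberately contains one missing value so the output is visible; the real dataset had none.

```python
import numpy as np
import pandas as pd

# Toy data with one deliberate missing value (unlike the real dataset).
df = pd.DataFrame({
    "PRICE": [500000.0, np.nan, 700000.0],
    "BEDRM": [3, 2, 4],
})

# Number of missing values per column.
null_counts = df.isnull().sum()

# Single flag: does the frame contain any missing value at all?
has_nulls = bool(df.isnull().values.any())
print(null_counts)
print(has_nulls)
```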
Checking for outliers in “PRICE” which is our target variable
Outliers are extreme values that deviate from the other observations in the data. They may indicate variability in measurement, experimental errors, or a novelty. In other words, an outlier is an observation that diverges from the overall pattern of a sample. Outliers do not harm the dataset itself, but they can distort the learning task performed on it, so I chose to eliminate them. The left-hand side of the figure below shows the check for outliers, and the right-hand side shows the data after removing them.
Figure 16: Checking for outliers (left-image) and removing them (right-image)
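The text does not state which rule was used to remove the outliers. One common option is the interquartile range (IQR) rule, sketched below as an assumption and not as the method actually used; the `PRICE` values are made up.

```python
import pandas as pd

# Toy data with one extreme value.
df = pd.DataFrame({
    "PRICE": [200000, 250000, 300000, 350000, 400000, 450000, 500000, 25000000]
})

# IQR rule (an assumed choice): keep values within 1.5 * IQR of the quartiles.
q1 = df["PRICE"].quantile(0.25)
q3 = df["PRICE"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

df_clean = df[df["PRICE"].between(lower, upper)]
print(len(df), "->", len(df_clean))
```

The 1.5 multiplier is a convention; a larger value keeps more of the expensive properties in the data.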
Normalization
Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. If normalization is not performed, features with larger values can dominate the model. I have used MinMaxScaler() to normalize the numerical variables. This estimator scales and translates each feature individually so that it falls within a given range on the training set, here between 0 and 1.
Figure 17: Normalizing Numerical variables
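Min-max scaling maps each value x to (x - min) / (max - min), where min and max come from the training data. A minimal sketch with scikit-learn's MinMaxScaler follows; the feature values are made up, and fitting on the training split only is my assumption about how the step was applied.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy numeric features; values are illustrative.
X_train = pd.DataFrame({"GBA": [800.0, 1500.0, 3000.0], "BEDRM": [1.0, 3.0, 5.0]})
X_test = pd.DataFrame({"GBA": [1900.0], "BEDRM": [2.0]})

# Learn min and max from the training data only, then reuse them on
# the test data so no information leaks from the test set.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled)
print(X_test_scaled)
```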
Handling Categorical values
Categorical variables take on values that are names or labels. Many machine learning algorithms accept only numerical data as input. To meet this requirement, I have performed label encoding on all the categorical variables, and the output is shown below.
Figure 18: Label encoding Categorical Variables
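Label encoding replaces each category with an integer code. The sketch below applies scikit-learn's LabelEncoder column by column; the column names and values are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical columns; names and values are made up.
df = pd.DataFrame({
    "QUADRANT": ["NW", "SE", "NE", "NW"],
    "STYLE": ["2 Story", "3 Story", "2 Story", "1 Story"],
})
categorical_cols = ["QUADRANT", "STYLE"]

# Fit one encoder per column and keep it, so the integer codes can be
# mapped back to the original labels later.
encoders = {}
for col in categorical_cols:
    encoder = LabelEncoder()
    df[col] = encoder.fit_transform(df[col])
    encoders[col] = encoder
print(df)
```

The codes follow the sorted order of the labels, which implies an ordering that may not be meaningful for nominal categories such as quadrant.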
Feature Selection
Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested. Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression. Therefore, I have analyzed the relationship of “Price” with other features by using the heatmap of the correlation matrix and selected the top 10 features that affect the “Price”.
Figure 19: Code for feature selection
Figure 20: a heatmap of the correlation matrix
Among the top 10 selected features, the variables Latitude and Y carry the same information. Therefore, I am dropping Y and keeping only Latitude.
The top features are -
| Feature | Description |
| --- | --- |
| EYB | Year an improvement was built, more recent than the actual year built |
| GRADE | Structural grade |
| FIREPLACES | Number of fireplaces |
| BATHRM | Number of full bathrooms |
| GBA | Gross building area in square feet |
| CNDTN | Condition |
| LATITUDE | Latitude (location of the area) |
| HF_BATHRM | Number of half bathrooms |
| BEDRM | Number of bedrooms |
Table 1: Top features to build the model
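Selecting features by their correlation with the target can be sketched as follows. The data is synthetic, `k = 2` stands in for the top 10 used above, and ranking by absolute correlation is my assumption about the selection rule.

```python
import numpy as np
import pandas as pd

# Synthetic data: PRICE depends strongly on GBA, less on BATHRM,
# and not at all on NOISE.
rng = np.random.default_rng(0)
n = 200
gba = rng.uniform(500, 4000, n)
bathrm = rng.integers(1, 5, n)
df = pd.DataFrame({
    "PRICE": 300 * gba + 150000 * bathrm + rng.normal(0, 50000, n),
    "GBA": gba,
    "BATHRM": bathrm,
    "NOISE": rng.normal(size=n),
})

# Correlation of every column with the target, excluding the target itself.
corr_with_price = df.corr()["PRICE"].drop("PRICE")

# Keep the k features with the largest absolute correlation.
k = 2
top_features = corr_with_price.abs().sort_values(ascending=False).head(k).index.tolist()
print(top_features)
```

The full matrix returned by `df.corr()` is what a heatmap would display. Correlation only captures linear relationships, so a feature with a nonlinear effect on price can be missed.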
Train-Test Split
I have split the data into two subsets: training and testing data. I have kept 70% of the data for training and 30% of the data for testing. I will fit the model on the train data and make predictions on the test data.
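A 70/30 split of this kind can be sketched with scikit-learn's train_test_split. The arrays are placeholders, and `random_state=42` is an arbitrary value chosen here only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 samples with 2 features each.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```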
Building the model
Regression Models
I chose to use linear regression, SVR, regression tree, random forest, and gradient boosting to gauge which model would give me the best predictions for this problem.
Target variable (y) – PRICE
Features (X) – GBA, FIREPLACES, EYB, BATHRM, LANDAREA, BEDRM, GRADE, HF_BATHRM, LATITUDE.
I created a loop to run through each model, fit it on the training data, and make predictions. I then plotted the R-squared value obtained from each model to indicate how good a fit it is.
Figure 21: R-2 scores for all models
Table 2: Models and their respective r2-score
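Inference: The Random Forest Regressor outperforms all other models, with an R-squared score of 91.6% on the training data and about 53% on the test data. The large gap between the two scores suggests that the model overfits the training data.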
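The model comparison loop can be sketched as below. The data is synthetic and the hyperparameters are library defaults or arbitrary choices, so the scores will not match the ones reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(0, 0.1, 300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "SVR": SVR(),
    "Regression Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model and record R-squared on both splits; for regressors,
# score() returns the coefficient of determination.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train_r2": model.score(X_train, y_train),
        "test_r2": model.score(X_test, y_test),
    }

for name, result in scores.items():
    print(f"{name}: train={result['train_r2']:.3f}, test={result['test_r2']:.3f}")
```

Reporting both scores side by side makes overfitting visible: a model that scores far higher on the training split than on the test split is memorizing the training data.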
Neural-Network
“A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. Neural networks can adapt to changing input; so, the network generates the best possible result without needing to redesign the output criteria.[2]”
I have done the following preprocessing before building the model –
- Read CSV and convert the categorical data by label encoding
- Split data into features (X) and target (Y)
I have built a Sequential neural network with three Dense hidden layers of 9, 10, and 5 neurons respectively, all using the ReLU (Rectified Linear Unit) activation function, followed by one more Dense layer as the output layer. I have used the mean squared error loss function to gauge the performance of the model.
Figure 22: Sequential Neural Network
I have then used a test sample by extracting the values from the first row and tried to predict the price of the property. The predicted value is close to the original value.
Figure 23: Results for the model
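The architecture described above can be sketched in Keras as follows. The number of input features (9), the Adam optimizer, the epoch count, and the synthetic data are all assumptions made for illustration; only the layer sizes, the ReLU activations, and the mean squared error loss come from the text.

```python
import numpy as np
from tensorflow import keras

n_features = 9  # assumption: one input per selected feature

# Three hidden Dense layers (9, 10, 5 neurons) with ReLU, then a single
# linear output neuron for the predicted price.
model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(9, activation="relu"),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(5, activation="relu"),
    keras.layers.Dense(1),
])

# The optimizer is an assumed choice; the loss is mean squared error.
model.compile(optimizer="adam", loss="mean_squared_error")

# Synthetic data standing in for the scaled features and target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(64, n_features)).astype("float32")
y = X.sum(axis=1, keepdims=True)

model.fit(X, y, epochs=5, batch_size=16, verbose=0)

# Predict the target for the first row, as in the test sample above.
first_row_prediction = model.predict(X[:1], verbose=0)
print(first_row_prediction)
```

Predicting on a row the model was trained on shows only that it can fit that sample; a held-out test set gives a fairer picture of its accuracy.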
Future work
- Predict the area of Washington DC where people would prefer to live, framed as a classification problem.
- Research more on feature engineering
- Apply clustering analysis to create new features
- Use different feature selection methods for different models.
- Build models with more advanced regression techniques such as Bayesian Ridge or Elastic Net regression.
References
- https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e
- https://www.investopedia.com/terms/n/neuralnetwork.asp
- https://www.datacamp.com/community/tutorials/categorical-data
- https://pbpython.com/categorical-encoding.html
- https://plotly.com/python/bar-charts/
- https://www.youtube.com/watch?v=gHXy-qerHj4