Prediction of profitable zipcodes consisting 2 bedroom properties for investment using Airbnb and Zillow data

Comments · 1942 Views

As a consultant to the real estate company, here is the task of analyzing the properties in the New York City for short term rental options and suggesting the top zip codes to invest in. The company has already concluded that two bedroom properties are the best to invest in as its profitab

Data Source

There are two datasets provided for the analysis:

  1. Revenue data (Airbnb) - Detailed information about the properties situated in New York City.
  2. Cost data (Zillow) - Estimate for two-bedroom properties.

Assumptions

Here are some assumptions that are taken in this analysis:

  • Occupancy rate of 75% for the properties in the New York City.
  • The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).
  • The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).
  • All properties and all square feet within each locale can be assumed to be homogeneous (i.e. a 1000 square foot property in a locale such as Bronx or Manhattan generates twice the revenue and costs twice as much as any other 500 square foot property within that same locale.)
  • Cleaning fees that are charged are always in excess of how much it actually costs for cleaning. 25% of the cleaning fees is considered in calculating annual revenue.

 

Approach for analysis

 

1. Understanding the data and Data Preparation

In order to filter out the most profitable zipcodes to invest in, I have shortlisted the relevant features from the dataset. The features which were relevant to the houses have been considered. Here is the short description of what the features mean

  • id : Identifier used by AirbnB for the listing
  • neighbourhood_group_cleansed : Name of the area where the property is located.
  • state : State where the property is located.
  • zipcode : Zip code where the property is located.
  • property_type : Indicates the type of house is this is considered. Examples are house, apartment, townhomes, etc...
  • room_type : Indicates specific rooms available for rent and/or the entire home.
  • bedrooms : Indicates the number of bedrooms within the property.
  • bed_type : Indicates the type of bed the property has. ie. futon, real bed, airbed, etc.
  • latitude : The angular distance of a place north or south of the earth's equator, expressed in degrees and minutes.
  • longitude : The angular distance of a place east or west of the meridian at Greenwich, England, expressed in degrees and minutes.
  • square_feet : Square footage of the property or space for rent.
  • price : Price the host is charging to stay per night.
  • cleaning_fee : Price the host is charging to clean up after your stay.
  • review_scores_rating : Overall score given based on accuracy, cleanliness, check-in, communication, location, and value.

Feature Extraction

Airbnb Data Overview

  • 'zipcode', 'price' and 'cleaning_fee' are shown are categorical in this dataset.
  • price and cleaning_fee should be converted to numerical datatypes

Missing Values

Inconsistent zipcodes

  • Both Float and String Data types
  • Greater than 5 digits
  • Resulting in Improper join

Correcting zipcodes

  • Converting to string datatype
  • Extracting first 5 digits

Irregularities in price cleaning fee variables

  • Removing unwanted characters such as ‘ $ ’, ‘ , ’
  • Converting to float datatypes

Price distribution

  • Highly right skewed distribution
  • Presence of outliers

  • Right skewed distribution

Removing Outliers filtering relevant data

  • Restricting dataset with rental prices less than $1,000
  • Filtering 2 bedrooms rental houses

Zillow Data Overview

Data type conversion

  • Converting RegionName (zipcode) from integer to string
  • Renaming ‘RegionName’ to ‘zipcode’ and ‘2017-06’ to ‘Median_price’

Join of Airbnb and Zillow datasets

  • Left join
  • Presence of high missing values in Median price

Strategies to tackle missing values of median price

We can see that only 1538 values of Median_price are present out of 6438 values in this dataset. That means only 1/4th of the data is present for Median price. The following things could be done to tackle this problem:

  1. Linear Regression - This could be done by training the data on 1/4th of the data and finding the values for the remaining records. However, this would result in very high error as the training dataset is very limited and the testing dataset is 4 times the size.
  2. KNN Imputation - This could also be an option as we can assign the median price of the neighboring locality using the geo coordinates. However, this could also result in high errors as there are some areas where we dont have data at all. For example, there are no median price readings for the houses in Bronx neighborhood.
  3. Inner join - Instead of left joining the airbnb_set and zillow_set dataset, we can use inner join. Though the dataset will be reduced to 1/4th of it, but there will be no erroneous calculation made.

Inner join

  • High number of missing values in ‘review_scores_rating’ and ‘cleaning_fee’ features

Imputation of cleaning fee

  • Grouping by zipcodes and taking mean cleaning fee to impute.

Discrepancy of neighbourhood

Dropping irrelevant columns

  • Dropped ‘bedrooms’, ‘state’ and ‘review_scores_rating’

2. Exploratory data analysis (EDA)

Analysis of properties in the neighbourhood

  • Manhattan has highest number of properties, followed by Brooklyn, Staten Island and Queens
  • Manhattan has the higher median rental price, followed by Brooklyn, Queens and Staten Island.
  • The interquartile range is also greatest for Manhattan, indicating that there is a lot of variation in the rental prices of properties located in Manhattan. Staten Island on the other end, has the least interquartile range, indicating less variation in the rental prices of properties.

Rental Price Distribution

  • Rental price distribution is right skewed

Rental price in various property types

  • Mean rental price is highest in serviced apartment, loft and condominium.
  • Bungalow has the lowest mean rental price.

Trend between rental price and cleaning fees

3. Solutions

Approach 1: Breakeven Analysis

  • To differentiate the zipcodes that would be profitable to invest in, we need to do breakeven period analysis. Through this, we will basically be calculating the amount of time that would be required to earn back the amount of money we invested in

  • Investment will be the amount of money invested to buy the house. We will be using Median price of the 2 Bedroom houses of the respective zipcode in the formula.
  • Earnings will be the from rent and cleaning fees. We are assuming the occupancy rate to be 75% here. Also, the cleaning fees is included taking into consideration that the owner will be charging extra and not all of it will be going towards cleaning of the apartment. We are taking into consideration that 25% of cleaning fees will be accounted in earnings

Annual Revenue distribution across zipcodes

  • Annual Revenue follows a normal distribution for most of the zipcodes.
  • This means that if we decide to use any machine learning algorithm using this data at a later stage, then the errors will be minimal.

Top zipcodes using BE Analysis

Mean breakeven time across zipcodes

  • Breakeven time sorted in ascending order for zipcodes
  • 11003, 10306,10303, 11434 and 10304 are top 5 zipcodes with least breakeven time

Breakeven Analysis (Tableau)

  • The size of the bubble indicates the median price of the houses in the respective zipcodes.
  • The color of the bubble indicates how fast it would take to approximately breakeven.
  • Green indicates a faster breakeven while red indicates that it would take longer to breakeven.
  • Staten Island entirely consists of dark green bubbles faster breakeven, followed by Queens and Brooklyn.

  • We can see a positive correlation between price and breakeven period.
  •  Zipcodes with lower mean rental prices have faster breakeven period compared with properties with higher mean rental price.

 

Approach 2: Appreciation analysis

In this approach, we will be calculating the appreciation of every zipcode and then selecting the top zipcodes based on its value. The logic behind this approach is that, the zipcodes which gains its value faster will be desirable to invest in. This approach has some underlying assumptions such as:

  • The appreciation rate in the past will stay the same in the future also.
  • The real estate economy will stay the same as it was before. Therefore, there would be no economy decline or crisis in the nearby future.

The appreciation is caculated by the formula given below

Top Zipcodes using appreciation analysis

  • We can see that the zipcodes that we got in the approach 1 are completely different from the one we got in approach 2.
  • This might be due to property prices rising at a faster rate than the rent prices, therefore leading to different results.
  • According to this approach, the profitable zipcodes to invest in are 11217, 11231, 11215, 10128 and 11201

Approach 3: trade off between breakeven appreciation analysis

  • After undergoing both approach 1 and approach 2, we come to a conclusion that both the analysis are important and need to be considered equally.
  • Thus, there is going to be a tradeoff analysis, which means that zipcodes have to be considered equally based on breakeven, i.e. when they will start earning the profit and also based on the reasonable appreciation rate

Selection of zipcodes

Top zipcodes based on tradeoff analysis

  • According to the tradeoff analysis, the zipcodes enclosed in the box of 2 vertical and horizontal lines are the right to invest in.
  • This is because they have the right appreciation value along with breakeven period. The recommended zipcodes to invest in are 11434, 10305, 10303, 10304, 11003

4. Conclusion

  • According to me, approach 3 is the right way to decide on zipcodes to invest in. Therefore, following zipcodes should be considered for investment: 11434, 10305, 10303, 10304, 11003

 

Comments